| <?xml version="1.0"?> |
| |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| <document url="stat.html"> |
| <properties> |
| <title>The Commons Math User Guide - Statistics</title> |
| </properties> |
| <body> |
| <section name="1 Statistics"> |
| <subsection name="1.1 Overview"> |
| <p> |
| The statistics package provides frameworks and implementations for |
| basic Descriptive statistics, frequency distributions, bivariate regression, |
| and t-, chi-square and ANOVA test statistics. |
| </p> |
| <p> |
| <a href="#a1.2_Descriptive_statistics">Descriptive statistics</a><br/> |
| <a href="#a1.3_Frequency_distributions">Frequency distributions</a><br/> |
| <a href="#a1.4_Simple_regression">Simple Regression</a><br/> |
| <a href="#a1.5_Multiple_linear_regression">Multiple Regression</a><br/> |
| <a href="#a1.6_Rank_transformations">Rank transformations</a><br/> |
| <a href="#a1.7_Covariance_and_correlation">Covariance and correlation</a><br/> |
| <a href="#a1.8_Statistical_tests">Statistical Tests</a><br/> |
| </p> |
| </subsection> |
| <subsection name="1.2 Descriptive statistics"> |
| <p> |
| The stat package includes a framework and default implementations for |
| the following Descriptive statistics: |
| <ul> |
| <li>arithmetic and geometric means</li> |
| <li>variance and standard deviation</li> |
| <li>sum, product, log sum, sum of squared values</li> |
| <li>minimum, maximum, median, and percentiles</li> |
| <li>skewness and kurtosis</li> |
| <li>first, second, third and fourth moments</li> |
| </ul> |
| </p> |
| <p> |
| With the exception of percentiles and the median, all of these |
| statistics can be computed without maintaining the full list of input |
| data values in memory. The stat package provides interfaces and |
| implementations that do not require value storage as well as |
| implementations that operate on arrays of stored values. |
| </p> |
| <p> |
| The top level interface is |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/descriptive/UnivariateStatistic.html"> |
| UnivariateStatistic</a>. |
| This interface, implemented by all statistics, consists of |
| <code>evaluate()</code> methods that take double[] arrays as arguments |
| and return the value of the statistic. This interface is extended by |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/descriptive/StorelessUnivariateStatistic.html"> |
| StorelessUnivariateStatistic</a>, which adds <code>increment(),</code> |
| <code>getResult()</code> and associated methods to support |
| "storageless" implementations that maintain counters, sums or other |
| state information as values are added using the <code>increment()</code> |
| method. |
| </p> |
| <p> |
| Abstract implementations of the top level interfaces are provided in |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/descriptive/AbstractUnivariateStatistic.html"> |
| AbstractUnivariateStatistic</a> and |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/descriptive/AbstractStorelessUnivariateStatistic.html"> |
| AbstractStorelessUnivariateStatistic</a> respectively. |
| </p> |
| <p> |
| Each statistic is implemented as a separate class, in one of the |
| subpackages (moment, rank, summary) and each extends one of the abstract |
| classes above (depending on whether or not value storage is required to |
| compute the statistic). There are several ways to instantiate and use statistics. |
| Statistics can be instantiated and used directly, but it is generally more convenient |
| (and efficient) to access them using the provided aggregates, |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/descriptive/DescriptiveStatistics.html"> |
| DescriptiveStatistics</a> and |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/descriptive/SummaryStatistics.html"> |
| SummaryStatistics.</a> |
| </p> |
| <p> |
| <code>DescriptiveStatistics</code> maintains the input data in memory |
| and has the capability of producing "rolling" statistics computed from a |
| "window" consisting of the most recently added values. |
| </p> |
| <p> |
| <code>SummaryStatistics</code> does not store the input data values |
| in memory, so the statistics included in this aggregate are limited to those |
| that can be computed in one pass through the data without access to |
| the full array of values. |
| </p> |
| <p> |
| <table> |
| <tr><th>Aggregate</th><th>Statistics Included</th><th>Values stored?</th> |
| <th>"Rolling" capability?</th></tr><tr><td> |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/descriptive/DescriptiveStatistics.html"> |
| DescriptiveStatistics</a></td><td>min, max, mean, geometric mean, n, |
| sum, sum of squares, standard deviation, variance, percentiles, skewness, |
| kurtosis, median</td><td>Yes</td><td>Yes</td></tr><tr><td> |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/descriptive/SummaryStatistics.html"> |
| SummaryStatistics</a></td><td>min, max, mean, geometric mean, n, |
| sum, sum of squares, standard deviation, variance</td><td>No</td><td>No</td></tr> |
| </table> |
| </p> |
| <p> |
| <code>SummaryStatistics</code> can be aggregated using |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/descriptive/AggregateSummaryStatistics.html"> |
| AggregateSummaryStatistics.</a> This class can be used to concurrently |
| gather statistics for multiple datasets as well as for a combined sample |
| including all of the data. |
| </p> |
| <p> |
| <code>MultivariateSummaryStatistics</code> is similar to |
| <code>SummaryStatistics</code> but handles n-tuple values instead of |
| scalar values. It can also compute the full covariance matrix for the |
| input data. |
| </p> |
| <p> |
| Neither <code>DescriptiveStatistics</code> nor <code>SummaryStatistics</code> |
| is thread-safe. |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/descriptive/SynchronizedDescriptiveStatistics.html"> |
| SynchronizedDescriptiveStatistics</a> and |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/descriptive/SynchronizedSummaryStatistics.html"> |
| SynchronizedSummaryStatistics</a>, respectively, provide thread-safe |
| versions for applications that require concurrent access to statistical |
| aggregates by multiple threads. |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/descriptive/SynchronizedMultiVariateSummaryStatistics.html"> |
| SynchronizedMultivariateSummaryStatistics</a> provides thread-safe |
| <code>MultivariateSummaryStatistics.</code> |
| </p> |
| <p> |
| There is also a utility class, |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/StatUtils.html"> |
| StatUtils</a>, that provides static methods for computing statistics |
| directly from double[] arrays. |
| </p> |
| <p> |
| Here are some examples showing how to compute Descriptive statistics. |
| <dl> |
| <dt>Compute summary statistics for a list of double values</dt> |
| <br/> |
| <dd>Using the <code>DescriptiveStatistics</code> aggregate |
| (values are stored in memory): |
| <source> |
| // Get a DescriptiveStatistics instance |
| DescriptiveStatistics stats = new DescriptiveStatistics(); |
| |
| // Add the data from the array |
| for( int i = 0; i < inputArray.length; i++) { |
| stats.addValue(inputArray[i]); |
| } |
| |
| // Compute some statistics |
| double mean = stats.getMean(); |
| double std = stats.getStandardDeviation(); |
| double median = stats.getPercentile(50); |
| </source> |
| </dd> |
| <dd>Using the <code>SummaryStatistics</code> aggregate (values are |
| <strong>not</strong> stored in memory): |
| <source> |
| // Get a SummaryStatistics instance |
| SummaryStatistics stats = new SummaryStatistics(); |
| |
| // Read data from an input stream, |
| // adding values and updating sums, counters, etc. |
| while (line != null) { |
| line = in.readLine(); |
| stats.addValue(Double.parseDouble(line.trim())); |
| } |
| in.close(); |
| |
| // Compute the statistics |
| double mean = stats.getMean(); |
| double std = stats.getStandardDeviation(); |
| //double median = stats.getMedian(); <-- NOT AVAILABLE |
| </source> |
| </dd> |
| <dd>Using the <code>StatUtils</code> utility class: |
| <source> |
| // Compute statistics directly from the array |
| // assume values is a double[] array |
| double mean = StatUtils.mean(values); |
| double std = FastMath.sqrt(StatUtils.variance(values)); |
| double median = StatUtils.percentile(values, 50); |
| |
| // Compute the mean of the first three values in the array |
| mean = StatUtils.mean(values, 0, 3); |
| </source> |
| </dd> |
| <dt>Maintain a "rolling mean" of the most recent 100 values from |
| an input stream</dt> |
| <br/> |
| <dd>Use a <code>DescriptiveStatistics</code> instance with |
| window size set to 100 |
| <source> |
| // Create a DescriptiveStatistics instance and set the window size to 100 |
| DescriptiveStatistics stats = new DescriptiveStatistics(); |
| stats.setWindowSize(100); |
| |
| // Read data from an input stream, |
| // displaying the mean of the most recent 100 observations |
| // after every 100 observations |
| long nLines = 0; |
| while (line != null) { |
| line = in.readLine(); |
| nLines++; |
| stats.addValue(Double.parseDouble(line.trim())); |
| if (nLines == 100) { |
| nLines = 0; |
| System.out.println(stats.getMean()); |
| } |
| } |
| in.close(); |
| </source> |
| </dd> |
| <dt>Compute statistics in a thread-safe manner</dt> |
| <br/> |
| <dd>Use a <code>SynchronizedDescriptiveStatistics</code> instance |
| <source> |
| // Create a SynchronizedDescriptiveStatistics instance and |
| // use as any other DescriptiveStatistics instance |
| DescriptiveStatistics stats = new SynchronizedDescriptiveStatistics(); |
| </source> |
| </dd> |
| <dt>Compute statistics for multiple samples and overall statistics concurrently</dt> |
| <br/> |
| <dd>There are two ways to do this using <code>AggregateSummaryStatistics.</code> |
| The first is to use an <code>AggregateSummaryStatistics</code> instance |
| to accumulate overall statistics contributed by <code>SummaryStatistics</code> |
| instances created using |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/descriptive/AggregateSummaryStatistics.html#createContributingStatistics()"> |
| AggregateSummaryStatistics.createContributingStatistics()</a>: |
| <source> |
| // Create a AggregateSummaryStatistics instance to accumulate the overall statistics |
| // and AggregatingSummaryStatistics for the subsamples |
| AggregateSummaryStatistics aggregate = new AggregateSummaryStatistics(); |
| SummaryStatistics setOneStats = aggregate.createContributingStatistics(); |
| SummaryStatistics setTwoStats = aggregate.createContributingStatistics(); |
| // Add values to the subsample aggregates |
| setOneStats.addValue(2); |
| setOneStats.addValue(3); |
| setTwoStats.addValue(2); |
| setTwoStats.addValue(4); |
| ... |
| // Full sample data is reported by the aggregate |
| double totalSampleSum = aggregate.getSum(); |
| </source> |
| The above approach has the disadvantages that the <code>addValue</code> calls must be synchronized on the |
| <code>SummaryStatistics</code> instance maintained by the aggregate and each value addition updates the |
| aggregate as well as the subsample. For applications that can wait to do the aggregation until all values |
| have been added, a static |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/descriptive/AggregateSummaryStatistics.html#aggregate(java.util.Collection)"> |
| aggregate</a> method is available, as shown in the following example. |
| This method should be used when aggregation needs to be done across threads. |
| <source> |
| // Create SummaryStatistics instances for the subsample data |
| SummaryStatistics setOneStats = new SummaryStatistics(); |
| SummaryStatistics setTwoStats = new SummaryStatistics(); |
| // Add values to the subsample SummaryStatistics instances |
| setOneStats.addValue(2); |
| setOneStats.addValue(3); |
| setTwoStats.addValue(2); |
| setTwoStats.addValue(4); |
| ... |
| // Aggregate the subsample statistics |
| Collection<SummaryStatistics> aggregate = new ArrayList<SummaryStatistics>(); |
| aggregate.add(setOneStats); |
| aggregate.add(setTwoStats); |
| StatisticalSummary aggregatedStats = AggregateSummaryStatistics.aggregate(aggregate); |
| |
| // Full sample data is reported by aggregatedStats |
| double totalSampleSum = aggregatedStats.getSum(); |
| </source> |
| </dd> |
| </dl> |
| </p> |
| </subsection> |
| <subsection name="1.3 Frequency distributions"> |
| <p> |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/Frequency.html"> |
| Frequency</a> |
| provides a simple interface for maintaining counts and percentages of discrete |
| values. |
| </p> |
| <p> |
| Strings, integers, longs and chars are all supported as value types, |
| as well as instances of any class that implements <code>Comparable.</code> |
| The ordering of values used in computing cumulative frequencies is by |
| default the <i>natural ordering,</i> but this can be overridden by supplying a |
| <code>Comparator</code> to the constructor. Adding values that are not |
| comparable to those that have already been added results in an |
| <code>IllegalArgumentException.</code> |
| </p> |
| <p> |
| Here are some examples. |
| <dl> |
| <dt>Compute a frequency distribution based on integer values</dt> |
| <br/> |
| <dd>Mixing integers, longs, Integers and Longs: |
| <source> |
| Frequency f = new Frequency(); |
| f.addValue(1); |
| f.addValue(new Integer(1)); |
| f.addValue(new Long(1)); |
| f.addValue(2); |
| f.addValue(new Integer(-1)); |
| System.out.prinltn(f.getCount(1)); // displays 3 |
| System.out.println(f.getCumPct(0)); // displays 0.2 |
| System.out.println(f.getPct(new Integer(1))); // displays 0.6 |
| System.out.println(f.getCumPct(-2)); // displays 0 |
| System.out.println(f.getCumPct(10)); // displays 1 |
| </source> |
| </dd> |
| <dt>Count string frequencies</dt> |
| <br/> |
| <dd>Using case-sensitive comparison, alpha sort order (natural comparator): |
| <source> |
| Frequency f = new Frequency(); |
| f.addValue("one"); |
| f.addValue("One"); |
| f.addValue("oNe"); |
| f.addValue("Z"); |
| System.out.println(f.getCount("one")); // displays 1 |
| System.out.println(f.getCumPct("Z")); // displays 0.5 |
| System.out.println(f.getCumPct("Ot")); // displays 0.25 |
| </source> |
| </dd> |
| <dd>Using case-insensitive comparator: |
| <source> |
| Frequency f = new Frequency(String.CASE_INSENSITIVE_ORDER); |
| f.addValue("one"); |
| f.addValue("One"); |
| f.addValue("oNe"); |
| f.addValue("Z"); |
| System.out.println(f.getCount("one")); // displays 3 |
| System.out.println(f.getCumPct("z")); // displays 1 |
| </source> |
| </dd> |
| </dl> |
| </p> |
| </subsection> |
| <subsection name="1.4 Simple regression"> |
| <p> |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/regression/SimpleRegression.html"> |
| SimpleRegression</a> provides ordinary least squares regression with |
| one independent variable estimating the linear model: |
| </p> |
| <p> |
| <code> y = intercept + slope * x </code> |
| </p> |
| <p> |
| or |
| </p> |
| <p> |
| <code> y = slope * x </code> |
| </p> |
| <p> |
| Standard errors for <code>intercept</code> and <code>slope</code> are |
| available as well as ANOVA, r-square and Pearson's r statistics. |
| </p> |
| <p> |
| Observations (x,y pairs) can be added to the model one at a time or they |
| can be provided in a 2-dimensional array. The observations are not stored |
| in memory, so there is no limit to the number of observations that can be |
| added to the model. |
| </p> |
| <p> |
| <strong>Usage Notes</strong>: <ul> |
| <li> When there are fewer than two observations in the model, or when |
| there is no variation in the x values (i.e. all x values are the same) |
| all statistics return <code>NaN</code>. At least two observations with |
| different x coordinates are required to estimate a bivariate regression |
| model.</li> |
| <li> getters for the statistics always compute values based on the current |
| set of observations -- i.e., you can get statistics, then add more data |
| and get updated statistics without using a new instance. There is no |
| "compute" method that updates all statistics. Each of the getters performs |
| the necessary computations to return the requested statistic.</li> |
| <li> The intercept term may be suppressed by passing <code>false</code> to the |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/regression/SimpleRegression.html#SimpleRegression(boolean)"> |
| SimpleRegression(boolean)</a> constructor. When the <code>hasIntercept</code> |
| property is false, the model is estimated without a constant term and |
| <code>getIntercept()</code> returns <code>0</code>.</li> |
| </ul> |
| </p> |
| <p> |
| <strong>Implementation Notes</strong>: <ul> |
| <li> As observations are added to the model, the sum of x values, y values, |
| cross products (x times y), and squared deviations of x and y from their |
| respective means are updated using updating formulas defined in |
| "Algorithms for Computing the Sample Variance: Analysis and |
| Recommendations", Chan, T.F., Golub, G.H., and LeVeque, R.J. |
| 1983, American Statistician, vol. 37, pp. 242-247, referenced in |
| Weisberg, S. "Applied Linear Regression". 2nd Ed. 1985. All regression |
| statistics are computed from these sums.</li> |
| <li> Inference statistics (confidence intervals, parameter significance levels) |
| are based on on the assumption that the observations included in the model are |
| drawn from a <a href="http://mathworld.wolfram.com/BivariateNormalDistribution.html"> |
| Bivariate Normal Distribution</a></li> |
| </ul> |
| </p> |
| <p> |
| Here are some examples. |
| <dl> |
| <dt>Estimate a model based on observations added one at a time</dt> |
| <dd>Instantiate a regression instance and add data points |
| <source> |
| regression = new SimpleRegression(); |
| regression.addData(1d, 2d); |
| // At this point, with only one observation, |
| // all regression statistics will return NaN |
| |
| regression.addData(3d, 3d); |
| // With only two observations, |
| // slope and intercept can be computed |
| // but inference statistics will return NaN |
| |
| regression.addData(3d, 3d); |
| // Now all statistics are defined. |
| </source> |
| </dd> |
| <dd>Compute some statistics based on observations added so far |
| <source> |
| System.out.println(regression.getIntercept()); |
| // displays intercept of regression line |
| |
| System.out.println(regression.getSlope()); |
| // displays slope of regression line |
| |
| System.out.println(regression.getSlopeStdErr()); |
| // displays slope standard error |
| </source> |
| </dd> |
| <dd>Use the regression model to predict the y value for a new x value |
| <source> |
| System.out.println(regression.predict(1.5d) |
| // displays predicted y value for x = 1.5 |
| </source> |
| More data points can be added and subsequent getXxx calls will incorporate |
| additional data in statistics. |
| </dd> |
| <br/> |
| <dt>Estimate a model from a double[][] array of data points</dt> |
| <dd>Instantiate a regression object and load dataset |
| <source> |
| double[][] data = { { 1, 3 }, {2, 5 }, {3, 7 }, {4, 14 }, {5, 11 }}; |
| SimpleRegression regression = new SimpleRegression(); |
| regression.addData(data); |
| </source> |
| </dd> |
| <dd>Estimate regression model based on data |
| <source> |
| System.out.println(regression.getIntercept()); |
| // displays intercept of regression line |
| |
| System.out.println(regression.getSlope()); |
| // displays slope of regression line |
| |
| System.out.println(regression.getSlopeStdErr()); |
| // displays slope standard error |
| </source> |
| More data points -- even another double[][] array -- can be added and subsequent |
| getXxx calls will incorporate additional data in statistics. |
| </dd> |
| <br/> |
| <dt>Estimate a model from a double[][] array of data points, <em>excluding</em> the intercept</dt> |
| <dd>Instantiate a regression object and load dataset |
| <source> |
| double[][] data = { { 1, 3 }, {2, 5 }, {3, 7 }, {4, 14 }, {5, 11 }}; |
| SimpleRegression regression = new SimpleRegression(false); |
| //the argument, false, tells the class not to include a constant |
| regression.addData(data); |
| </source> |
| </dd> |
| <dd>Estimate regression model based on data |
| <source> |
| System.out.println(regression.getIntercept()); |
| // displays intercept of regression line, since we have constrained the constant, 0.0 is returned |
| |
| System.out.println(regression.getSlope()); |
| // displays slope of regression line |
| |
| System.out.println(regression.getSlopeStdErr()); |
| // displays slope standard error |
| |
| System.out.println(regression.getInterceptStdErr() ); |
| // will return Double.NaN, since we constrained the parameter to zero |
| </source> |
| Caution must be exercised when interpreting the slope when no constant is being estimated. The slope |
| may be biased. |
| </dd> |
| |
| </dl> |
| </p> |
| </subsection> |
| <subsection name="1.5 Multiple linear regression"> |
| <p> |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/regression/OLSMultipleLinearRegression.html"> |
| OLSMultipleLinearRegression</a> and |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/regression/GLSMultipleLinearRegression.html"> |
| GLSMultipleLinearRegression</a> provide least squares regression to fit the linear model: |
| </p> |
| <p> |
| <code> Y=X*b+u </code> |
| </p> |
| <p> |
| where Y is an n-vector <b>regressand</b>, X is a [n,k] matrix whose k columns are called |
| <b>regressors</b>, b is k-vector of <b>regression parameters</b> and u is an n-vector |
| of <b>error terms</b> or <b>residuals</b>. |
| </p> |
| <p> |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/regression/OLSMultipleLinearRegression.html"> |
| OLSMultipleLinearRegression</a> provides Ordinary Least Squares Regression, and |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/regression/GLSMultipleLinearRegression.html"> |
| GLSMultipleLinearRegression</a> implements Generalized Least Squares. See the javadoc for these |
| classes for details on the algorithms and formulas used. |
| </p> |
| <p> |
| Data for OLS models can be loaded in a single double[] array, consisting of concatenated rows of data, each containing |
| the regressand (Y) value, followed by regressor values; or using a double[][] array with rows corresponding to |
| observations. GLS models also require a double[][] array representing the covariance matrix of the error terms. See |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/regression/AbstractMultipleLinearRegression.html#newSampleData(double[], int, int)"> |
| AbstractMultipleLinearRegression#newSampleData(double[],int,int)</a>, |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/regression/OLSMultipleLinearRegression.html#newSampleData(double[], double[][])"> |
| OLSMultipleLinearRegression#newSampleData(double[], double[][])</a> and |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/regression/GLSMultipleLinearRegression.html#newSampleData(double[], double[][], double[][])"> |
| GLSMultipleLinearRegression#newSampleData(double[],double[][],double[][])</a> for details. |
| </p> |
| <p> |
| <strong>Usage Notes</strong>: <ul> |
| <li> Data are validated when invoking any of the newSample, newX, newY or newCovariance methods and |
| <code>IllegalArgumentException</code> is thrown when input data arrays do not have matching dimensions |
| or do not contain sufficient data to estimate the model. |
| </li> |
| <li> By default, regression models are estimated with intercept terms. In the notation above, this implies that the |
| X matrix contains an initial row identically equal to 1. X data supplied to the newX or newSample methods should not |
| include this column - the data loading methods will create it automatically. To estimate a model without an intercept |
| term, set the <code>noIntercept</code> property to <code>true.</code></li> |
| </ul> |
| </p> |
| <p> |
| Here are some examples. |
| <dl> |
| <dt>OLS regression</dt> |
| <br/> |
| <dd>Instantiate an OLS regression object and load a dataset: |
| <source> |
| OLSMultipleLinearRegression regression = new OLSMultipleLinearRegression(); |
| double[] y = new double[]{11.0, 12.0, 13.0, 14.0, 15.0, 16.0}; |
| double[][] x = new double[6][]; |
| x[0] = new double[]{0, 0, 0, 0, 0}; |
| x[1] = new double[]{2.0, 0, 0, 0, 0}; |
| x[2] = new double[]{0, 3.0, 0, 0, 0}; |
| x[3] = new double[]{0, 0, 4.0, 0, 0}; |
| x[4] = new double[]{0, 0, 0, 5.0, 0}; |
| x[5] = new double[]{0, 0, 0, 0, 6.0}; |
| regression.newSampleData(y, x); |
| </source> |
| </dd> |
| <dd>Get regression parameters and diagnostics: |
| <source> |
| double[] beta = regression.estimateRegressionParameters(); |
| |
| double[] residuals = regression.estimateResiduals(); |
| |
| double[][] parametersVariance = regression.estimateRegressionParametersVariance(); |
| |
| double regressandVariance = regression.estimateRegressandVariance(); |
| |
| double rSquared = regression.calculateRSquared(); |
| |
| double sigma = regression.estimateRegressionStandardError(); |
| </source> |
| </dd> |
| <dt>GLS regression</dt> |
| <br/> |
| <dd>Instantiate a GLS regression object and load a dataset: |
| <source> |
| GLSMultipleLinearRegression regression = new GLSMultipleLinearRegression(); |
| double[] y = new double[]{11.0, 12.0, 13.0, 14.0, 15.0, 16.0}; |
| double[][] x = new double[6][]; |
| x[0] = new double[]{0, 0, 0, 0, 0}; |
| x[1] = new double[]{2.0, 0, 0, 0, 0}; |
| x[2] = new double[]{0, 3.0, 0, 0, 0}; |
| x[3] = new double[]{0, 0, 4.0, 0, 0}; |
| x[4] = new double[]{0, 0, 0, 5.0, 0}; |
| x[5] = new double[]{0, 0, 0, 0, 6.0}; |
| double[][] omega = new double[6][]; |
| omega[0] = new double[]{1.1, 0, 0, 0, 0, 0}; |
| omega[1] = new double[]{0, 2.2, 0, 0, 0, 0}; |
| omega[2] = new double[]{0, 0, 3.3, 0, 0, 0}; |
| omega[3] = new double[]{0, 0, 0, 4.4, 0, 0}; |
| omega[4] = new double[]{0, 0, 0, 0, 5.5, 0}; |
| omega[5] = new double[]{0, 0, 0, 0, 0, 6.6}; |
| regression.newSampleData(y, x, omega); |
| </source> |
| </dd> |
| </dl> |
| </p> |
| </subsection> |
| <subsection name="1.6 Rank transformations"> |
| <p> |
| Some statistical algorithms require that input data be replaced by ranks. |
| The <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/ranking/package-summary.html"> |
| org.apache.commons.math4.stat.ranking</a> package provides rank transformation. |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/ranking/RankingAlgorithm.html"> |
| RankingAlgorithm</a> defines the interface for ranking. |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/ranking/NaturalRanking.html"> |
| NaturalRanking</a> provides an implementation that has two configuration options. |
| <ul> |
| <li><a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/ranking/TiesStrategy.html"> |
| Ties strategy</a> deterimines how ties in the source data are handled by the ranking</li> |
| <li><a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/ranking/NaNStrategy.html"> |
| NaN strategy</a> determines how NaN values in the source data are handled.</li> |
| </ul> |
| </p> |
| <p> |
| Examples: |
| <source> |
| NaturalRanking ranking = new NaturalRanking(NaNStrategy.MINIMAL, |
| TiesStrategy.MAXIMUM); |
| double[] data = { 20, 17, 30, 42.3, 17, 50, |
| Double.NaN, Double.NEGATIVE_INFINITY, 17 }; |
| double[] ranks = ranking.rank(exampleData); |
| </source> |
| results in <code>ranks</code> containing <code>{6, 5, 7, 8, 5, 9, 2, 2, 5}.</code> |
| <source> |
| new NaturalRanking(NaNStrategy.REMOVED,TiesStrategy.SEQUENTIAL).rank(exampleData); |
| </source> |
| returns <code>{5, 2, 6, 7, 3, 8, 1, 4}.</code> |
| </p> |
| <p> |
| The default <code>NaNStrategy</code> is NaNStrategy.MAXIMAL. This makes <code>NaN</code> |
| values larger than any other value (including <code>Double.POSITIVE_INFINITY</code>). The |
| default <code>TiesStrategy</code> is <code>TiesStrategy.AVERAGE,</code> which assigns tied |
| values the average of the ranks applicable to the sequence of ties. See the |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/ranking/NaturalRanking.html"> |
| NaturalRanking</a> for more examples and <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/ranking/TiesStrategy.html"> |
| TiesStrategy</a> and <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/ranking/NaNStrategy.html">NaNStrategy</a> |
| for details on these configuration options. |
| </p> |
| </subsection> |
| <subsection name="1.7 Covariance and correlation"> |
| <p> |
| The <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/correlation/package-summary.html"> |
| org.apache.commons.math4.stat.correlation</a> package computes covariances |
| and correlations for pairs of arrays or columns of a matrix. |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/correlation/Covariance.html"> |
| Covariance</a> computes covariances, |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/correlation/PearsonsCorrelation.html"> |
| PearsonsCorrelation</a> provides Pearson's Product-Moment correlation coefficients, |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/correlation/SpearmansCorrelation.html"> |
| SpearmansCorrelation</a> computes Spearman's rank correlation and |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/correlation/KendallsCorrelation.html"> |
| KendallsCorrelation</a> computes Kendall's tau rank correlation. |
| </p> |
| <p> |
| <strong>Implementation Notes</strong> |
| <ul> |
| <li> |
| Unbiased covariances are given by the formula <br/> |
| <code>cov(X, Y) = sum [(x<sub>i</sub> - E(X))(y<sub>i</sub> - E(Y))] / (n - 1)</code> |
| where <code>E(X)</code> is the mean of <code>X</code> and <code>E(Y)</code> |
| is the mean of the <code>Y</code> values. Non-bias-corrected estimates use |
| <code>n</code> in place of <code>n - 1.</code> Whether or not covariances are |
| bias-corrected is determined by the optional parameter, "biasCorrected," which |
| defaults to <code>true.</code> |
| </li> |
| <li> |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/correlation/PearsonsCorrelation.html"> |
| PearsonsCorrelation</a> computes correlations defined by the formula <br/> |
| <code>cor(X, Y) = sum[(x<sub>i</sub> - E(X))(y<sub>i</sub> - E(Y))] / [(n - 1)s(X)s(Y)]</code><br/> |
| where <code>E(X)</code> and <code>E(Y)</code> are means of <code>X</code> and <code>Y</code> |
| and <code>s(X)</code>, <code>s(Y)</code> are standard deviations. |
| </li> |
| <li> |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/correlation/SpearmansCorrelation.html"> |
| SpearmansCorrelation</a> applies a rank transformation to the input data and computes Pearson's |
| correlation on the ranked data. The ranking algorithm is configurable. By default, |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/ranking/NaturalRanking.html"> |
| NaturalRanking</a> with default strategies for handling ties and NaN values is used. |
| </li> |
| <li> |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/correlation/KendallsCorrelation.html"> |
| KendallsCorrelation</a> computes the association between two measured quantities. A tau test |
| is a non-parametric hypothesis test for statistical dependence based on the tau coefficient. |
| </li> |
| </ul> |
| </p> |
| <p> |
| <strong>Examples:</strong> |
| <dl> |
| <dt><strong>Covariance of 2 arrays</strong></dt> |
| <br/> |
| <dd>To compute the unbiased covariance between 2 double arrays, |
| <code>x</code> and <code>y</code>, use: |
| <source> |
| new Covariance().covariance(x, y) |
| </source> |
| For non-bias-corrected covariances, use |
| <source> |
| covariance(x, y, false) |
| </source> |
| </dd> |
| <br/> |
| <dt><strong>Covariance matrix</strong></dt> |
| <br/> |
| <dd> A covariance matrix over the columns of a source matrix <code>data</code> |
| can be computed using |
| <source> |
| new Covariance().computeCovarianceMatrix(data) |
| </source> |
| The i-jth entry of the returned matrix is the unbiased covariance of the ith and jth |
| columns of <code>data.</code> As above, to get non-bias-corrected covariances, |
| use |
| <source> |
| computeCovarianceMatrix(data, false) |
| </source> |
| </dd> |
| <br/> |
| <dt><strong>Pearson's correlation of 2 arrays</strong></dt> |
| <br/> |
| <dd>To compute the Pearson's product-moment correlation between two double arrays |
| <code>x</code> and <code>y</code>, use: |
| <source> |
| new PearsonsCorrelation().correlation(x, y) |
| </source> |
| </dd> |
| <br/> |
| <dt><strong>Pearson's correlation matrix</strong></dt> |
| <br/> |
| <dd> A (Pearson's) correlation matrix over the columns of a source matrix <code>data</code> |
| can be computed using |
| <source> |
| new PearsonsCorrelation().computeCorrelationMatrix(data) |
| </source> |
| The i-jth entry of the returned matrix is the Pearson's product-moment correlation between the |
| ith and jth columns of <code>data.</code> |
| </dd> |
| <br/> |
| <dt><strong>Pearson's correlation significance and standard errors</strong></dt> |
| <br/> |
| <dd> To compute standard errors and/or significances of correlation coefficients |
| associated with Pearson's correlation coefficients, start by creating a |
| <code>PearsonsCorrelation</code> instance |
| <source> |
| PearsonsCorrelation correlation = new PearsonsCorrelation(data); |
| </source> |
| where <code>data</code> is either a rectangular array or a <code>RealMatrix.</code> |
| Then the matrix of standard errors is |
| <source> |
| correlation.getCorrelationStandardErrors(); |
| </source> |
| The formula used to compute the standard error is <br/> |
| <code>SE<sub>r</sub> = ((1 - r<sup>2</sup>) / (n - 2))<sup>1/2</sup></code><br/> |
| where <code>r</code> is the estimated correlation coefficient and |
| <code>n</code> is the number of observations in the source dataset.<br/><br/> |
| <strong>p-values</strong> for the (2-sided) null hypotheses that elements of |
| a correlation matrix are zero populate the RealMatrix returned by |
| <source> |
| correlation.getCorrelationPValues() |
| </source> |
| <code>getCorrelationPValues().getEntry(i,j)</code> is the |
| probability that a random variable distributed as <code>t<sub>n-2</sub></code> takes |
| a value with absolute value greater than or equal to <br/> |
| <code>|r<sub>ij</sub>|((n - 2) / (1 - r<sub>ij</sub><sup>2</sup>))<sup>1/2</sup></code>, |
| where <code>r<sub>ij</sub></code> is the estimated correlation between the ith and jth |
| columns of the source array or RealMatrix. This is sometimes referred to as the |
| <i>significance</i> of the coefficient.<br/><br/> |
| For example, if <code>data</code> is a RealMatrix with 2 columns and 10 rows, then |
| <source> |
| new PearsonsCorrelation(data).getCorrelationPValues().getEntry(0,1) |
| </source> |
| is the significance of the Pearson's correlation coefficient between the two columns |
| of <code>data</code>. If this value is less than .01, we can say that the correlation |
| between the two columns of data is significant at the 99% level. |
| </dd> |
| <br/> |
| <dt><strong>Spearman's rank correlation coefficient</strong></dt> |
| <br/> |
| <dd>To compute the Spearman's rank-moment correlation between two double arrays |
| <code>x</code> and <code>y</code>: |
| <source> |
| new SpearmansCorrelation().correlation(x, y) |
| </source> |
| This is equivalent to |
| <source> |
| RankingAlgorithm ranking = new NaturalRanking(); |
| new PearsonsCorrelation().correlation(ranking.rank(x), ranking.rank(y)) |
| </source> |
| </dd> |
| <br/> |
| <dt><strong>Kendalls's tau rank correlation coefficient</strong></dt> |
| <br/> |
| <dd>To compute the Kendall's tau rank correlation between two double arrays |
| <code>x</code> and <code>y</code>: |
| <source> |
| new KendallsCorrelation().correlation(x, y) |
| </source> |
| </dd> |
| </dl> |
| </p> |
| </subsection> |
| <subsection name="1.8 Statistical tests"> |
| <p> |
| The <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/inference/"> |
| org.apache.commons.math4.stat.inference</a> package provides implementations for |
| <a href="http://www.itl.nist.gov/div898/handbook/prc/section2/prc22.htm"> |
| Student's t</a>, |
| <a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda35f.htm"> |
| Chi-Square</a>, |
| <a href="http://en.wikipedia.org/wiki/G-test">G Test</a>, |
| <a href="http://www.itl.nist.gov/div898/handbook/prc/section4/prc43.htm"> |
| One-Way ANOVA</a>, |
| <a href="http://www.itl.nist.gov/div898/handbook/prc/section3/prc35.htm"> |
| Mann-Whitney U</a>, |
| <a href="http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test"> |
| Wilcoxon signed rank</a> and |
| <a href="http://en.wikipedia.org/wiki/Binomial_test"> |
| Binomial</a> test statistics as well as |
| <a href="http://www.cas.lancs.ac.uk/glossary_v1.1/hyptest.html#pvalue"> |
| p-values</a> associated with <code>t-</code>, |
| <code>Chi-Square</code>, <code>G</code>, <code>One-Way ANOVA</code>, <code>Mann-Whitney U</code> |
| <code>Wilcoxon signed rank</code>, and <code>Kolmogorov-Smirnov</code> tests. |
| The respective test classes are |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/inference/TTest.html"> |
| TTest</a>, |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/inference/ChiSquareTest.html"> |
| ChiSquareTest</a>, |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/inference/GTest.html"> |
| GTest</a>, |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/inference/OneWayAnova.html"> |
| OneWayAnova</a>, |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/inference/MannWhitneyUTest.html"> |
| MannWhitneyUTest</a>, |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/inference/WilcoxonSignedRankTest.html"> |
| WilcoxonSignedRankTest</a>, |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/inference/BinomialTest.html"> |
| BinomialTest</a> and |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/inference/KolmogorovSmirnovTest.html"> |
| KolmogorovSmirnovTest</a>. |
| The <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/inference/TestUtils.html"> |
| TestUtils</a> class provides static methods to get test instances or |
| to compute test statistics directly. The examples below all use the |
| static methods in <code>TestUtils</code> to execute tests. To get |
| test object instances, either use e.g., <code>TestUtils.getTTest()</code> |
| or use the implementation constructors directly, e.g. <code>new TTest()</code>. |
| </p> |
| <p> |
| <strong>Implementation Notes</strong> |
| <ul> |
| <li>Both one- and two-sample t-tests are supported. Two sample tests |
| can be either paired or unpaired and the unpaired two-sample tests can |
| be conducted under the assumption of equal subpopulation variances or |
| without this assumption. When equal variances is assumed, a pooled |
| variance estimate is used to compute the t-statistic and the degrees |
| of freedom used in the t-test equals the sum of the sample sizes minus 2. |
| When equal variances is not assumed, the t-statistic uses both sample |
| variances and the |
| <a href="http://www.itl.nist.gov/div898/handbook/prc/section3/gifs/nu3.gif"> |
| Welch-Satterwaite approximation</a> is used to compute the degrees |
| of freedom. Methods to return t-statistics and p-values are provided in each |
| case, as well as boolean-valued methods to perform fixed significance |
| level tests. The names of methods or methods that assume equal |
| subpopulation variances always start with "homoscedastic." Test or |
| test-statistic methods that just start with "t" do not assume equal |
| variances. See the examples below and the API documentation for |
| more details.</li> |
| <li>The validity of the p-values returned by the t-test depends on the |
| assumptions of the parametric t-test procedure, as discussed |
| <a href="http://www.basic.nwu.edu/statguidefiles/ttest_unpaired_ass_viol.html"> |
| here</a></li> |
| <li>p-values returned by t-, chi-square and ANOVA tests are exact, based |
| on numerical approximations to the t-, chi-square and F distributions in the |
| <code>distributions</code> package. </li> |
| <li>The G test implementation provides two p-values: |
| <code>gTest(expected, observed)</code>, which is the tail probability beyond |
| <code>g(expected, observed)</code> in the ChiSquare distribution with degrees |
| of freedom one less than the common length of input arrays and |
| <code>gTestIntrinsic(expected, observed)</code> which is the same tail |
| probability computed using a ChiSquare distribution with one less degeree |
| of freedom. </li> |
| <li>p-values returned by t-tests are for two-sided tests and the boolean-valued |
| methods supporting fixed significance level tests assume that the hypotheses |
| are two-sided. One sided tests can be performed by dividing returned p-values |
| (resp. critical values) by 2.</li> |
| <li>Degrees of freedom for G- and chi-square tests are integral values, based on the |
| number of observed or expected counts (number of observed counts - 1).</li> |
| <li> The KolmogorovSmirnov test uses a statistic based on the maximum deviation of |
| the empirical distribution of sample data points from the distribution expected |
| under the null hypothesis. Specifically, what is computed is |
| \(D_n=\sup_x |F_n(x)-F(x)|\), where \(F\) is the expected distribution and |
| \(F_n\) is the empirical distribution of the \(n\) sample data points. Both |
| one-sample tests against a <code>RealDistribution</code> and two-sample tests |
| (comparing two empirical distributions) are supported. For one-sample tests, |
| the distribution of \(D_n\) is estimated using the method in |
| <a href="http://www.jstatsoft.org/v08/i18/">Evaluating Kolmogorov's Distribution</a> by |
| George Marsaglia, Wai Wan Tsang, and Jingbo Wang, with quick decisions in some cases |
| for extreme values using the method described in |
| <a href="http://www.jstatsoft.org/v39/i11/"> Computing the Two-Sided Kolmogorov-Smirnov |
| Distribution</a> by Richard Simard and Pierre L'Ecuyer. In the 2-sample case, estimation |
| by default depends on the number of data points. For small samples, the distribution |
| is computed exactly and for large samples a numerical approximation of the Kolmogorov |
| distribution is used. Methods to perform each type of p-value estimation are also exposed |
| directly. See the class javadoc for details.</li> |
| </ul> |
| </p> |
| <p> |
| <strong>Examples:</strong> |
| <dl> |
| <dt><strong>One-sample <code>t</code> tests</strong></dt> |
| <br/> |
| <dd>To compare the mean of a double[] array to a fixed value: |
| <source> |
| double[] observed = {1d, 2d, 3d}; |
| double mu = 2.5d; |
| System.out.println(TestUtils.t(mu, observed)); |
| </source> |
| The code above will display the t-statistic associated with a one-sample |
| t-test comparing the mean of the <code>observed</code> values against |
| <code>mu.</code> |
| </dd> |
| <dd>To compare the mean of a dataset described by a |
| <a href="../commons-math-docs/apidocs/org/apache/commons/math4/legacy/stat/descriptive/StatisticalSummary.html"> |
| StatisticalSummary</a> to a fixed value: |
| <source> |
| double[] observed ={1d, 2d, 3d}; |
| double mu = 2.5d; |
| SummaryStatistics sampleStats = new SummaryStatistics(); |
| for (int i = 0; i < observed.length; i++) { |
| sampleStats.addValue(observed[i]); |
| } |
| System.out.println(TestUtils.t(mu, observed)); |
| </source> |
| </dd> |
| <dd>To compute the p-value associated with the null hypothesis that the mean |
| of a set of values equals a point estimate, against the two-sided alternative that |
| the mean is different from the target value: |
| <source> |
| double[] observed = {1d, 2d, 3d}; |
| double mu = 2.5d; |
| System.out.println(TestUtils.tTest(mu, observed)); |
| </source> |
| The snippet above will display the p-value associated with the null |
| hypothesis that the mean of the population from which the |
| <code>observed</code> values are drawn equals <code>mu.</code> |
| </dd> |
| <dd>To perform the test using a fixed significance level, use: |
| <source> |
| TestUtils.tTest(mu, observed, alpha); |
| </source> |
| where <code>0 < alpha < 0.5</code> is the significance level of |
| the test. The boolean value returned will be <code>true</code> iff the |
| null hypothesis can be rejected with confidence <code>1 - alpha</code>. |
| To test, for example at the 95% level of confidence, use |
| <code>alpha = 0.05</code> |
| </dd> |
| <br/> |
| <dt><strong>Two-Sample t-tests</strong></dt> |
| <br/> |
| <dd><strong>Example 1:</strong> Paired test evaluating |
| the null hypothesis that the mean difference between corresponding |
| (paired) elements of the <code>double[]</code> arrays |
| <code>sample1</code> and <code>sample2</code> is zero. |
| <p> |
| To compute the t-statistic: |
| <source> |
| TestUtils.pairedT(sample1, sample2); |
| </source> |
| </p> |
| <p> |
| To compute the p-value: |
| <source> |
| TestUtils.pairedTTest(sample1, sample2); |
| </source> |
| </p> |
| <p> |
| To perform a fixed significance level test with alpha = .05: |
| <source> |
| TestUtils.pairedTTest(sample1, sample2, .05); |
| </source> |
| </p> |
| The last example will return <code>true</code> iff the p-value |
| returned by <code>TestUtils.pairedTTest(sample1, sample2)</code> |
| is less than <code>.05</code> |
| </dd> |
| <dd><strong>Example 2: </strong> unpaired, two-sided, two-sample t-test using |
| <code>StatisticalSummary</code> instances, without assuming that |
| subpopulation variances are equal. |
| <p> |
| First create the <code>StatisticalSummary</code> instances. Both |
| <code>DescriptiveStatistics</code> and <code>SummaryStatistics</code> |
| implement this interface. Assume that <code>summary1</code> and |
| <code>summary2</code> are <code>SummaryStatistics</code> instances, |
| each of which has had at least 2 values added to the (virtual) dataset that |
| it describes. The sample sizes do not have to be the same -- all that is required |
| is that both samples have at least 2 elements. |
| </p> |
| <p><strong>Note:</strong> The <code>SummaryStatistics</code> class does |
| not store the dataset that it describes in memory, but it does compute all |
| statistics necessary to perform t-tests, so this method can be used to |
| conduct t-tests with very large samples. One-sample tests can also be |
| performed this way. |
| (See <a href="#1.2 Descriptive statistics">Descriptive statistics</a> for details |
| on the <code>SummaryStatistics</code> class.) |
| </p> |
| <p> |
| To compute the t-statistic: |
| <source> |
| TestUtils.t(summary1, summary2); |
| </source> |
| </p> |
| <p> |
| To compute the p-value: |
| <source> |
| TestUtils.tTest(sample1, sample2); |
| </source> |
| </p> |
| <p> |
| To perform a fixed significance level test with alpha = .05: |
| <source> |
| TestUtils.tTest(sample1, sample2, .05); |
| </source> |
| </p> |
| <p> |
| In each case above, the test does not assume that the subpopulation |
| variances are equal. To perform the tests under this assumption, |
| replace "t" at the beginning of the method name with "homoscedasticT" |
| </p> |
| </dd> |
| <br/> |
| <dt><strong>Chi-square tests</strong></dt> |
| <br/> |
| <dd>To compute a chi-square statistic measuring the agreement between a |
| <code>long[]</code> array of observed counts and a <code>double[]</code> |
| array of expected counts, use: |
| <source> |
| long[] observed = {10, 9, 11}; |
| double[] expected = {10.1, 9.8, 10.3}; |
| System.out.println(TestUtils.chiSquare(expected, observed)); |
| </source> |
| the value displayed will be |
| <code>sum((expected[i] - observed[i])^2 / expected[i])</code> |
| </dd> |
| <dd> To get the p-value associated with the null hypothesis that |
| <code>observed</code> conforms to <code>expected</code> use: |
| <source> |
| TestUtils.chiSquareTest(expected, observed); |
| </source> |
| </dd> |
| <dd> To test the null hypothesis that <code>observed</code> conforms to |
| <code>expected</code> with <code>alpha</code> significance level |
| (equiv. <code>100 * (1-alpha)%</code> confidence) where <code> |
| 0 < alpha < 1 </code> use: |
| <source> |
| TestUtils.chiSquareTest(expected, observed, alpha); |
| </source> |
| The boolean value returned will be <code>true</code> iff the null hypothesis |
| can be rejected with confidence <code>1 - alpha</code>. |
| </dd> |
| <dd>To compute a chi-square statistic statistic associated with a |
| <a href="http://www.itl.nist.gov/div898/handbook/prc/section4/prc45.htm"> |
| chi-square test of independence</a> based on a two-dimensional (long[][]) |
| <code>counts</code> array viewed as a two-way table, use: |
| <source> |
| TestUtils.chiSquareTest(counts); |
| </source> |
| The rows of the 2-way table are |
| <code>count[0], ... , count[count.length - 1]. </code><br/> |
| The chi-square statistic returned is |
| <code>sum((counts[i][j] - expected[i][j])^2/expected[i][j])</code> |
| where the sum is taken over all table entries and |
| <code>expected[i][j]</code> is the product of the row and column sums at |
| row <code>i</code>, column <code>j</code> divided by the total count. |
| </dd> |
| <dd>To compute the p-value associated with the null hypothesis that |
| the classifications represented by the counts in the columns of the input 2-way |
| table are independent of the rows, use: |
| <source> |
| TestUtils.chiSquareTest(counts); |
| </source> |
| </dd> |
| <dd>To perform a chi-square test of independence with <code>alpha</code> |
| significance level (equiv. <code>100 * (1-alpha)%</code> confidence) |
| where <code>0 < alpha < 1 </code> use: |
| <source> |
| TestUtils.chiSquareTest(counts, alpha); |
| </source> |
| The boolean value returned will be <code>true</code> iff the null |
| hypothesis can be rejected with confidence <code>1 - alpha</code>. |
| </dd> |
| <br/> |
| <dt><strong>G tests</strong></dt> |
| <br/> |
| <dd>G tests are an alternative to chi-square tests that are recommended |
| when observed counts are small and / or incidence probabilities for |
| some cells are small. See Ted Dunning's paper, |
| <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.5962"> |
| Accurate Methods for the Statistics of Surprise and Coincidence</a> for |
| background and an empirical analysis showing now chi-square |
| statistics can be misleading in the presence of low incidence probabilities. |
| This paper also derives the formulas used in computing G statistics and the |
| root log likelihood ratio provided by the <code>GTest</code> class. |
| </dd> |
| <dd>To compute a G-test statistic measuring the agreement between a |
| <code>long[]</code> array of observed counts and a <code>double[]</code> |
| array of expected counts, use: |
| <source> |
| double[] expected = new double[]{0.54d, 0.40d, 0.05d, 0.01d}; |
| long[] observed = new long[]{70, 79, 3, 4}; |
| System.out.println(TestUtils.g(expected, observed)); |
| </source> |
| the value displayed will be |
| <code>2 * sum(observed[i]) * log(observed[i]/expected[i])</code> |
| </dd> |
| <dd>To get the p-value associated with the null hypothesis that |
| <code>observed</code> conforms to <code>expected</code> use: |
| <source> |
| TestUtils.gTest(expected, observed); |
| </source> |
| </dd> |
| <dd>To test the null hypothesis that <code>observed</code> conforms to |
| <code>expected</code> with <code>alpha</code> siginficance level |
| (equiv. <code>100 * (1-alpha)%</code> confidence) where <code> |
| 0 < alpha < 1 </code> use: |
| <source> |
| TestUtils.gTest(expected, observed, alpha); |
| </source> |
| The boolean value returned will be <code>true</code> iff the null hypothesis |
| can be rejected with confidence <code>1 - alpha</code>. |
| </dd> |
| <dd>To evaluate the hypothesis that two sets of counts come from the |
| same underlying distribution, use long[] arrays for the counts and |
| <code>gDataSetsComparison</code> for the test statistic |
| <source> |
| long[] obs1 = new long[]{268, 199, 42}; |
| long[] obs2 = new long[]{807, 759, 184}; |
| System.out.println(TestUtils.gDataSetsComparison(obs1, obs2)); // G statistic |
| System.out.println(TestUtils.gTestDataSetsComparison(obs1, obs2)); // p-value |
| </source> |
| </dd> |
| <dd>For 2 x 2 designs, the <code>rootLogLikelihoodRatio</code> method |
| computes the |
| <a href="http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html"> |
| signed root log likelihood ratio.</a> For example, suppose that for two events |
| A and B, the observed count of AB (both occurring) is 5, not A and B (B without A) |
| is 1995, A not B is 0; and neither A nor B is 10000. Then |
| <source> |
| new GTest().rootLogLikelihoodRatio(5, 1995, 0, 100000); |
| </source> |
| returns the root log likelihood associated with the null hypothesis that A |
| and B are independent. |
| </dd> |
| <br/> |
| <dt><strong>One-Way ANOVA tests</strong></dt> |
| <br/> |
| <dd> |
| <source> |
| double[] classA = |
| {93.0, 103.0, 95.0, 101.0, 91.0, 105.0, 96.0, 94.0, 101.0 }; |
| double[] classB = |
| {99.0, 92.0, 102.0, 100.0, 102.0, 89.0 }; |
| double[] classC = |
| {110.0, 115.0, 111.0, 117.0, 128.0, 117.0 }; |
| List classes = new ArrayList(); |
| classes.add(classA); |
| classes.add(classB); |
| classes.add(classC); |
| </source> |
| Then you can compute ANOVA F- or p-values associated with the |
| null hypothesis that the class means are all the same |
| using a <code>OneWayAnova</code> instance or <code>TestUtils</code> |
| methods: |
| <source> |
| double fStatistic = TestUtils.oneWayAnovaFValue(classes); // F-value |
| double pValue = TestUtils.oneWayAnovaPValue(classes); // P-value |
| </source> |
| To test perform a One-Way ANOVA test with significance level set at 0.01 |
| (so the test will, assuming assumptions are met, reject the null |
| hypothesis incorrectly only about one in 100 times), use |
| <source> |
| TestUtils.oneWayAnovaTest(classes, 0.01); // returns a boolean |
| // true means reject null hypothesis |
| </source> |
| </dd> |
| <br/> |
| <dt><strong>Kolmogorov-Smirnov tests</strong></dt> |
| <br/> |
| <dd>Given a double[] array <code>data</code> of values, to evaluate the |
| null hypothesis that the values are drawn from a unit normal distribution |
| <source> |
| final NormalDistribution unitNormal = new NormalDistribution(0d, 1d); |
| TestUtils.kolmogorovSmirnovTest(unitNormal, sample, false) |
| </source> |
| returns the p-value and |
| <source> |
| TestUtils.kolmogorovSmirnovStatistic(unitNormal, sample) |
| </source> |
| returns the D-statistic. |
| <br/> |
| If <code>y</code> is a double array, to evaluate the null hypothesis that |
| <code>x</code> and <code>y</code> are drawn from the same underlying distribution, |
| use |
| <source> |
| TestUtils.kolmogorovSmirnovStatistic(x, y) |
| </source> |
| to compute the D-statistic and |
| <source> |
| TestUtils.kolmogorovSmirnovTest(x, y) |
| </source> |
| for the p-value associated with the null hypothesis that <code>x</code> and |
| <code>y</code> come from the same distribution. By default, here and above strict |
| inequality is used in the null hypothesis - i.e., we evaluate \(H_0 : D_{n,m} > d \). |
| To make the inequality above non-strict, add <code>false</code> as an actual parameter |
| above. For large samples, this parameter makes no difference. |
| <br/> |
| To force exact computation of the p-value (overriding the selection of estimation |
| method), first compute the d-statistic and then use the <code>exactP</code> method |
| <source> |
| final double d = TestUtils.kolmogorovSmirnovStatistic(x, y); |
| TestUtils.exactP(d, x.length, y.length, false) |
| </source> |
| assuming that the non-strict form of the null hypothesis is desired. Note, however, |
| that exact computation for large samples takes a long time. |
| </dd> |
| </dl> |
| </p> |
| </subsection> |
| </section> |
| </body> |
| </document> |