| <?xml version="1.0"?> |
| |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| <document xmlns="http://maven.apache.org/XDOC/2.0" |
| xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" |
| xsi:schemaLocation="http://maven.apache.org/XDOC/2.0 http://maven.apache.org/xsd/xdoc-2.0.xsd"> |
| <properties> |
| <title>Apache Commons Statistics User Guide</title> |
| </properties> |
| |
| <body> |
| |
| <section name="" id="page_title"> |
| <h1>Apache Commons Statistics User Guide</h1> |
| </section> |
| |
| <section name="Contents" id="toc"> |
| <ul> |
| <li> |
| <a href="#overview">Overview</a> |
| </li> |
| <li> |
| <a href="#example-modules">Example Modules</a> |
| </li> |
| <li> |
| <a href="#descriptive">Descriptive Statistics</a> |
| <ul> |
| <li> |
| <a href="#desc_overview">Overview</a> |
| </li> |
| <li> |
| <a href="#desc_examples">Examples</a> |
| </li> |
| </ul> |
| </li> |
| <li> |
| <a href="#distributions">Probability Distributions</a> |
| <ul> |
| <li> |
| <a href="#dist_overview">Overview</a> |
| </li> |
| <li> |
| <a href="#dist_api">API</a> |
| </li> |
| <li> |
| <a href="#dist_imp_details">Implementation Details</a> |
| </li> |
| <li> |
| <a href="#dist_complements">Complementary Probabilities</a> |
| </li> |
| </ul> |
| </li> |
| <li> |
| <a href="#inference">Inference</a> |
| <ul> |
| <li> |
| <a href="#inference_overview">Overview</a> |
| </li> |
| <li> |
| <a href="#inference_examples">Examples</a> |
| </li> |
| </ul> |
| </li> |
| <li> |
| <a href="#ranking">Ranking</a> |
| </li> |
| </ul> |
| </section> |
| |
| <section name="Overview" id="overview"> |
| <p> |
| Apache Commons Statistics provides utilities for statistical applications. The code |
| originated in the <code><a href="https://commons.apache.org/proper/commons-math/"> |
| commons-math</a></code> project but was pulled out into a separate project for better |
| maintainability and has since undergone numerous improvements. |
| </p> |
| |
| <p> |
| Commons Statistics is divided into a number of submodules: |
| </p> |
| <ul> |
| <li> |
| <code><a href="../commons-statistics-descriptive/index.html"> |
| commons-statistics-descriptive</a></code> - Provides computation |
| of descriptive statistics (mean, variance, median, etc.). |
| </li> |
| <li> |
| <code><a href="../commons-statistics-distribution/index.html"> |
| commons-statistics-distribution</a></code> - Provides interfaces |
| and classes for probability distributions. |
| </li> |
| <li> |
| <code><a href="../commons-statistics-inference/index.html"> |
| commons-statistics-inference</a></code> - Provides hypothesis testing. |
| </li> |
| <li> |
| <code><a href="../commons-statistics-ranking/index.html"> |
| commons-statistics-ranking</a></code> - Provides rank transformations. |
| </li> |
| </ul> |
| </section> |
| |
| <section name="Example Modules" id="example-modules"> |
| <p> |
| In addition to the modules above, the Commons Statistics |
| <a href="https://commons.apache.org/statistics/download_statistics.cgi">source distribution</a> |
| contains example code demonstrating library functionality and/or providing useful |
| development utilities. These modules are not part of the public API of the library and no |
| guarantees are made concerning backwards compatibility. The |
| <a href="../commons-statistics-examples/modules.html">example module parent page</a> |
| contains a listing of available modules. |
| </p> |
| <hr/> |
| </section> |
| |
| <section name="Descriptive Statistics" id="descriptive"> |
| <p> |
| The <code>commons-statistics-descriptive</code> module provides descriptive statistics. |
| </p> |
| <subsection name="Overview" id="desc_overview"> |
| <p> |
| The module provides classes to compute univariate statistics on <code>double</code>, |
| <code>int</code> and <code>long</code> data using array input or a Java stream. The |
| result is returned as a |
| <a href="../commons-statistics-descriptive/apidocs/org/apache/commons/statistics/descriptive/StatisticResult.html">StatisticResult</a>. |
| The <code>StatisticResult</code> provides methods to supply the result as a |
| <code>double</code>, <code>int</code>, <code>long</code> and <code>BigInteger</code>. |
| The integer types allow the exact result to be returned for integer data. For example |
| the sum of <code>long</code> values may not be exactly representable as a |
| <code>double</code> and may overflow a <code>long</code>. |
| </p> |
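| <p> |
| The overflow and rounding issue for a sum of <code>long</code> values can be reproduced |
| with the JDK alone. The following is a minimal sketch, not the library implementation: |
| the class <code>ExactLongSum</code> and its methods are hypothetical names, and |
| <code>BigInteger</code> is used here only to obtain a reference result. |
| </p> |

```java
import java.math.BigInteger;

/** Hypothetical demonstration class; not part of Commons Statistics. */
class ExactLongSum {
    /** Sums the values exactly using arbitrary-precision arithmetic. */
    static BigInteger exactSum(long... values) {
        BigInteger sum = BigInteger.ZERO;
        for (long v : values) {
            sum = sum.add(BigInteger.valueOf(v));
        }
        return sum;
    }

    /** Sums the values in a long; the total silently wraps on overflow. */
    static long wrappingSum(long... values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] values = {Long.MAX_VALUE, Long.MAX_VALUE, 1};
        BigInteger exact = exactSum(values);
        System.out.println(exact);               // 18446744073709551615 (2^64 - 1)
        System.out.println(wrappingSum(values)); // -1 (the long total overflowed)
        // The nearest double is 2^64, which differs from the exact sum by 1
        System.out.println(exact.doubleValue() == 0x1.0p64); // true
    }
}
```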
| <p> |
| Computation of an individual statistic involves creating an instance of |
| <code>StatisticResult</code> that can supply the current statistic value. |
| To allow addition of single values to update the statistic, instances |
| implement the primitive consumer interface for the supported type: |
| <code>DoubleConsumer</code>, <code>IntConsumer</code>, or <code>LongConsumer</code>. |
| Instances implement the |
| <a href="../commons-statistics-descriptive/apidocs/org/apache/commons/statistics/descriptive/StatisticAccumulator.html">StatisticAccumulator</a> |
| interface and can be combined with other instances. This allows computation in parallel on |
| subsets of data and combination to a final result. This can be performed using the |
| Java stream API. |
| </p> |
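| <p> |
| The accept-and-combine pattern described above can be sketched with a hand-written |
| accumulator. The class <code>MeanSketch</code> is hypothetical and only illustrates the |
| shape of the contract (a primitive consumer plus a <code>combine</code> method); it is |
| not how the library computes the mean, and it assumes the <code>long</code> total does |
| not overflow. |
| </p> |

```java
import java.util.function.IntConsumer;
import java.util.stream.Stream;

/** Hypothetical accumulator; not a class of Commons Statistics. */
class MeanSketch implements IntConsumer {
    private long sum;
    private long n;

    /** Adds a single value to the running totals. */
    @Override
    public void accept(int value) {
        sum += value;
        n++;
    }

    /** Merges the totals of another instance into this one. */
    MeanSketch combine(MeanSketch other) {
        sum += other.sum;
        n += other.n;
        return this;
    }

    /** Returns the current mean (NaN when no values were added). */
    double getAsDouble() {
        return (double) sum / n;
    }

    /** Computes the mean word length; each worker fills its own instance. */
    static double meanLength(String... words) {
        return Stream.of(words)
            .parallel()
            .mapToInt(String::length)
            .collect(MeanSketch::new, MeanSketch::accept, MeanSketch::combine)
            .getAsDouble();
    }

    public static void main(String[] args) {
        System.out.println(meanLength("one", "two", "three", "four")); // 3.75
    }
}
```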
| <p> |
| Computation of multiple statistics uses a |
| <a href="../commons-statistics-descriptive/apidocs/org/apache/commons/statistics/descriptive/Statistic.html">Statistic</a> |
| enumeration to define the statistics to evaluate. A container class is created to |
| compute the desired statistics together and allows multiple statistics to be computed |
| concurrently using the Java stream API. Each statistic result is obtained using the |
| <code>Statistic</code> enum to access the required value. Providing a choice of the |
| statistics allows the user to avoid the computational cost of results that are not |
| required. |
| </p> |
| <p> |
| Note that <code>double</code> computations are subject to accumulated floating-point |
| rounding which can generate different results from permuted input data. Computation |
| on an array of <code>double</code> data can use a multiple-pass algorithm to increase |
| accuracy over a single-pass stream of values. This is the recommended approach if |
| all data is already stored in an array (i.e. is not dynamically generated). |
| </p> |
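| <p> |
| The dependence on input order can be seen with a naive single-pass sum in plain Java. |
| This sketch uses a hypothetical class <code>SummationOrder</code> and is not the |
| algorithm used by the library; it only shows why permuted data may give different results. |
| </p> |

```java
/** Hypothetical demonstration class; not part of Commons Statistics. */
class SummationOrder {
    /** Naive left-to-right summation in double precision. */
    static double sum(double... values) {
        double s = 0;
        for (double v : values) {
            s += v;
        }
        return s;
    }

    public static void main(String[] args) {
        // 1e16 + 1 rounds back to 1e16 (doubles are spaced 2 apart there),
        // so the 1 is lost before the large values cancel
        System.out.println(sum(1e16, 1, -1e16)); // 0.0
        // Cancelling the large values first preserves the 1
        System.out.println(sum(1e16, -1e16, 1)); // 1.0
    }
}
```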
| <p> |
| If the data is an integer type then it is |
| preferred to use the integer specializations of the statistics. |
| Many implementations use exact integer math for the computation. This is faster than |
| using a <code>double</code> data type, more accurate and returns the same result |
| irrespective of the input order of the data. Note that for improved performance there |
| is no use of <code>BigInteger</code> in the accumulation of intermediate values; the |
| computation uses mutable fixed-precision integer classes for totals that may |
| overflow 64-bits. |
| </p> |
| <p> |
| Some statistics cannot be computed using a stream since they require all values for |
| computation, for example the median. These are evaluated on an array using an instance |
| of a computing class. The instance allows computation options to be changed. Instances |
| are immutable and the computation is thread-safe. |
| </p> |
| </subsection> |
| <subsection name="Examples" id="desc_examples"> |
| <p> |
| Computation of a single statistic from an array of values, or a stream of data: |
| </p> |
| <source class="prettyprint"> |
| int[] values = {1, 1, 2, 3, 5, 8, 13, 21}; |
| |
| double v = IntVariance.of(values).getAsDouble(); |
| |
| double m = Stream.of("one", "two", "three", "four") |
| .mapToInt(String::length) |
| .collect(IntMean::create, IntMean::accept, IntMean::combine) |
| .getAsDouble(); |
| </source> |
| <p> |
| Computation of multiple statistics uses the <code>Statistic</code> enum. |
| These can be specified using an <code>EnumSet</code> together with the input array data. |
| Note that some statistics share the same underlying computation, for example the variance, |
| standard deviation and mean. When a container class is constructed using one of the |
| statistics, the other co-computed statistics are available in the result even if not |
| specified during construction. The <code>isSupported</code> method can |
| identify all results that are available from the container class. |
| </p> |
| <source class="prettyprint"> |
| double[] data = {1, 2, 3, 4, 5, 6, 7, 8}; |
| DoubleStatistics stats = DoubleStatistics.of( |
| EnumSet.of(Statistic.MIN, Statistic.MAX, Statistic.VARIANCE), |
| data); |
| |
| stats.getAsDouble(Statistic.MIN); // 1.0 |
| stats.getAsDouble(Statistic.MAX); // 8.0 |
| stats.getAsDouble(Statistic.VARIANCE); // 6.0 |
| |
| // Get other statistics supported by the underlying computations |
| stats.isSupported(Statistic.STANDARD_DEVIATION); // true |
| stats.getAsDouble(Statistic.STANDARD_DEVIATION); // 2.449... |
| </source> |
| <p> |
| Computation of multiple statistics on individual values can accumulate the results |
| using the <code>accept</code> method of the container class: |
| </p> |
| <source class="prettyprint"> |
| IntStatistics stats = IntStatistics.of( |
| Statistic.MIN, Statistic.MAX, Statistic.MEAN); |
| Stream.of("one", "two", "three", "four") |
| .mapToInt(String::length) |
| .forEach(stats::accept); |
| |
| stats.getAsInt(Statistic.MIN); // 3 |
| stats.getAsInt(Statistic.MAX); // 5 |
| stats.getAsDouble(Statistic.MEAN); // 15.0 / 4 |
| </source> |
| <p> |
| Computation of multiple statistics on a stream of values in parallel. |
| This requires use of a <code>Builder</code> that |
| can supply instances of the container class to each worker with the |
| <code>build</code> method; populated using <code>accept</code>; and then collected |
| using <code>combine</code>: |
| </p> |
| <source class="prettyprint"> |
| IntStatistics.Builder builder = IntStatistics.builder( |
| Statistic.MIN, Statistic.MAX, Statistic.MEAN); |
| IntStatistics stats = Stream.of("one", "two", "three", "four") |
| .parallel() |
| .mapToInt(String::length) |
| .collect(builder::build, IntConsumer::accept, IntStatistics::combine); |
| |
| stats.getAsInt(Statistic.MIN); // 3 |
| stats.getAsInt(Statistic.MAX); // 5 |
| stats.getAsDouble(Statistic.MEAN); // 15.0 / 4 |
| </source> |
| <p> |
| Computation on multiple arrays. This requires use of a <code>Builder</code> that |
| can supply instances of the container class to compute each array with the |
| <code>build</code> method: |
| </p> |
| <source class="prettyprint"> |
| double[][] data = { |
| {1, 2, 3, 4}, |
| {5, 6, 7, 8}, |
| }; |
| DoubleStatistics.Builder builder = DoubleStatistics.builder( |
| Statistic.MIN, Statistic.MAX, Statistic.VARIANCE); |
| DoubleStatistics stats = Arrays.stream(data) |
| .map(builder::build) |
| .reduce(DoubleStatistics::combine) |
| .get(); |
| |
| stats.getAsDouble(Statistic.MIN); // 1.0 |
| stats.getAsDouble(Statistic.MAX); // 8.0 |
| stats.getAsDouble(Statistic.VARIANCE); // 6.0 |
| |
| // Get other statistics supported by the underlying computations |
| stats.isSupported(Statistic.MEAN); // true |
| stats.getAsDouble(Statistic.MEAN); // 4.5 |
| </source> |
| <p> |
| If computation on multiple arrays is to be repeated then this can be done with a |
| re-useable <code>java.util.stream.Collector</code>: |
| </p> |
| <source class="prettyprint"> |
| double[][] data = { |
| {1, 2, 3, 4}, |
| {5, 6, 7, 8}, |
| }; |
| DoubleStatistics.Builder builder = DoubleStatistics.builder( |
| Statistic.MIN, Statistic.MAX, Statistic.VARIANCE); |
| Collector<double[], DoubleStatistics, DoubleStatistics> collector = |
| Collector.of(builder::build, (s, d) -> s.combine(builder.build(d)), DoubleStatistics::combine); |
| DoubleStatistics stats = Arrays.stream(data).collect(collector); |
| |
| stats.getAsDouble(Statistic.MIN); // 1.0 |
| stats.getAsDouble(Statistic.MAX); // 8.0 |
| stats.getAsDouble(Statistic.VARIANCE); // 6.0 |
| </source> |
| <p> |
| Combination of multiple statistics requires the containers to be compatible: a container |
| can be combined with another only if the other supports all of the statistics it computes. |
| Any additional statistics in the other container are ignored, so compatibility |
| may be asymmetric. |
| </p> |
| <source class="prettyprint"> |
| double[] data1 = {1, 2, 3, 4}; |
| double[] data2 = {5, 6, 7, 8}; |
| DoubleStatistics varStats = DoubleStatistics.builder(Statistic.VARIANCE).build(data1); |
| DoubleStatistics meanStats = DoubleStatistics.builder(Statistic.MEAN).build(data2); |
| |
| // throws IllegalArgumentException |
| varStats.combine(meanStats); |
| |
| // OK - mean is updated to 4.5 |
| meanStats.combine(varStats); |
| </source> |
| <p> |
| Computation of a statistic that requires all data (i.e. does not support the |
| <code>Stream</code> API) uses a configurable instance of the computing class: |
| </p> |
| <source class="prettyprint"> |
| double[] data = {8, 7, 6, 5, 4, 3, 2, 1}; |
| // Configure the statistic |
| double m = Median.withDefaults() |
| .withCopy(true) // do not modify the input array |
| .with(NaNPolicy.ERROR) // raise an exception for NaN |
| .evaluate(data); |
| // m = 4.5 |
| </source> |
| <p> |
| Computation of multiple values of a statistic that requires all data: |
| </p> |
| <source class="prettyprint"> |
| int size = 10000; |
| double origin = 0; |
| double bound = 100; |
| double[] data = |
| new SplittableRandom(123) |
| .doubles(size, origin, bound) |
| .toArray(); |
| // Evaluate multiple statistics on the same data |
| double[] q = Quantile.withDefaults() |
| .evaluate(data, 0.25, 0.5, 0.75); // probabilities |
| // q ~ [25.0, 50.0, 75.0] |
| </source> |
| </subsection> |
| </section> |
| |
| <section name="Probability Distributions" id="distributions"> |
| <subsection name="Overview" id="dist_overview"> |
| <p> |
| The <code>commons-statistics-distribution</code> module provides a framework and implementations for some commonly used |
| probability distributions. Continuous univariate distributions are represented by |
| implementations of the |
| <a href="../commons-statistics-distribution/apidocs/org/apache/commons/statistics/distribution/ContinuousDistribution.html">ContinuousDistribution</a> |
| interface. Discrete distributions implement |
| <a href="../commons-statistics-distribution/apidocs/org/apache/commons/statistics/distribution/DiscreteDistribution.html">DiscreteDistribution</a> |
| (values must be mapped to integers). |
| </p> |
| </subsection> |
| <subsection name="API" id="dist_api"> |
| <p> |
| The distribution framework provides the means to compute probability density, |
| probability mass and cumulative probability functions for several well-known |
| discrete (integer-valued) and continuous probability distributions. |
| The API also allows for the computation of inverse cumulative probabilities |
| and sampling from distributions. |
| </p> |
| <p> |
| For an instance <code>f</code> of a distribution <code>F</code>, |
| and a domain value, <code>x</code>, <code>f.cumulativeProbability(x)</code> |
| computes <code>P(X <= x)</code> where <code>X</code> is a random variable distributed |
| as <code>F</code>. The complement of the cumulative probability, |
| <code>f.survivalProbability(x)</code> computes <code>P(X > x)</code>. Note that |
| the survival probability is approximately equal to <code>1 - P(X <= x)</code> but |
| does not suffer from cancellation error as the cumulative probability approaches 1. |
| The cancellation error may cause a (total) loss of accuracy when |
| <code>P(X <= x) ~ 1</code> |
| (see <a href="#dist_complements">complementary probabilities</a>). |
| </p> |
| <source class="prettyprint"> |
| TDistribution t = TDistribution.of(29); |
| double lowerTail = t.cumulativeProbability(-2.656); // P(T(29) <= -2.656) |
| double upperTail = t.survivalProbability(2.75); // P(T(29) > 2.75) |
| </source> |
| <p> |
| For <a href="../commons-statistics-distribution/apidocs/org/apache/commons/statistics/distribution/DiscreteDistribution.html">discrete</a> |
| <code>F</code>, the probability mass function is given by <code>f.probability(x)</code>. |
| For <a href="../commons-statistics-distribution/apidocs/org/apache/commons/statistics/distribution/ContinuousDistribution.html">continuous</a> |
| <code>F</code>, the probability density function is given by <code>f.density(x)</code>. |
| Distributions also implement <code>f.probability(x1, x2)</code> for computing |
| <code>P(x1 < X <= x2)</code>. |
| </p> |
| <source class="prettyprint"> |
| PoissonDistribution pd = PoissonDistribution.of(1.23); |
| double p1 = pd.probability(5); |
| double p2 = pd.probability(5, 5); |
| double p3 = pd.probability(4, 5); |
| // p2 == 0 |
| // p1 == p3 |
| </source> |
| <p> |
| Inverse distribution functions can be computed using the |
| <code>inverseCumulativeProbability</code> and <code>inverseSurvivalProbability</code> |
| methods. For continuous <code>F</code> and <code>p</code> a probability, |
| <code>f.inverseCumulativeProbability(p)</code> returns |
| </p> |
| <p> |
| \[ x = \begin{cases} |
| \inf \{ x \in \mathbb R : P(X \le x) \ge p\} & \text{for } 0 \lt p \le 1 \\ |
| \inf \{ x \in \mathbb R : P(X \le x) \gt 0 \} & \text{for } p = 0 |
| \end{cases} \] |
| </p> |
| <p> |
| where <code>X</code> is distributed as <code>F</code>.<br/> |
| Likewise <code>f.inverseSurvivalProbability(p)</code> returns |
| </p> |
| <p> |
| \[ x = \begin{cases} |
| \inf \{ x \in \mathbb R : P(X \gt x) \le p\} & \text{for } 0 \le p \lt 1 \\ |
| \inf \{ x \in \mathbb R : P(X \gt x) \lt 1 \} & \text{for } p = 1 |
| \end{cases} \] |
| </p> |
| <source class="prettyprint"> |
| NormalDistribution n = NormalDistribution.of(0, 1); |
| double x1 = n.inverseCumulativeProbability(1e-300); |
| double x2 = n.inverseSurvivalProbability(1e-300); |
| // x1 == -x2 ~ -37.0471 |
| </source> |
| <p> |
| For discrete <code>F</code>, the definition is the same, with \( \mathbb Z \) |
| (the integers) in place of \( \mathbb R \). Note that, in the discrete case, |
| the strict inequality on \( p \) in the definition can make a difference when |
| \( p \) is an attained value of the distribution. For example moving to the next |
| larger value of \( p \) will return the value \( x + 1 \) for inverse CDF. |
| </p> |
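| <p> |
| The discrete definition can be sketched for a finite probability mass table. The class |
| <code>DiscreteInverseCdf</code> and its parameters are hypothetical and chosen for |
| illustration; this is a direct transcription of the definition, not the search used by |
| the library. |
| </p> |

```java
/** Hypothetical demonstration class; not part of Commons Statistics. */
class DiscreteInverseCdf {
    /**
     * Returns the smallest integer x with P(X <= x) >= p, or for p = 0
     * the smallest x with P(X <= x) > 0.
     *
     * @param pmf   probabilities, where pmf[i] = P(X = lower + i)
     * @param lower first integer covered by the table
     * @param p     cumulative probability in [0, 1]
     */
    static int inverseCdf(double[] pmf, int lower, double p) {
        if (!(p >= 0 && p <= 1)) {
            throw new IllegalArgumentException("Invalid probability: " + p);
        }
        double cdf = 0;
        for (int i = 0; i < pmf.length; i++) {
            cdf += pmf[i];
            if (p == 0 ? cdf > 0 : cdf >= p) {
                return lower + i;
            }
        }
        // Guard against a rounded total slightly below 1
        return lower + pmf.length - 1;
    }

    public static void main(String[] args) {
        double[] pmf = {0.25, 0.5, 0.25}; // CDF attains 0.25, 0.75, 1.0
        System.out.println(inverseCdf(pmf, 0, 0.25));              // 0
        // The next larger p moves the result to x + 1
        System.out.println(inverseCdf(pmf, 0, Math.nextUp(0.25))); // 1
        System.out.println(inverseCdf(pmf, 0, 1.0));               // 2
    }
}
```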
| <p> |
| All distributions provide accessors for the parameters used to create the distribution, |
| and a mean and variance. The return value when the mean or variance |
| is undefined is noted in the class javadoc. |
| </p> |
| <source class="prettyprint"> |
| ChiSquaredDistribution chi2 = ChiSquaredDistribution.of(42); |
| double df = chi2.getDegreesOfFreedom(); // 42 |
| double mean = chi2.getMean(); // 42 |
| double variance = chi2.getVariance(); // 84 |
| |
| CauchyDistribution cauchy = CauchyDistribution.of(1.23, 4.56); |
| double location = cauchy.getLocation(); // 1.23 |
| double scale = cauchy.getScale(); // 4.56 |
| double undefined1 = cauchy.getMean(); // NaN |
| double undefined2 = cauchy.getVariance(); // NaN |
| </source> |
| <p> |
| The supported domain of the distribution is provided by the |
| <code>getSupportLowerBound</code> and <code>getSupportUpperBound</code> methods. |
| </p> |
| <source class="prettyprint"> |
| BinomialDistribution b = BinomialDistribution.of(13, 0.15); |
| int lower = b.getSupportLowerBound(); // 0 |
| int upper = b.getSupportUpperBound(); // 13 |
| </source> |
| <p> |
| All distributions implement a <code>createSampler(UniformRandomProvider rng)</code> |
| method to support random sampling from the distribution, where <code>UniformRandomProvider</code> |
| is an interface defined in <a href="https://commons.apache.org/proper/commons-rng/">Commons RNG</a>. |
| The sampler is a functional interface whose functional method is <code>sample()</code>, |
| suitable for generation of <code>double</code> or <code>int</code> samples. |
| Default <code>samples()</code> methods are provided to create a |
| <code>DoubleStream</code> or <code>IntStream</code>. |
| </p> |
| <source class="prettyprint"> |
| // From Commons RNG Simple |
| UniformRandomProvider rng = RandomSource.KISS.create(123L); |
| |
| NormalDistribution n = NormalDistribution.of(0, 1); |
| double x = n.createSampler(rng).sample(); |
| |
| // Generate a number of samples |
| GeometricDistribution g = GeometricDistribution.of(0.75); |
| int[] k = g.createSampler(rng).samples(100).toArray(); |
| // k.length == 100 |
| </source> |
| <p> |
| Note that even when distributions are immutable, the sampler is not immutable as it |
| depends on the instance of the mutable <code>UniformRandomProvider</code>. Generation of |
| many samples in a multi-threaded application should use a separate instance of |
| <code>UniformRandomProvider</code> per thread. Any synchronization should be avoided |
| for best performance. By default the streams returned from the <code>samples()</code> |
| methods are sequential. |
| </p> |
| </subsection> |
| <subsection name="Implementation Details" id="dist_imp_details"> |
| <p> |
| Instances are constructed using factory methods, typically a static method in the |
| distribution class named <code>of</code>. This allows the returned instance |
| to be specialised to the distribution parameters. |
| </p> |
| <p> |
| Exceptions will be raised by the factory method when constructing the distribution |
| using invalid parameters. See the class javadoc for exception conditions. |
| </p> |
| <p> |
| Unless otherwise noted, distribution instances are immutable. This allows sharing |
| an instance between threads for computations. |
| </p> |
| <p> |
| Exceptions will not be raised by distributions for an invalid <code>x</code> argument |
| to probability functions. Typically the cumulative probability functions will return |
| 0 or 1 for an out-of-domain argument, depending on which side of the domain bound |
| the argument falls, and the density or probability mass functions return 0. |
| Return values for <code>x</code> arguments when the result is |
| undefined should be documented in the class javadoc. For example the beta distribution |
| is undefined for <code>x = 0, alpha < 1</code> or <code>x = 1, beta < 1</code>. |
| Note: This out-of-domain behaviour may be different from distributions in the |
| <code>org.apache.commons.math3.distribution</code> package. Users upgrading from |
| <code><a href="https://commons.apache.org/proper/commons-math/">commons-math</a></code> |
| should check the appropriate class javadoc. |
| </p> |
| <p> |
| An exception will be raised by distributions for an invalid <code>p</code> argument |
| to inverse probability functions. The argument must be in the range <code>[0, 1]</code>. |
| </p> |
| </subsection> |
| <subsection name="Complementary Probabilities" id="dist_complements"> |
| <p> |
| The distributions provide the cumulative probability <code>p</code> and its complement, |
| the survival probability, <code>q = 1 - p</code>. When the probability |
| <code>q</code> is small, computing it as <code>1 - p</code> from the cumulative |
| probability can result in a dramatic loss of accuracy. This is because floating-point |
| numbers approximately follow a |
| <a href="https://en.wikipedia.org/wiki/Reciprocal_distribution">log-uniform</a> |
| distribution: there are far more representable numbers as the probability value |
| approaches zero than when it approaches one. |
| </p> |
| <p> |
| The difference is illustrated with the result of computing the upper tail of a |
| probability distribution. |
| </p> |
| <source class="prettyprint"> |
| ChiSquaredDistribution chi2 = ChiSquaredDistribution.of(42); |
| double q1 = 1 - chi2.cumulativeProbability(168); |
| double q2 = chi2.survivalProbability(168); |
| // q1 == 0 |
| // q2 != 0 |
| </source> |
| <p> |
| In this case the value <code>1 - p</code> has only a single bit of information as |
| <code>x</code> approaches 168. For example the value <code>1 - p(x=167)</code> |
| is <code>2<sup>-53</sup></code> (or approximately <code>1.11e-16</code>). |
| The complement <code>q</code> retains information |
| much further into the long tail as shown in the following table: |
| </p> |
| <table border="1" style="width: auto"> |
| <tr><th colspan="3"><font size="+1">Chi-squared distribution, 42 degrees of freedom</font></th></tr> |
| <tr><th>x</th><th>1 - p</th><th>q</th></tr> |
| <tr><td>166</td><td>1.11e-16</td><td>1.16e-16</td></tr> |
| <tr><td>167</td><td>1.11e-16</td><td>7.96e-17</td></tr> |
| <tr><td>168</td><td>0</td><td>5.43e-17</td></tr> |
| <tr><td>...</td><td></td><td></td></tr> |
| <tr><td>200</td><td>0</td><td>1.19e-22</td></tr> |
| </table> |
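| <p> |
| The granularity seen in the <code>1 - p</code> column can be reproduced with plain |
| floating-point arithmetic. This sketch uses a hypothetical class |
| <code>ComplementCancellation</code> and takes the tail values from the table as inputs; |
| it does not call the library. |
| </p> |

```java
/** Hypothetical demonstration class; not part of Commons Statistics. */
class ComplementCancellation {
    /** Stores q as p = 1 - q and then recovers it by subtraction. */
    static double roundTrip(double q) {
        double p = 1 - q;
        return 1 - p;
    }

    public static void main(String[] args) {
        // Spacing of doubles just below 1 is 2^-53
        System.out.println(1.0 - Math.nextDown(1.0)); // 1.1102230246251565E-16
        // Tail probabilities below half that spacing are lost entirely
        System.out.println(roundTrip(5.43e-17));      // 0.0
        System.out.println(roundTrip(1.19e-22));      // 0.0
        // Near zero, normal doubles extend down to about 2.2e-308
        System.out.println(Double.MIN_NORMAL);        // 2.2250738585072014E-308
    }
}
```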
| <p> |
| Probability computations should use the appropriate cumulative or survival function |
| to calculate the lower or upper tail respectively. The same care should be applied |
| when inverting probability distributions. Prefer to work with whichever of |
| <code>p</code> or <code>q</code> is at most 0.5, since it can be computed without loss |
| of accuracy, and then invert the cumulative probability using <code>p</code> or the |
| survival probability using <code>q</code> to obtain <code>x</code>. |
| </p> |
| <source class="prettyprint"> |
| ChiSquaredDistribution chi2 = ChiSquaredDistribution.of(42); |
| double q = 5.43e-17; |
| // Incorrect: p = 1 - q == 1.0 !!! |
| double x1 = chi2.inverseCumulativeProbability(1 - q); |
| // Correct: invert q |
| double x2 = chi2.inverseSurvivalProbability(q); |
| // x1 == +infinity |
| // x2 ~ 168.0 |
| </source> |
| <p> |
| Note: The survival probability functions were not present in the |
| <code>org.apache.commons.math3.distribution</code> package. Users upgrading from |
| <code><a href="https://commons.apache.org/proper/commons-math/">commons-math</a></code> |
| should update usage of the cumulative probability functions where appropriate. |
| </p> |
| </subsection> |
| </section> |
| |
| <section name="Inference" id="inference"> |
| <p> |
| The <code>commons-statistics-inference</code> module provides hypothesis testing. |
| </p> |
| <subsection name="Overview" id="inference_overview"> |
| <p> |
| The module provides test classes that implement a single statistical test, or a family |
| of tests. Each test class provides methods to compute a test statistic and a p-value for the |
| significance of the statistic. These can be computed together using a <code>test</code> |
| method and returned as a |
| <a href="../commons-statistics-inference/apidocs/org/apache/commons/statistics/inference/SignificanceResult.html">SignificanceResult</a>. |
| The <code>SignificanceResult</code> has a method that can be used to <code>reject</code> |
| the null hypothesis at the provided significance level. Test classes may extend the |
| <code>SignificanceResult</code> to return more information about the test result, |
| for example the computed degrees of freedom. |
| </p> |
| <p> |
| Alternatively a <code>statistic</code> method is provided to compute <i>only</i> the |
| statistic as a <code>double</code> value. This statistic can be compared to a pre-computed |
| critical value, for example from a table of critical values. |
| </p> |
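| <p> |
| Comparing a statistic to a critical value can be sketched without the library for a |
| chi-square goodness-of-fit test. The class <code>ChiSquareStatisticSketch</code> is |
| hypothetical. The sketch assumes the expected values are proportions that are rescaled |
| to the observed total, and the closed-form critical value holds only for 2 degrees of |
| freedom, where the survival function is <code>exp(-x / 2)</code>. |
| </p> |

```java
/** Hypothetical demonstration class; not part of Commons Statistics. */
class ChiSquareStatisticSketch {
    /** Computes the sum of (O - E)^2 / E with E rescaled to the observed total. */
    static double statistic(double[] expected, long[] observed) {
        double sumE = 0;
        long sumO = 0;
        for (int i = 0; i < observed.length; i++) {
            sumE += expected[i];
            sumO += observed[i];
        }
        double ratio = sumO / sumE;
        double chi2 = 0;
        for (int i = 0; i < observed.length; i++) {
            double e = expected[i] * ratio;
            double d = observed[i] - e;
            chi2 += d * d / e;
        }
        return chi2;
    }

    /** Critical value at significance alpha; valid for 2 degrees of freedom only. */
    static double criticalValueDf2(double alpha) {
        return -2 * Math.log(alpha);
    }

    public static void main(String[] args) {
        double[] expected = {0.25, 0.5, 0.25};
        long[] observed = {57, 123, 38};
        double stat = statistic(expected, observed);          // ~6.908
        System.out.println(stat > criticalValueDf2(0.05));    // true  (critical ~5.991)
        System.out.println(stat > criticalValueDf2(0.01));    // false (critical ~9.210)
    }
}
```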
| <p> |
| A test is obtained using the <code>withDefaults()</code> method to return the test with |
| all options set to their default value. Any test options can be configured using |
| property change methods to return a new instance of the test. Tests that support an |
| <a href="../commons-statistics-inference/apidocs/org/apache/commons/statistics/inference/AlternativeHypothesis.html"> |
| alternative hypothesis</a> will use a two-sided test by default. Tests that support multiple |
| <a href="../commons-statistics-inference/apidocs/org/apache/commons/statistics/inference/PValueMethod.html"> |
| p-value methods</a> will default to an appropriate computation for the size of the input |
| data. Unless otherwise noted test instances are immutable. |
| </p> |
| </subsection> |
| <subsection name="Examples" id="inference_examples"> |
| <p> |
| A chi-square test that the observed counts conform to the expected frequencies. |
| </p> |
| <source class="prettyprint"> |
| double[] expected = {0.25, 0.5, 0.25}; |
| long[] observed = {57, 123, 38}; |
| |
| SignificanceResult result = ChiSquareTest.withDefaults() |
| .test(expected, observed); |
| result.getPValue(); // 0.0316148 |
| result.reject(0.05); // true |
| result.reject(0.01); // false |
| </source> |
| <p> |
| A paired t-test that the students' marks in the math exam were greater than the science |
| exam. This fails to reject the null hypothesis (that there was no difference) with |
| 95% confidence. |
| </p> |
| <source class="prettyprint"> |
| double[] math = {53, 69, 65, 65, 67, 79, 86, 65, 62, 69}; // mean = 68.0 |
| double[] science = {75, 65, 68, 63, 55, 65, 73, 45, 51, 52}; // mean = 61.2 |
| |
| SignificanceResult result = TTest.withDefaults() |
| .with(AlternativeHypothesis.GREATER_THAN) |
| .pairedTest(math, science); |
| result.getPValue(); // 0.05764 |
| result.reject(0.05); // false |
| </source> |
| <p> |
| A G-test that the allele frequencies conform to the expected Hardy-Weinberg proportions. |
| This is an example of an intrinsic hypothesis where the expected frequencies are computed |
| using the observations and the degrees of freedom must be adjusted. |
| The data is from McDonald (1989) Selection component analysis |
| of the Mpi locus in the amphipod Platorchestia platensis. |
| <i>Heredity</i> <b>62</b>: 243-249. |
| </p> |
| <source class="prettyprint"> |
| // Allele frequencies: Mpi 90/90, Mpi 90/100, Mpi 100/100 |
| long[] observed = {1203, 2919, 1678}; |
| // Mpi 90 proportion |
| double p = (2.0 * observed[0] + observed[1]) / |
| (2 * Arrays.stream(observed).sum()); // 5325 / 11600 = 0.459 |
| |
| // Hardy-Weinberg proportions |
| double[] expected = {p * p, 2 * p * (1 - p), (1 - p) * (1 - p)}; |
| // 0.211, 0.497, 0.293 |
| |
| SignificanceResult result = GTest.withDefaults() |
| .withDegreesOfFreedomAdjustment(1) |
| .test(expected, observed); |
| result.getStatistic(); // 1.03 |
| result.getPValue(); // 0.309 |
| result.reject(0.05); // false |
| </source> |
| <p> |
| A one-way analysis of variance test. This is an example where the result has more |
| information than the test statistic and the p-value. |
| The data is from McDonald <i>et al.</i> (1991) Allozymes and morphometric characters of |
| three species of Mytilus in the Northern and Southern Hemispheres. |
| <i>Marine Biology</i> <b>111</b>: 323-333. |
| </p> |
| <source class="prettyprint"> |
| double[] tillamook = {0.0571, 0.0813, 0.0831, 0.0976, 0.0817, 0.0859, 0.0735, 0.0659, 0.0923, 0.0836}; |
| double[] newport = {0.0873, 0.0662, 0.0672, 0.0819, 0.0749, 0.0649, 0.0835, 0.0725}; |
| double[] petersburg = {0.0974, 0.1352, 0.0817, 0.1016, 0.0968, 0.1064, 0.105}; |
| double[] magadan = {0.1033, 0.0915, 0.0781, 0.0685, 0.0677, 0.0697, 0.0764, 0.0689}; |
| double[] tvarminne = {0.0703, 0.1026, 0.0956, 0.0973, 0.1039, 0.1045}; |
| |
| Collection<double[]> data = Arrays.asList(tillamook, newport, petersburg, magadan, tvarminne); |
| OneWayAnova.Result result = OneWayAnova.withDefaults() |
| .test(data); |
| result.getStatistic(); // 7.12 |
| result.getPValue(); // 2.8e-4 |
| result.reject(0.001); // true |
| </source> |
| <p> |
| The result also provides the between and within group degrees of freedom and the mean |
| squares allowing reporting of the results in a table: |
| </p> |
| <table> |
| <tr><th></th><th>degrees of freedom</th><th>mean square</th><th>F</th><th>p</th></tr> |
| <tr><td>between groups</td><td>4</td><td>0.001113</td><td>7.12</td><td>2.8e-4</td></tr> |
| <tr><td>within groups</td><td>34</td><td>0.000159</td><td></td><td></td></tr> |
| </table> |
| </subsection> |
| </section> |
| <section name="Ranking" id="ranking"> |
| <p> |
| The <code>commons-statistics-ranking</code> module provides rank transformations. |
| </p> |
| <p> |
| The <code>NaturalRanking</code> class provides a ranking based on the natural ordering |
| of floating-point values. Ranks are assigned to the input numbers in ascending order, |
| starting from 1. |
| </p> |
| <source class="prettyprint"> |
| NaturalRanking ranking = new NaturalRanking(); |
| ranking.apply(new double[] {5, 6, 7, 8}); // 1, 2, 3, 4 |
| ranking.apply(new double[] {8, 5, 7, 6}); // 4, 1, 3, 2 |
| </source> |
| <p> |
| The special case of <code>NaN</code> values is handled using the configured |
| <code>NaNStrategy</code>. The default is to raise an exception. |
| </p> |
| <source class="prettyprint"> |
| double[] data = new double[] {6, 5, Double.NaN, 7}; |
| new NaturalRanking().apply(data); // IllegalArgumentException |
| new NaturalRanking(NaNStrategy.MINIMAL).apply(data); // (4, 2, 1, 3) |
| new NaturalRanking(NaNStrategy.MAXIMAL).apply(data); // (3, 1, 4, 2) |
| new NaturalRanking(NaNStrategy.REMOVED).apply(data); // (3, 1, 2) |
| new NaturalRanking(NaNStrategy.FIXED).apply(data); // (3, 1, NaN, 2) |
| new NaturalRanking(NaNStrategy.FAILED).apply(data); // IllegalArgumentException |
| </source> |
| <p> |
| Ties are handled using the configured <code>TiesStrategy</code>. The default is to |
| use an average. |
| </p> |
| <source class="prettyprint"> |
| double[] data = new double[] {7, 5, 7, 6}; |
| new NaturalRanking().apply(data); // (3.5, 1, 3.5, 2) |
| new NaturalRanking(TiesStrategy.SEQUENTIAL).apply(data); // (3, 1, 4, 2) |
| new NaturalRanking(TiesStrategy.MINIMUM).apply(data); // (3, 1, 3, 2) |
| new NaturalRanking(TiesStrategy.MAXIMUM).apply(data); // (4, 1, 4, 2) |
| new NaturalRanking(TiesStrategy.AVERAGE).apply(data); // (3.5, 1, 3.5, 2) |
| new NaturalRanking(TiesStrategy.RANDOM).apply(data); // (3, 1, 4, 2) or (4, 1, 3, 2) |
| </source> |
| <p> |
| The source of randomness defaults to a system supplied generator. The randomness can be |
| provided as an <code>IntUnaryOperator</code> that maps a positive bound <code>n</code> |
| to a random integer in <code>[0, n)</code>. |
| </p> |
| <source class="prettyprint"> |
| double[] data = new double[] {7, 5, 7, 6}; |
| new NaturalRanking(TiesStrategy.RANDOM).apply(data); |
| new NaturalRanking(new SplittableRandom()::nextInt).apply(data); |
| // From Commons RNG |
| UniformRandomProvider rng = RandomSource.KISS.create(); |
| new NaturalRanking(rng::nextInt).apply(data); |
| // ranks: (3, 1, 4, 2) or (4, 1, 3, 2) |
| </source> |
| </section> |
| |
| </body> |
| |
| </document> |