| /* ----------------------------------------------------------------------- */ |
| /** |
| * |
| * @file hypothesis_tests.sql_in |
| * |
| * @brief SQL functions for statistical hypothesis tests |
| * |
| * @sa For an overview of hypthesis-test functions, see the module |
| * description \ref grp_stats_tests. |
| * |
| */ |
| /* ----------------------------------------------------------------------- */ |
| |
| m4_include(`SQLCommon.m4') |
| m4_changequote(<!,!>) |
| |
| /** |
| @addtogroup grp_stats_tests |
| |
| <div class="toc"><b>Contents</b> |
| <ul> |
| <li><a href="#input">Input</a></li> |
| <li><a href="#usage">Usage</a></li> |
| <li><a href="#examples">Examples</a></li> |
| <li><a href="#literature">Literature</a></li> |
| <li><a href="#related">Related Topics</a></li> |
| </ul> |
| </div> |
| |
| @brief Provides functions to perform statistical hypothesis tests. |
| |
| Hypothesis tests are used to confirm or reject a <em>null hypothesis</em> |
| \f$ H_0 \f$ about the distribution of random variables, given realizations of |
| these random variables. Since in general it is not possible to make statements |
| with certainty, one is interested in the probability \f$ p \f$ of seeing random |
| variates at least as extreme as the ones observed, assuming that \f$ H_0 \f$ is |
| true. If this probability \f$ p \f$ is small, \f$ H_0 \f$ will be rejected by |
| the test with <em>significance level</em> \f$ p \f$. Falsifying \f$ H_0 \f$ is |
| the canonic goal when employing a hypothesis test. That is, hypothesis tests are |
| typically used in order to substantiate that instead the <em>alternative |
| hypothesis</em> \f$ H_1 \f$ is true. |
| |
| Hypothesis tests may be divided into parametric and non-parametric tests. A |
| parametric test assumes certain distributions and makes inferences about |
| parameters of the distributions (e.g., the mean of a normal distribution). |
| Formally, there is a given domain of possible parameters \f$ \Gamma \f$ and the |
| null hypothesis \f$ H_0 \f$ is the event that the true parameter |
| \f$ \gamma_0 \in \Gamma_0 \f$, where \f$ \Gamma_0 \subsetneq \Gamma \f$. |
| Non-parametric tests, on the other hand, do not assume any particular |
| distribution of the sample (e.g., a non-parametric test may simply test if two |
| distributions are similar). |
| |
| The first step of a hypothesis test is to compute a <em>test statistic</em>, |
| which is a function of the random variates, i.e., a random variate itself. |
| A hypothesis test relies on the distribution of the test statistic being |
| (approximately) known. Now, the \f$ p \f$-value is the probability of seeing a |
| test statistic at least as extreme as the one observed, assuming that |
| \f$ H_0 \f$ is true. In a case where the null hypothesis corresponds to a family |
| of distributions (e.g., in a parametric test where \f$ \Gamma_0 \f$ is not a |
| singleton set), the \f$ p \f$-value is the supremum, over all possible |
| distributions according to the null hypothesis, of these probabilities. |
| |
| @note Please refer to \ref hypothesis_tests.sql_in for additional technical |
| information on the MADlib implementation of hypothesis tests, and for |
| detailed function signatures for all tests. |
| |
| @anchor input |
| @input |
| |
| Input data is assumed to be normalized with all values stored row-wise. In |
| general, the following inputs are expected. |
| |
| <b>One-sample tests</b> expect the following form: |
| <pre>{TABLE|VIEW} <em>source</em> ( |
| ... |
| <em>value</em> DOUBLE PRECISION |
| ... |
| )</pre> |
| |
| <b>Two-sample tests</b> expect the following form: |
| <pre>{TABLE|VIEW} <em>source</em> ( |
| ... |
| <em>first</em> BOOLEAN, |
| <em>value</em> DOUBLE PRECISION |
| ... |
| )</pre> |
| The \c first column indicates whether a value is from the first sample (if \c TRUE) or the |
| second sample (if \c FALSE). |
| |
| <b>Many-sample tests</b> expect the following form: |
| <pre>{TABLE|VIEW} <em>source</em> ( |
| ... |
| <em>group</em> INTEGER, |
| <em>value</em> DOUBLE PRECISION |
| ... |
| )</pre> |
| |
| @anchor usage |
| @usage |
| |
| All tests are implemented as aggregate functions. The non-parametric |
| (rank-based) tests are implemented as ordered aggregate functions and thus |
| necessitate an <tt>ORDER BY</tt> clause. In the following, the most simple |
| forms of usage are given. Specific function signatures, as described in |
| \ref hypothesis_tests.sql_in, may require more arguments or a different |
| <tt>ORDER BY</tt> clause. |
| |
| - Run a parametric one-sample test: |
| <pre>SELECT <em>test</em>(<em>value</em>) FROM <em>source</em></pre> |
| where '<em>test</em>' can be one of |
| - <tt>t_test_one</tt> (one-sample or dependent paired Student's t-test) |
| - <tt>chi2_gof_test</tt> (Pearson's chi-squared goodness of fit test, also used for chi-squared independence test as shown in example section below) |
| |
| - Run a parametric two-sample/multi-sample test: |
| <pre>SELECT <em>test</em>(<em>first/group</em>, <em>value</em>) FROM <em>source</em></pre> |
| where '<em>test</em>' can be one of |
| - <tt>f_test</tt> (Fisher F-test) |
| - <tt>t_test_two_pooled</tt> (two-sample pooled Student’s t-test, i.e. equal variances) |
| - <tt>t_test_two_unpooled</tt> (two-sample unpooled t-test, i.e., unequal variances, also known as Welch's t-test) |
| - <tt>one_way_anova</tt> (one-way analysis of variance, multi-sample) |
| |
| - Run a non-parametric two-sample/multi-sample test: |
| <pre>SELECT <em>test</em>(<em>first/group</em>, <em>value</em> ORDER BY <em>value</em>) FROM <em>source</em></pre> |
| where '<em>test</em>' can be one of |
| - <tt>ks_test</tt> (Kolmogorov-Smirnov test) |
| - <tt>mw_test</tt> (Mann-Whitney test) |
| - <tt>wsr_test</tt> (Wilcoxon signed-rank test, multi-sample) |
| |
| <b>Note on non-parametric tests:</b> Kolomogov-Smirnov two-sample test is based on the asymptotic theory. |
| The p-value is given by comparing the test statistics with the Kolomogov distribution. |
| The p-value is also adjusted for data with heavy tail distribution, which may give |
| different results than those given by R function's ks.test. See [3] for a detailed explanation. |
| The literature is not unanimous about the definitions of the Wilcoxon rank sum |
| and Mann-Whitney tests. There are two possible definitions for the statistic; |
| MADlib outputs the minimum of the two and uses it for significance testing. This |
| might give different results for both mw_test and wsr_test compared to statistical |
| functions in other popular packages (like R's wilcox.test function). See [4] for |
| a detailed explanation. |
| |
| @anchor examples |
| @examp |
| |
| - <b>One-sample and two-sample t-test</b> (data is subset of mpg data from |
| <a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda352.htm">NIST/SEMATECH</a>) |
| |
| <pre class="example"> |
| -- Load data |
| DROP TABLE IF EXISTS auto83b; |
| CREATE TABLE auto83b ( |
| id SERIAL, |
| mpg_us DOUBLE PRECISION, |
| mpg_j DOUBLE PRECISION |
| ); |
| COPY auto83b (mpg_us, mpg_j) FROM stdin DELIMITER '|'; |
| 18|24 |
| 15|27 |
| 18|27 |
| 16|25 |
| 17|31 |
| 15|35 |
| 14|24 |
| 14|19 |
| 21|31 |
| 10|32 |
| 10|24 |
| 11|26 |
| 9| 9 |
| \\N|32 |
| \\N|37 |
| \\N|38 |
| \\N|34 |
| \\N|34 |
| \\N|32 |
| \\N|33 |
| \\N|32 |
| \\N|25 |
| \\N|24 |
| \\N|37 |
| 13|\\N |
| 12|\\N |
| 18|\\N |
| 21|\\N |
| 19|\\N |
| 21|\\N |
| 15|\\N |
| 16|\\N |
| 15|\\N |
| 11|\\N |
| 20|\\N |
| 21|\\N |
| 19|\\N |
| 15|\\N |
| \\. |
| </pre> |
| |
| <pre class="example"> |
| -- Create table for one sample tests |
| DROP TABLE IF EXISTS auto83b_one_sample; |
| CREATE TABLE auto83b_one_sample AS |
| SELECT mpg_us AS mpg |
| FROM auto83b |
| WHERE mpg_us is not NULL; |
| -- Print table |
| SELECT * FROM auto83b_one_sample; |
| </pre> |
| |
| <pre class="result"> |
| mpg |
| 18 |
| 15 |
| 18 |
| 16 |
| 17 |
| 15 |
| 14 |
| 14 |
| 21 |
| 10 |
| 10 |
| 11 |
| 9 |
| 13 |
| 12 |
| 18 |
| 21 |
| 19 |
| 21 |
| 15 |
| 16 |
| 15 |
| 11 |
| 20 |
| 21 |
| 19 |
| 15 |
| (27 rows) |
| </pre> |
| <pre class="example"> |
| -- Create table for two sample tests |
| DROP TABLE IF EXISTS auto83b_two_sample; |
| CREATE TABLE auto83b_two_sample AS |
| SELECT TRUE AS is_us, mpg_us AS mpg |
| FROM auto83b |
| WHERE mpg_us is not NULL |
| UNION ALL |
| SELECT FALSE, mpg_j |
| FROM auto83b |
| WHERE mpg_j is not NULL; |
| -- Print table |
| SELECT * FROM auto83b_two_sample; |
| </pre> |
| <pre class="result"> |
| is_us | mpg |
| -------+----- |
| t | 18 |
| t | 15 |
| t | 18 |
| t | 16 |
| t | 17 |
| t | 15 |
| t | 14 |
| t | 14 |
| t | 21 |
| t | 10 |
| t | 10 |
| t | 11 |
| t | 9 |
| t | 13 |
| t | 12 |
| t | 18 |
| t | 21 |
| t | 19 |
| t | 21 |
| t | 15 |
| t | 16 |
| t | 15 |
| t | 11 |
| t | 20 |
| t | 21 |
| t | 19 |
| t | 15 |
| f | 24 |
| f | 27 |
| f | 27 |
| f | 25 |
| f | 31 |
| f | 35 |
| f | 24 |
| f | 19 |
| f | 31 |
| f | 32 |
| f | 24 |
| f | 26 |
| f | 9 |
| f | 32 |
| f | 37 |
| f | 38 |
| f | 34 |
| f | 34 |
| f | 32 |
| f | 33 |
| f | 32 |
| f | 25 |
| f | 24 |
| f | 37 |
| (51 rows) |
| </pre> |
| <pre class="example"> |
| -- One sample tests |
| SELECT (madlib.t_test_one(mpg - 20)).* FROM auto83b_one_sample; -- test rejected for mean = 20 |
| </pre> |
| |
| <pre class="result"> |
| statistic | df | p_value_one_sided | p_value_two_sided |
| ------------------+----+-------------------+---------------------- |
| -6.0532478722666 | 26 | 0.999998926789141 | 2.14642171769697e-06 |
| </pre> |
| |
| <pre class="example"> |
| SELECT (madlib.t_test_one(mpg - 15.7)).* FROM auto83b_one_sample; -- test not rejected |
| </pre> |
| |
| <pre class="result"> |
| statistic | df | p_value_one_sided | p_value_two_sided |
| ---------------------+----+-------------------+------------------- |
| 0.00521831713126531 | 26 | 0.497938118950661 | 0.995876237901321 |
| </pre> |
| |
| <pre class="example"> |
| -- Two sample tests |
| SELECT (madlib.t_test_two_pooled(is_us, mpg)).* FROM auto83b_two_sample; |
| </pre> |
| <pre class="result"> |
| statistic | df | p_value_one_sided | p_value_two_sided |
| -------------------+----+-------------------+---------------------- |
| -8.89342267075968 | 49 | 0.999999999995748 | 8.50408632402377e-12 |
| </pre> |
| |
| <pre class="example"> |
| SELECT (madlib.t_test_two_unpooled(is_us, mpg)).* FROM auto83b_two_sample; |
| </pre> |
| |
| <pre class="result"> |
| statistic | df | p_value_one_sided | p_value_two_sided |
| -------------------+------------------+-------------------+---------------------- |
| -8.61746388524314 | 35.1283818346179 | 0.999999999821218 | 3.57563867403599e-10 |
| </pre> |
| |
| - <b>F-Test</b> (Uses same data as above t-test) |
| |
| <pre class="example"> |
| SELECT (madlib.f_test(is_us, mpg)).* FROM auto83b_two_sample; |
| -- Test result indicates that the two distributions have different variances |
| </pre> |
| <pre class="result"> |
| statistic | df1 | df2 | p_value_one_sided | p_value_two_sided |
| -------------------+-----+-----+-------------------+--------------------- |
| 0.311786921089247 | 26 | 23 | 0.997559863672441 | 0.00488027265511803 |
| </pre> |
| |
| - <b>Chi-squared goodness-of-fit test</b> (<a href="http://www.statsdirect.com/help/default.htm#nonparametric_methods/chisq_goodness_fit.htm">Data source</a>) |
| |
| <pre class="example"> |
| CREATE TABLE chi2_test_blood_group ( |
| id SERIAL, |
| blood_group VARCHAR, |
| observed BIGINT, |
| expected DOUBLE PRECISION |
| ); |
| INSERT INTO chi2_test_blood_group(blood_group, observed, expected) VALUES |
| ('O', 67, 82.28), |
| ('A', 83, 84.15), |
| ('B', 29, 14.96), |
| ('AB', 8, 5.61); |
| SELECT (madlib.chi2_gof_test(observed, expected)).* FROM chi2_test_blood_group; |
| </pre> |
| <pre class="result"> |
| statistic | p_value | df | phi | contingency_coef |
| ------------------+----------------------+----+------------------+------------------- |
| 17.0481013341976 | 0.000690824622923826 | 3 | 2.06446732440826 | 0.899977280680593 |
| </pre> |
| |
| - <b>Chi-squared independence test</b> (<a href=http://itl.nist.gov/div898/software/dataplot/refman1/auxillar/chistest.htm>Data source</a>) |
| |
| The Chi-squared independence test uses the Chi-squared goodness-of-fit function, |
| as shown in the example below. The expected value needs to be computed and passed |
| to the goodness-of-fit function. The expected value for MADlib is computed as |
| <em>sum of rows * sum of columns</em>, for each element of the input matrix. |
| For e.g., expected value for element (2,1) would be <em>sum of row 2 * sum of column 1</em>. |
| |
| <pre class="example"> |
| CREATE TABLE chi2_test_friendly ( |
| id_x SERIAL, |
| values INTEGER[] |
| ); |
| INSERT INTO chi2_test_friendly(values) VALUES |
| (array[5, 29, 14, 16]), |
| (array[15, 54, 14, 10]), |
| (array[20, 84, 17, 94]), |
| (array[68, 119, 26, 7]); |
| |
| -- Input table is expected to be unpivoted, so need to pivot it |
| CREATE TABLE chi2_test_friendly_unpivoted AS |
| SELECT id_x, id_y, values[id_y] AS observed |
| FROM |
| chi2_test_friendly, |
| generate_series(1,4) AS id_y; |
| |
| -- Compute Chi-squared independence statistic, by calculating expected value in the SQL and calling the goodness-of-fit function |
| SELECT (madlib.chi2_gof_test(observed, expected, deg_freedom)).* |
| FROM ( |
| -- Compute expected values and degrees of freedom |
| SELECT |
| observed, |
| sum(observed) OVER (PARTITION BY id_x)::DOUBLE PRECISION * |
| sum(observed) OVER (PARTITION BY id_y) AS expected |
| FROM chi2_test_friendly_unpivoted |
| ) p, ( |
| SELECT |
| (count(DISTINCT id_x) - 1) * (count(DISTINCT id_y) - 1) AS deg_freedom |
| FROM chi2_test_friendly_unpivoted |
| ) q; |
| </pre> |
| <pre class="result"> |
| statistic | p_value | df | phi | contingency_coef |
| ------------------+----------------------+----+------------------+------------------- |
| 138.289841626008 | 2.32528678709871e-25 | 9 | 2.93991753313346 | 0.946730727519112 |
| </pre> |
| |
| - <b>ANOVA test</b> (<a href="http://www.itl.nist.gov/div898/handbook/prc/section4/prc433.htm">Data source</a>) |
| |
| <pre class="example"> |
| CREATE TABLE nist_anova_test ( |
| id SERIAL, |
| resistance FLOAT8[] |
| ); |
| INSERT INTO nist_anova_test(resistance) VALUES |
| (array[6.9,8.3,8.0]), |
| (array[5.4,6.8,10.5]), |
| (array[5.8,7.8,8.1]), |
| (array[4.6,9.2,6.9]), |
| (array[4.0,6.5,9.3]); |
| |
| SELECT (madlib.one_way_anova(level, value)).* FROM ( |
| SELECT level, resistance[level] AS value |
| FROM |
| nist_anova_test, (SELECT * FROM generate_series(1,3) level) q1 |
| ) q2; |
| </pre> |
| <pre class="result"> |
| sum_squares_between | sum_squares_within | df_between | df_within | mean_squares_between | mean_squares_within | statistic | p_value |
| ---------------------+--------------------+------------+-----------+----------------------+---------------------+------------------+-------------------- |
| 27.8973333333333 | 17.452 | 2 | 12 | 13.9486666666667 | 1.45433333333333 | 9.59110703644281 | 0.0032482226008593 |
| </pre> |
| |
| |
| - <b>Kolmogorov-Smirnov test</b> (<a href="http://www.physics.csbsju.edu/stats/KS-test.html">Data source</a>) |
| |
| <pre class="example"> |
| CREATE TABLE ks_sample_1 AS |
| SELECT |
| TRUE AS first, |
| unnest(ARRAY[0.22, -0.87, -2.39, -1.79, 0.37, -1.54, 1.28, -0.31, -0.74, 1.72, 0.38, -0.17, -0.62, -1.10, 0.30, 0.15, 2.30, 0.19, -0.50, -0.09]) AS value |
| UNION ALL |
| SELECT |
| FALSE, |
| unnest(ARRAY[-5.13, -2.19, -2.43, -3.83, 0.50, -3.25, 4.32, 1.63, 5.18, -0.43, 7.11, 4.87, -3.10, -5.81, 3.76, 6.31, 2.58, 0.07, 5.76, 3.50]); |
| |
| SELECT (madlib.ks_test(first, value, |
| (SELECT count(value) FROM ks_sample_1 WHERE first), |
| (SELECT count(value) FROM ks_sample_1 WHERE NOT first) |
| ORDER BY value)).* |
| FROM ks_sample_1; |
| </pre> |
| <pre class="result"> |
| statistic | k_statistic | p_value |
| -----------+-----------------+-------------------- |
| 0.45 | 1.4926782214936 | 0.0232132758544496 |
| </pre> |
| |
| - <b>Mann-Whitney test</b> (use same data as t-test) |
| |
| <pre class="example"> |
| SELECT (madlib.mw_test(is_us, mpg ORDER BY mpg)).* from auto83b_two_sample; |
| -- Note first parameter above is BOOLEAN |
| </pre> |
| <pre class="result"> |
| statistic | u_statistic | p_value_one_sided | p_value_two_sided |
| -------------------+-------------+-------------------+---------------------- |
| -5.50097925755249 | 32.5 | 0.999999981115618 | 3.77687645883758e-08 |
| </pre> |
| |
| - <b>Wilcoxon signed-rank test</b> |
| |
| <pre class="example"> |
| DROP TABLE IF EXISTS test_wsr; |
| CREATE TABLE test_wsr ( |
| x DOUBLE PRECISION, |
| y DOUBLE PRECISION |
| ); |
| COPY test_wsr (x, y) FROM stdin DELIMITER '|'; |
| 0.32|0.39 |
| 0.4|0.47 |
| 0.11|0.11 |
| 0.47|0.43 |
| 0.32|0.42 |
| 0.35|0.3 |
| 0.32|0.43 |
| 0.63|0.98 |
| 0.5|0.86 |
| 0.6|0.79 |
| 0.38|0.33 |
| 0.46|0.45 |
| 0.2|0.22 |
| 0.31|0.3 |
| 0.62|0.6 |
| 0.52|0.53 |
| 0.77|0.85 |
| 0.23|0.21 |
| 0.3|0.33 |
| 0.7|0.57 |
| 0.41|0.43 |
| 0.53|0.49 |
| 0.19|0.2 |
| 0.31|0.35 |
| 0.48|0.4 |
| \\. |
| \s |
| SELECT (madlib.wsr_test( |
| x - y, |
| 2 * 2^(-52) * greatest(x,y) |
| ORDER BY abs(x - y) |
| )).* |
| FROM test_wsr; |
| </pre> |
| <pre class="result"> |
| statistic | rank_sum_pos | rank_sum_neg | num | z_statistic | p_value_one_sided | p_value_two_sided |
| -----------+--------------+--------------+-----+-------------------+-------------------+------------------- |
| 105.5 | 105.5 | 194.5 | 24 | -1.27318365656729 | 0.898523560667509 | 0.202952878664983 |
| </pre> |
| |
| |
| @anchor literature |
| @literature |
| |
| [1] M. Hollander, D. Wolfe: <em>Nonparametric Statistical Methods</em>, |
| 2nd edition, Wiley, 1999 |
| |
| [2] E. Lehmann, J. Romano: <em>Testing Statistical Hypotheses</em>, 3rd edition, |
| Springer, 2005 |
| |
| [3] M. Stephens: <em>Use of the Kolmogorov-Smirnov, Cramer-Von Mises and related statistics without extensive tables</em>, Journal of the Royal Statistical Society. Series B (Methodological) (1970): 115-122. |
| |
| [4] Wikipedia: Mann–Whitney U test calculation, http://en.wikipedia.org/wiki/Mann-Whitney_test#Calculations |
| |
| @anchor related |
| @par Related Topics |
| |
| File hypothesis_tests.sql_in documenting the SQL functions. |
| */ |
| |
| DROP TYPE IF EXISTS MADLIB_SCHEMA.t_test_result CASCADE; |
| CREATE TYPE MADLIB_SCHEMA.t_test_result AS ( |
| statistic DOUBLE PRECISION, |
| df DOUBLE PRECISION, |
| p_value_one_sided DOUBLE PRECISION, |
| p_value_two_sided DOUBLE PRECISION |
| ); |
| |
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_one_transition( |
| state DOUBLE PRECISION[], |
| value DOUBLE PRECISION |
| ) RETURNS DOUBLE PRECISION[] |
| AS 'MODULE_PATHNAME' |
| LANGUAGE C IMMUTABLE STRICT |
| m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>); |
| |
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_merge_states( |
| state1 DOUBLE PRECISION[], |
| state2 DOUBLE PRECISION[]) |
| RETURNS DOUBLE PRECISION[] |
| AS 'MODULE_PATHNAME' |
| LANGUAGE C IMMUTABLE STRICT |
| m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>); |
| |
| |
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_one_final( |
| state DOUBLE PRECISION[]) |
| RETURNS MADLIB_SCHEMA.t_test_result |
| AS 'MODULE_PATHNAME' |
| LANGUAGE C IMMUTABLE STRICT |
| m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>); |
| |
| DROP TYPE IF EXISTS MADLIB_SCHEMA.f_test_result CASCADE; |
| CREATE TYPE MADLIB_SCHEMA.f_test_result AS ( |
| statistic DOUBLE PRECISION, |
| df1 DOUBLE PRECISION, |
| df2 DOUBLE PRECISION, |
| p_value_one_sided DOUBLE PRECISION, |
| p_value_two_sided DOUBLE PRECISION |
| ); |
| |
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.f_test_final( |
| state DOUBLE PRECISION[]) |
| RETURNS MADLIB_SCHEMA.f_test_result |
| AS 'MODULE_PATHNAME' |
| LANGUAGE C IMMUTABLE STRICT |
| m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>); |
| |
| /** |
| * @brief Perform one-sample or dependent paired Student t-test |
| * |
| * Given realizations \f$ x_1, \dots, x_n \f$ of i.i.d. random variables |
| * \f$ X_1, \dots, X_n \sim N(\mu, \sigma^2) \f$ with unknown parameters \f$ \mu \f$ and |
| * \f$ \sigma^2 \f$, test the null hypotheses \f$ H_0 : \mu \leq 0 \f$ and |
| * \f$ H_0 : \mu = 0 \f$. |
| * |
| * @param value Value of random variate \f$ x_i \f$ |
| * |
| * @return A composite value as follows. We denote by \f$ \bar x \f$ the |
| * sample mean and by \f$ s^2 \f$ the |
| * sample variance. |
| * - <tt>statistic FLOAT8</tt> - Statistic |
| * \f[ |
| * t = \frac{\sqrt n \cdot \bar x}{s} |
| * \f] |
| * The corresponding random |
| * variable is Student-t distributed with |
| * \f$ (n - 1) \f$ degrees of freedom. |
| * - <tt>df FLOAT8</tt> - Degrees of freedom \f$ (n - 1) \f$ |
| * - <tt>p_value_one_sided FLOAT8</tt> - Lower bound on one-sided p-value. |
| * In detail, the result is \f$ \Pr[\bar X \geq \bar x \mid \mu = 0] \f$, |
| * which is a lower bound on |
| * \f$ \Pr[\bar X \geq \bar x \mid \mu \leq 0] \f$. Computed as |
| * <tt>(1.0 - \ref students_t_cdf "students_t_cdf"(statistic))</tt>. |
| * - <tt>p_value_two_sided FLOAT8</tt> - Two-sided p-value, i.e., |
| * \f$ \Pr[ |\bar X| \geq |\bar x| \mid \mu = 0] \f$. Computed as |
| * <tt>(2 * \ref students_t_cdf "students_t_cdf"(-abs(statistic)))</tt>. |
| * |
| * @usage |
| * - One-sample t-test: Test null hypothesis that the mean of a sample is at |
| * most (or equal to, respectively) \f$ \mu_0 \f$: |
| * <pre>SELECT (t_test_one(<em>value</em> - <em>mu_0</em>)).* FROM <em>source</em></pre> |
| * - Dependent paired t-test: Test null hypothesis that the mean difference |
| * between the first and second value in each pair is at most (or equal to, |
| * respectively) \f$ \mu_0 \f$: |
| * <pre>SELECT (t_test_one(<em>first</em> - <em>second</em> - <em>mu_0</em>)).* |
| * FROM <em>source</em></pre> |
| */ |
| CREATE AGGREGATE MADLIB_SCHEMA.t_test_one( |
| /*+ value */ DOUBLE PRECISION) ( |
| |
| SFUNC=MADLIB_SCHEMA.t_test_one_transition, |
| STYPE=DOUBLE PRECISION[], |
| FINALFUNC=MADLIB_SCHEMA.t_test_one_final, |
| m4_ifdef(<!__POSTGRESQL__!>, <!!>, <!PREFUNC=MADLIB_SCHEMA.t_test_merge_states,!>) |
| INITCOND='{0,0,0,0,0,0,0}' |
| ); |
| |
| |
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_two_transition( |
| state DOUBLE PRECISION[], |
| "first" BOOLEAN, |
| "value" DOUBLE PRECISION) |
| RETURNS DOUBLE PRECISION[] |
| AS 'MODULE_PATHNAME' |
| LANGUAGE C IMMUTABLE STRICT |
| m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>); |
| |
| |
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_two_pooled_final( |
| state DOUBLE PRECISION[]) |
| RETURNS MADLIB_SCHEMA.t_test_result |
| AS 'MODULE_PATHNAME' |
| LANGUAGE C IMMUTABLE STRICT |
| m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>); |
| /** |
| * @brief Perform two-sample pooled (i.e., equal variances) Student t-test |
| * |
| * Given realizations \f$ x_1, \dots, x_n \f$ and \f$ y_1, \dots, y_m \f$ of |
| * i.i.d. random variables \f$ X_1, \dots, X_n \sim N(\mu_X, \sigma^2) \f$ and |
| * \f$ Y_1, \dots, Y_m \sim N(\mu_Y, \sigma^2) \f$ with unknown parameters |
| * \f$ \mu_X, \mu_Y, \f$ and \f$ \sigma^2 \f$, test the null hypotheses |
| * \f$ H_0 : \mu_X \leq \mu_Y \f$ and \f$ H_0 : \mu_X = \mu_Y \f$. |
| * |
| * @param first Indicator whether \c value is from first sample |
| * \f$ x_1, \dots, x_n \f$ (if \c TRUE) or from second sample |
| * \f$ y_1, \dots, y_m \f$ (if \c FALSE) |
| * @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$ |
| * |
| * @return A composite value as follows. We denote by \f$ \bar x, \bar y \f$ |
| * the sample means and by \f$ s_X^2, s_Y^2 \f$ the |
| * sample variances. |
| * - <tt>statistic FLOAT8</tt> - Statistic |
| * \f[ |
| * t = \frac{\bar x - \bar y}{s_p \sqrt{1/n + 1/m}} |
| * \f] |
| * where |
| * \f[ |
| * s_p^2 = \frac{\sum_{i=1}^n (x_i - \bar x)^2 |
| * + \sum_{i=1}^m (y_i - \bar y)^2} |
| * {n + m - 2} |
| * \f] |
| * is the <em>pooled variance</em>. |
| * The corresponding random |
| * variable is Student-t distributed with |
| * \f$ (n + m - 2) \f$ degrees of freedom. |
| * - <tt>df FLOAT8</tt> - Degrees of freedom \f$ (n + m - 2) \f$ |
| * - <tt>p_value_one_sided FLOAT8</tt> - Lower bound on one-sided p-value. |
| * In detail, the result is \f$ \Pr[\bar X - \bar Y \geq \bar x - \bar y \mid \mu_X = \mu_Y] \f$, |
| * which is a lower bound on |
| * \f$ \Pr[\bar X - \bar Y \geq \bar x - \bar y \mid \mu_X \leq \mu_Y] \f$. |
| * Computed as |
| * <tt>(1.0 - \ref students_t_cdf "students_t_cdf"(statistic))</tt>. |
| * - <tt>p_value_two_sided FLOAT8</tt> - Two-sided p-value, i.e., |
| * \f$ \Pr[ |\bar X - \bar Y| \geq |\bar x - \bar y| \mid \mu_X = \mu_Y] \f$. |
| * Computed as |
| * <tt>(2 * \ref students_t_cdf "students_t_cdf"(-abs(statistic)))</tt>. |
| * |
| * @usage |
| * - Two-sample pooled t-test: Test null hypothesis that the mean of the first |
| * sample is at most (or equal to, respectively) the mean of the second |
| * sample: |
| * <pre>SELECT (t_test_pooled(<em>first</em>, <em>value</em>)).* FROM <em>source</em></pre> |
| */ |
| CREATE AGGREGATE MADLIB_SCHEMA.t_test_two_pooled( |
| /*+ "first" */ BOOLEAN, |
| /*+ "value" */ DOUBLE PRECISION) ( |
| |
| SFUNC=MADLIB_SCHEMA.t_test_two_transition, |
| STYPE=DOUBLE PRECISION[], |
| FINALFUNC=MADLIB_SCHEMA.t_test_two_pooled_final, |
| m4_ifdef(<!__POSTGRESQL__!>, <!!>, <!PREFUNC=MADLIB_SCHEMA.t_test_merge_states,!>) |
| INITCOND='{0,0,0,0,0,0,0}' |
| ); |
| |
| |
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_two_unpooled_final( |
| state DOUBLE PRECISION[]) |
| RETURNS MADLIB_SCHEMA.t_test_result |
| AS 'MODULE_PATHNAME' |
| LANGUAGE C IMMUTABLE STRICT |
| m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>); |
| /** |
| * @brief Perform unpooled (i.e., unequal variances) t-test (also known as |
| * Welch's t-test) |
| * |
| * Given realizations \f$ x_1, \dots, x_n \f$ and \f$ y_1, \dots, y_m \f$ of |
| * i.i.d. random variables \f$ X_1, \dots, X_n \sim N(\mu_X, \sigma_X^2) \f$ and |
| * \f$ Y_1, \dots, Y_m \sim N(\mu_Y, \sigma_Y^2) \f$ with unknown parameters |
| * \f$ \mu_X, \mu_Y, \sigma_X^2, \f$ and \f$ \sigma_Y^2 \f$, test the null |
| * hypotheses \f$ H_0 : \mu_X \leq \mu_Y \f$ and \f$ H_0 : \mu_X = \mu_Y \f$. |
| * |
| * @param first Indicator whether \c value is from first sample |
| * \f$ x_1, \dots, x_n \f$ (if \c TRUE) or from second sample |
| * \f$ y_1, \dots, y_m \f$ (if \c FALSE) |
| * @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$ |
| * |
| * @return A composite value as follows. We denote by \f$ \bar x, \bar y \f$ |
| * the sample means and by \f$ s_X^2, s_Y^2 \f$ the |
| * sample variances. |
| * - <tt>statistic FLOAT8</tt> - Statistic |
| * \f[ |
| * t = \frac{\bar x - \bar y}{\sqrt{s_X^2/n + s_Y^2/m}} |
| * \f] |
| * The corresponding random variable is approximately Student-t distributed |
| * with |
| * \f[ |
| * \frac{(s_X^2 / n + s_Y^2 / m)^2}{(s_X^2 / n)^2/(n-1) + (s_Y^2 / m)^2/(m-1)} |
| * \f] |
| * degrees of freedom (Welch–Satterthwaite formula). |
| * - <tt>df FLOAT8</tt> - Degrees of freedom (as above) |
| * - <tt>p_value_one_sided FLOAT8</tt> - Lower bound on one-sided p-value. |
| * In detail, the result is \f$ \Pr[\bar X - \bar Y \geq \bar x - \bar y \mid \mu_X = \mu_Y] \f$, |
| * which is a lower bound on |
| * \f$ \Pr[\bar X - \bar Y \geq \bar x - \bar y \mid \mu_X \leq \mu_Y] \f$. |
| * Computed as |
| * <tt>(1.0 - \ref students_t_cdf "students_t_cdf"(statistic))</tt>. |
| * - <tt>p_value_two_sided FLOAT8</tt> - Two-sided p-value, i.e., |
| * \f$ \Pr[ |\bar X - \bar Y| \geq |\bar x - \bar y| \mid \mu_X = \mu_Y] \f$. |
| * Computed as |
| * <tt>(2 * \ref students_t_cdf "students_t_cdf"(-abs(statistic)))</tt>. |
| * |
| * @usage |
| * - Two-sample unpooled t-test: Test null hypothesis that the mean of the |
| * first sample is at most (or equal to, respectively) the mean of the second |
| * sample: |
| * <pre>SELECT (t_test_unpooled(<em>first</em>, <em>value</em>)).* FROM <em>source</em></pre> |
| */ |
| CREATE AGGREGATE MADLIB_SCHEMA.t_test_two_unpooled( |
| /*+ "first" */ BOOLEAN, |
| /*+ "value" */ DOUBLE PRECISION) ( |
| |
| SFUNC=MADLIB_SCHEMA.t_test_two_transition, |
| STYPE=DOUBLE PRECISION[], |
| FINALFUNC=MADLIB_SCHEMA.t_test_two_unpooled_final, |
| m4_ifdef(<!__POSTGRESQL__!>, <!!>, <!PREFUNC=MADLIB_SCHEMA.t_test_merge_states,!>) |
| INITCOND='{0,0,0,0,0,0,0}' |
| ); |
| |
| /** |
| * @brief Perform Fisher F-test |
| * |
| * Given realizations \f$ x_1, \dots, x_m \f$ and \f$ y_1, \dots, y_n \f$ of |
| * i.i.d. random variables \f$ X_1, \dots, X_m \sim N(\mu_X, \sigma^2) \f$ and |
| * \f$ Y_1, \dots, Y_n \sim N(\mu_Y, \sigma^2) \f$ with unknown parameters |
| * \f$ \mu_X, \mu_Y, \f$ and \f$ \sigma^2 \f$, test the null hypotheses |
| * \f$ H_0 : \sigma_X < \sigma_Y \f$ and \f$ H_0 : \sigma_X = \sigma_Y \f$. |
| * |
| * @param first Indicator whether \c value is from first sample |
| * \f$ x_1, \dots, x_m \f$ (if \c TRUE) or from second sample |
| * \f$ y_1, \dots, y_n \f$ (if \c FALSE) |
| * @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$ |
| * |
| * @return A composite value as follows. We denote by \f$ \bar x, \bar y \f$ |
| * the sample means and by \f$ s_X^2, s_Y^2 \f$ the |
| * sample variances. |
| * - <tt>statistic FLOAT8</tt> - Statistic |
| * \f[ |
| * f = \frac{s_Y^2}{s_X^2} |
| * \f] |
| * The corresponding random |
| * variable is F-distributed with |
| * \f$ (n - 1) \f$ degrees of freedom in the numerator and |
| * \f$ (m - 1) \f$ degrees of freedom in the denominator. |
| * - <tt>df1 BIGINT</tt> - Degrees of freedom in the numerator \f$ (n - 1) \f$ |
| * - <tt>df2 BIGINT</tt> - Degrees of freedom in the denominator \f$ (m - 1) \f$ |
| * - <tt>p_value_one_sided FLOAT8</tt> - Lower bound on one-sided p-value. |
| * In detail, the result is \f$ \Pr[F \geq f \mid \sigma_X = \sigma_Y] \f$, |
| * which is a lower bound on |
| * \f$ \Pr[F \geq f \mid \sigma_X \leq \sigma_Y] \f$. Computed as |
| * <tt>(1.0 - \ref fisher_f_cdf "fisher_f_cdf"(statistic))</tt>. |
| * - <tt>p_value_two_sided FLOAT8</tt> - Two-sided p-value, i.e., |
| * \f$ 2 \cdot \min \{ p, 1 - p \} \f$ where |
| * \f$ p = \Pr[ F \geq f \mid \sigma_X = \sigma_Y] \f$. Computed as |
| * <tt>(min(p_value_one_sided, 1. - p_value_one_sided))</tt>. |
| * |
| * @usage |
| * - Test null hypothesis that the variance of the first sample is at most (or |
| * equal to, respectively) the variance of the second sample: |
| * <pre>SELECT (f_test(<em>first</em>, <em>value</em>)).* FROM <em>source</em></pre> |
| * |
| * @internal We reuse the two-sample t-test transition and merge functions. |
| */ |
| CREATE AGGREGATE MADLIB_SCHEMA.f_test( |
| /*+ "first" */ BOOLEAN, |
| /*+ "value" */ DOUBLE PRECISION) ( |
| |
| SFUNC=MADLIB_SCHEMA.t_test_two_transition, |
| STYPE=DOUBLE PRECISION[], |
| FINALFUNC=MADLIB_SCHEMA.f_test_final, |
| m4_ifdef(<!__POSTGRESQL__!>, <!!>, <!PREFUNC=MADLIB_SCHEMA.t_test_merge_states,!>) |
| INITCOND='{0,0,0,0,0,0,0}' |
| ); |
| |
| |
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.chi2_gof_test_transition( |
| state DOUBLE PRECISION[], |
| observed BIGINT, |
| expected DOUBLE PRECISION, |
| df BIGINT |
| ) RETURNS DOUBLE PRECISION[] |
| AS 'MODULE_PATHNAME' |
| LANGUAGE C IMMUTABLE STRICT |
| m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>); |
| |
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.chi2_gof_test_transition( |
| state DOUBLE PRECISION[], |
| observed BIGINT, |
| expected DOUBLE PRECISION |
| ) RETURNS DOUBLE PRECISION[] |
| AS 'MODULE_PATHNAME' |
| LANGUAGE C IMMUTABLE STRICT |
| m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>); |
| |
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.chi2_gof_test_transition( |
| state DOUBLE PRECISION[], |
| observed BIGINT |
| ) RETURNS DOUBLE PRECISION[] |
| AS 'MODULE_PATHNAME' |
| LANGUAGE C IMMUTABLE STRICT |
| m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>); |
| |
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.chi2_gof_test_merge_states( |
| state1 DOUBLE PRECISION[], |
| state2 DOUBLE PRECISION[]) |
| RETURNS DOUBLE PRECISION[] |
| AS 'MODULE_PATHNAME' |
| LANGUAGE C IMMUTABLE STRICT |
| m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>); |
| |
| DROP TYPE IF EXISTS MADLIB_SCHEMA.chi2_test_result CASCADE; |
| CREATE TYPE MADLIB_SCHEMA.chi2_test_result AS ( |
| statistic DOUBLE PRECISION, |
| p_value DOUBLE PRECISION, |
| df BIGINT, |
| phi DOUBLE PRECISION, |
| contingency_coef DOUBLE PRECISION |
| ); |
| |
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.chi2_gof_test_final( |
| state DOUBLE PRECISION[] |
| ) RETURNS MADLIB_SCHEMA.chi2_test_result |
| AS 'MODULE_PATHNAME' |
| LANGUAGE C IMMUTABLE STRICT |
| m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>); |
| |
| /** |
| * @brief Perform Pearson's chi-squared goodness-of-fit test |
| * |
| * Let \f$ n_1, \dots, n_k \f$ be a realization of a (vector) random variable |
| * \f$ N = (N_1, \dots, N_k) \f$ that follows the multinomial distribution with |
| * parameters \f$ k \f$ and \f$ p = (p_1, \dots, p_k) \f$. Test the null |
| * hypothesis \f$ H_0 : p = p^0 \f$. |
| * |
| * @param observed Number \f$ n_i \f$ of observations of the current event/row |
| * @param expected Expected number of observations of current event/row. This |
| * number is not required to be normalized. That is, \f$ p^0_i \f$ will be |
| * taken as \c expected divided by <tt>sum(expected)</tt>. Hence, if this |
| * parameter is not specified, chi2_test() will by default use |
| * \f$ p^0 = (\frac 1k, \dots, \frac 1k) \f$, i.e., test that \f$ p \f$ is a |
| * discrete uniform distribution. |
| * @param df Degrees of freedom. This is the number of events reduced by the |
| * degree of freedom lost by using the observed numbers for defining the |
| * expected number of observations. If this parameter is 0, the degree |
| * of freedom is taken as \f$ (k - 1) \f$. |
| * |
| * @return A composite value as follows. Let \f$ n = \sum_{i=1}^n n_i \f$. |
| * - <tt>statistic FLOAT8</tt> - Statistic |
| * \f[ |
| * \chi^2 = \sum_{i=1}^k \frac{(n_i - np_i)^2}{np_i} |
| * \f] |
| * The corresponding random |
| * variable is approximately chi-squared distributed with |
| * \c df degrees of freedom. |
| * - <tt>df BIGINT</tt> - Degrees of freedom |
| * - <tt>p_value FLOAT8</tt> - Approximate p-value, i.e., |
| * \f$ \Pr[X^2 \geq \chi^2 \mid p = p^0] \f$. Computed as |
| * <tt>(1.0 - \ref chi_squared_cdf "chi_squared_cdf"(statistic))</tt>. |
| * - <tt>phi FLOAT8</tt> - Phi coefficient, i.e., |
| * \f$ \phi = \sqrt{\frac{\chi^2}{n}} \f$ |
| * - <tt>contingency_coef FLOAT8</tt> - Contingency coefficient, i.e., |
| * \f$ \sqrt{\frac{\chi^2}{n + \chi^2}} \f$ |
| * |
| * @usage |
| * - Test null hypothesis that all possible outcomes of a categorical variable |
| * are equally likely: |
| * <pre>SELECT (chi2_gof_test(<em>observed</em>, 1, NULL)).* FROM <em>source</em></pre> |
| * - Test null hypothesis that two categorical variables are independent. |
| * Such data is often shown in a <em>contingency table</em> (also known as |
| * \em crosstab). A crosstab is a matrix where possible values for the first |
| * variable correspond to rows and values for the second variable to |
| * columns. The matrix elements are the observation frequencies of the |
| * joint occurrence of the respective values. |
| * chi2_gof_test() assumes that the crosstab is stored in normalized form, |
| * i.e., there are three columns <tt><em>var1</em></tt>, |
| * <tt><em>var2</em></tt>, <tt><em>observed</em></tt>. |
| * <pre>SELECT (chi2_gof_test(<em>observed</em>, expected, deg_freedom)).* |
| *FROM ( |
| * SELECT |
| * <em>observed</em>, |
| * sum(<em>observed</em>) OVER (PARTITION BY var1)::DOUBLE PRECISION |
| * * sum(<em>observed</em>) OVER (PARTITION BY var2) AS expected |
| * FROM <em>source</em> |
| *) p, ( |
| * SELECT |
| * (count(DISTINCT <em>var1</em>) - 1) * (count(DISTINCT <em>var2</em>) - 1) AS deg_freedom |
| * FROM <em>source</em> |
| *) q;</pre> |
| */ |
| CREATE AGGREGATE MADLIB_SCHEMA.chi2_gof_test( |
| /*+ observed */ BIGINT, |
| /*+ expected */ DOUBLE PRECISION /*+ DEFAULT 1 */, |
| /*+ df */ BIGINT /*+ DEFAULT 0 */ |
| ) ( |
| SFUNC=MADLIB_SCHEMA.chi2_gof_test_transition, |
| STYPE=DOUBLE PRECISION[], |
| FINALFUNC=MADLIB_SCHEMA.chi2_gof_test_final, |
| m4_ifdef(<!__POSTGRESQL__!>, <!!>, <!PREFUNC=MADLIB_SCHEMA.chi2_gof_test_merge_states,!>) |
| INITCOND='{0,0,0,0,0,0}' |
| ); |
| |
| CREATE AGGREGATE MADLIB_SCHEMA.chi2_gof_test( |
| /*+ observed */ BIGINT, |
| /*+ expected */ DOUBLE PRECISION |
| ) ( |
| SFUNC=MADLIB_SCHEMA.chi2_gof_test_transition, |
| STYPE=DOUBLE PRECISION[], |
| FINALFUNC=MADLIB_SCHEMA.chi2_gof_test_final, |
| m4_ifdef(<!__POSTGRESQL__!>, <!!>, <!PREFUNC=MADLIB_SCHEMA.chi2_gof_test_merge_states,!>) |
| INITCOND='{0,0,0,0,0,0,0}' |
| ); |
| |
| CREATE AGGREGATE MADLIB_SCHEMA.chi2_gof_test( |
| /*+ observed */ BIGINT |
| ) ( |
| SFUNC=MADLIB_SCHEMA.chi2_gof_test_transition, |
| STYPE=DOUBLE PRECISION[], |
| FINALFUNC=MADLIB_SCHEMA.chi2_gof_test_final, |
| m4_ifdef(<!__POSTGRESQL__!>, <!!>, <!PREFUNC=MADLIB_SCHEMA.chi2_gof_test_merge_states,!>) |
| INITCOND='{0,0,0,0,0,0,0}' |
| ); |
| |
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.ks_test_transition( |
| state DOUBLE PRECISION[], |
| "first" BOOLEAN, |
| "value" DOUBLE PRECISION, |
| "numFirst" BIGINT, |
| "numSecond" BIGINT |
| ) RETURNS DOUBLE PRECISION[] |
| AS 'MODULE_PATHNAME' |
| LANGUAGE C IMMUTABLE STRICT |
| m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>); |
| |
| DROP TYPE IF EXISTS MADLIB_SCHEMA.ks_test_result CASCADE; |
| CREATE TYPE MADLIB_SCHEMA.ks_test_result AS ( |
| statistic DOUBLE PRECISION, |
| k_statistic DOUBLE PRECISION, |
| p_value DOUBLE PRECISION |
| ); |
| |
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.ks_test_final( |
| state DOUBLE PRECISION[]) |
| RETURNS MADLIB_SCHEMA.ks_test_result |
| AS 'MODULE_PATHNAME' |
| LANGUAGE C IMMUTABLE STRICT |
| m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>); |
| |
| /** |
| * @brief Perform Kolmogorov-Smirnov test |
| * |
| * Given realizations \f$ x_1, \dots, x_m \f$ and \f$ y_1, \dots, y_m \f$ of |
| * i.i.d. random variables \f$ X_1, \dots, X_m \f$ and i.i.d. |
| * \f$ Y_1, \dots, Y_n \f$, respectively, test the null hypothesis that the |
| * underlying distributions function \f$ F_X, F_Y \f$ are identical, i.e., |
| * \f$ H_0 : F_X = F_Y \f$. |
| * |
| * @param first Determines whether the value belongs to the first |
| * (if \c TRUE) or the second sample (if \c FALSE) |
| * @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$ |
| * @param m Size \f$ m \f$ of the first sample. See usage instructions below. |
| * @param n Size of the second sample. See usage instructions below. |
| * |
| * @return A composite value. |
| * - <tt>statistic FLOAT8</tt> - Kolmogorov–Smirnov statistic |
| * \f[ |
| * d = \max_{t \in \mathbb R} |F_x(t) - F_y(t)| |
| * \f] |
| * where \f$ F_x(t) := \frac 1m |\{ i \mid x_i \leq t \}| \f$ and |
| * \f$ F_y \f$ (defined likewise) are the empirical distribution functions. |
| * - <tt>k_statistic FLOAT8</tt> - Kolmogorov statistic |
| * \f$ |
| * k = (r + 0.12 + \frac{0.11}{r}) \cdot d |
| * \f$ |
| * where |
| * \f$ |
| * r = \sqrt{\frac{m n}{m+n}}. |
| * \f$ |
| * and \f$ d \f$ is the statistic. |
| * Then \f$ k \f$ is approximately Kolmogorov distributed. |
| * - <tt>p_value FLOAT8</tt> - Approximate p-value, i.e., an approximate value |
| * for \f$ \Pr[D \geq d \mid F_X = F_Y] \f$. Computed as |
| * <tt>(1.0 - \ref kolmogorov_cdf "kolmogorov_cdf"(k_statistic))</tt>. |
| * |
| * @usage |
| * - Test null hypothesis that two samples stem from the same distribution: |
| * <pre>SELECT (ks_test(<em>first</em>, <em>value</em>, |
| * (SELECT count(<em>value</em>) FROM <em>source</em> WHERE <em>first</em>), |
| * (SELECT count(<em>value</em>) FROM <em>source</em> WHERE NOT <em>first</em>) |
| * ORDER BY <em>value</em> |
| *)).* FROM <em>source</em></pre> |
| * |
| * @note |
| * This aggregate must be used as an ordered aggregate |
| * (<tt>ORDER BY \em value</tt>) and will raise an exception if values are |
| * not ordered. |
| */ |
| m4_ifdef(<!__HAS_ORDERED_AGGREGATES__!>,<! |
| CREATE |
| m4_ifdef(<!__POSTGRESQL__!>, <!!>, <!ORDERED!>) |
| AGGREGATE MADLIB_SCHEMA.ks_test( |
| /*+ "first" */ BOOLEAN, |
| /*+ "value" */ DOUBLE PRECISION, |
| /*+ m */ BIGINT, |
| /*+ n */ BIGINT |
| ) ( |
| SFUNC=MADLIB_SCHEMA.ks_test_transition, |
| STYPE=DOUBLE PRECISION[], |
| FINALFUNC=MADLIB_SCHEMA.ks_test_final, |
| INITCOND='{0,0,0,0,0,0,0}' |
| ); |
| !>) |
| |
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.mw_test_transition( |
| state DOUBLE PRECISION[], |
| "first" BOOLEAN, |
| "value" DOUBLE PRECISION |
| ) RETURNS DOUBLE PRECISION[] |
| AS 'MODULE_PATHNAME' |
| LANGUAGE C IMMUTABLE STRICT |
| m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>); |
| |
| DROP TYPE IF EXISTS MADLIB_SCHEMA.mw_test_result CASCADE; |
| CREATE TYPE MADLIB_SCHEMA.mw_test_result AS ( |
| statistic DOUBLE PRECISION, |
| u_statistic DOUBLE PRECISION, |
| p_value_one_sided DOUBLE PRECISION, |
| p_value_two_sided DOUBLE PRECISION |
| ); |
| |
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.mw_test_final( |
| state DOUBLE PRECISION[]) |
| RETURNS MADLIB_SCHEMA.mw_test_result |
| AS 'MODULE_PATHNAME' |
| LANGUAGE C IMMUTABLE STRICT |
| m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>); |
| /** |
| * @brief Perform Mann-Whitney test |
| * |
| * Given realizations \f$ x_1, \dots, x_m \f$ and \f$ y_1, \dots, y_m \f$ of |
| * i.i.d. random variables \f$ X_1, \dots, X_m \f$ and i.i.d. |
| * \f$ Y_1, \dots, Y_n \f$, respectively, test the null hypothesis that the |
| * underlying distributions are equal, i.e., |
| * \f$ H_0 : \forall i,j: \Pr[X_i > Y_j] + \frac{\Pr[X_i = Y_j]}{2} = \frac 12 \f$. |
| * |
| * @param first Determines whether the value belongs to the first |
| * (if \c TRUE) or the second sample (if \c FALSE) |
| * @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$ |
| * |
| * @return A composite value. |
| * - <tt>statistic FLOAT8</tt> - Statistic |
| * \f[ |
| * z = \frac{u - \bar x}{\sqrt{\frac{mn(m+n+1)}{12}}} |
| * \f] |
| * where \f$ u \f$ is the u-statistic computed as follows. The z-statistic |
| * is approximately standard normally distributed. |
| * - <tt>u_statistic FLOAT8</tt> - Statistic |
| * \f$ u = \min \{ u_x, u_y \} \f$ where |
| * \f[ |
| * u_x = mn + \binom{m+1}{2} - \sum_{i=1}^m r_{x,i} |
| * \f] |
| * where |
| * \f[ |
| * r_{x,i} |
| * = \{ j \mid x_j < x_i \} + \{ j \mid y_j < x_i \} + |
| * \frac{\{ j \mid x_j = x_i \} + \{ j \mid y_j = x_i \} + 1}{2} |
| * \f] |
| * is defined as the rank of \f$ x_i \f$ in the combined list of all |
| * \f$ m+n \f$ observations. For ties, the average rank of all equal values |
| * is used. |
| * - <tt>p_value_one_sided FLOAT8</tt> - Approximate one-sided p-value, i.e., |
| * an approximate value for \f$ \Pr[Z \geq z \mid H_0] \f$. Computed as |
| * <tt>(1.0 - \ref normal_cdf "normal_cdf"(z_statistic))</tt>. |
| * - <tt>p_value_two_sided FLOAT8</tt> - Approximate two-sided p-value, i.e., |
| * an approximate value for \f$ \Pr[|Z| \geq |z| \mid H_0] \f$. Computed as |
| * <tt>(2 * \ref normal_cdf "normal_cdf"(-abs(z_statistic)))</tt>. |
| * |
| * @usage |
| * - Test null hypothesis that two samples stem from the same distribution: |
| * <pre>SELECT (mw_test(<em>first</em>, <em>value</em> ORDER BY <em>value</em>)).* FROM <em>source</em></pre> |
| * |
| * @note |
| * This aggregate must be used as an ordered aggregate |
| * (<tt>ORDER BY \em value</tt>) and will raise an exception if values are |
| * not ordered. |
| */ |
| m4_ifdef(<!__HAS_ORDERED_AGGREGATES__!>,<! |
| CREATE |
| m4_ifdef(<!__POSTGRESQL__!>, <!!>, <!ORDERED!>) |
| AGGREGATE MADLIB_SCHEMA.mw_test( |
| /*+ "first" */ BOOLEAN, |
| /*+ "value" */ DOUBLE PRECISION |
| ) ( |
| SFUNC=MADLIB_SCHEMA.mw_test_transition, |
| STYPE=DOUBLE PRECISION[], |
| FINALFUNC=MADLIB_SCHEMA.mw_test_final, |
| INITCOND='{0,0,0,0,0,0,0}' |
| ); |
| !>) |
| |
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.wsr_test_transition( |
| state DOUBLE PRECISION[], |
| value DOUBLE PRECISION, |
| "precision" DOUBLE PRECISION |
| ) RETURNS DOUBLE PRECISION[] |
| AS 'MODULE_PATHNAME' |
| LANGUAGE C IMMUTABLE STRICT |
| m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>); |
| |
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.wsr_test_transition( |
| state DOUBLE PRECISION[], |
| value DOUBLE PRECISION |
| ) RETURNS DOUBLE PRECISION[] |
| AS 'MODULE_PATHNAME' |
| LANGUAGE C IMMUTABLE STRICT |
| m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>); |
| |
| |
| DROP TYPE IF EXISTS MADLIB_SCHEMA.wsr_test_result CASCADE; |
| CREATE TYPE MADLIB_SCHEMA.wsr_test_result AS ( |
| statistic DOUBLE PRECISION, |
| rank_sum_pos FLOAT8, |
| rank_sum_neg FLOAT8, |
| num BIGINT, |
| z_statistic DOUBLE PRECISION, |
| p_value_one_sided DOUBLE PRECISION, |
| p_value_two_sided DOUBLE PRECISION |
| ); |
| |
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.wsr_test_final( |
| state DOUBLE PRECISION[]) |
| RETURNS MADLIB_SCHEMA.wsr_test_result |
| AS 'MODULE_PATHNAME' |
| LANGUAGE C IMMUTABLE STRICT |
| m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>); |
| /** |
| * @brief Perform Wilcoxon-Signed-Rank test |
| * |
| * Given realizations \f$ x_1, \dots, x_n \f$ of i.i.d. random variables |
| * \f$ X_1, \dots, X_n \f$ with unknown mean \f$ \mu \f$, test the null |
| * hypotheses \f$ H_0 : \mu \leq 0 \f$ and \f$ H_0 : \mu = 0 \f$. |
| * |
| * @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$. Values of 0 |
| * are ignored (i.e., they do not count towards \f$ n \f$). |
| * @param precision The precision \f$ \epsilon_i \f$ with which value is known. |
| * The precision determines the handling of ties. The current value |
| * \f$ v_i \f$ is regarded a tie with the previous value \f$ v_{i-1} \f$ if |
| * \f$ v_i - \epsilon_i \leq \max_{j=1, \dots, i-1} v_j + \epsilon_j \f$. |
| * If \c precision is negative, then it will be treated as |
| * <tt>value * 2^(-52)</tt>. (Note that \f$ 2^{-52} \f$ is the machine |
| * epsilon for type <tt>DOUBLE PRECISION</tt>.) |
| * |
| * @return A composite value: |
| * - <tt>statistic FLOAT8</tt> - statistic computed as follows. Let |
| * \f$ |
| * w^+ = \sum_{i \mid x_i > 0} r_i |
| * \f$ |
| * and |
| * \f$ |
| * w^- = \sum_{i \mid x_i < 0} r_i |
| * \f$ |
| * be the <em>signed rank sums</em> where |
| * \f[ |
| * r_i |
| * = \{ j \mid |x_j| < |x_i| \} |
| * + \frac{\{ j \mid |x_j| = |x_i| \} + 1}{2}. |
| * \f] |
| * The Wilcoxon signed-rank statistic is \f$ w = \min \{ w^+, w^- \} \f$. |
| * - <tt>rank_sum_pos FLOAT8</tt> - rank sum of all positive values, i.e., \f$ w^+ \f$ |
| * - <tt>rank_sum_neg FLOAT8</tt> - rank sum of all negative values, i.e., \f$ w^- \f$ |
| * - <tt>num BIGINT</tt> - number \f$ n \f$ of non-zero values |
| * - <tt>z_statistic FLOAT8</tt> - z-statistic |
| * \f[ |
| * z = \frac{w^+ - \frac{n(n+1)}{4}} |
| * {\sqrt{\frac{n(n+1)(2n+1)}{24} |
| * - \sum_{i=1}^n \frac{t_i^2 - 1}{48}}} |
| * \f] |
| * where \f$ t_i \f$ is the number of |
| * values with absolute value equal to \f$ |x_i| \f$. The corresponding |
| * random variable is approximately standard normally distributed. |
| * - <tt>p_value_one_sided FLOAT8</tt> - One-sided p-value i.e., |
| * \f$ \Pr[Z \geq z \mid \mu \leq 0] \f$. Computed as |
| * <tt>(1.0 - \ref normal_cdf "normal_cdf"(z_statistic))</tt>. |
| * - <tt>p_value_two_sided FLOAT8</tt> - Two-sided p-value, i.e., |
| * \f$ \Pr[ |Z| \geq |z| \mid \mu = 0] \f$. Computed as |
| * <tt>(2 * \ref normal_cdf "normal_cdf"(-abs(z_statistic)))</tt>. |
| * |
| * @usage |
| * - One-sample test: Test null hypothesis that the mean of a sample is at |
| * most (or equal to, respectively) \f$ \mu_0 \f$: |
| * <pre>SELECT (wsr_test(<em>value</em> - <em>mu_0</em> ORDER BY abs(<em>value</em>))).* FROM <em>source</em></pre> |
| * - Dependent paired test: Test null hypothesis that the mean difference |
| * between the first and second value in a pair is at most (or equal to, |
| * respectively) \f$ \mu_0 \f$: |
| * <pre>SELECT (wsr_test(<em>first</em> - <em>second</em> - <em>mu_0</em> ORDER BY abs(<em>first</em> - <em>second</em>))).* FROM <em>source</em></pre> |
| * If correctly determining ties is important (e.g., you may want to do so |
| * when comparing to software products that take \c first, \c second, |
| * and \c mu_0 as individual parameters), supply the precision parameter. |
| * This can be done as follows: |
| * <pre>SELECT (wsr_test( |
| <em>first</em> - <em>second</em> - <em>mu_0</em>, |
| 3 * 2^(-52) * greatest(first, second, mu_0) |
| ORDER BY abs(<em>first</em> - <em>second</em>) |
| )).* FROM <em>source</em></pre> |
| * Here \f$ 2^{-52} \f$ is the machine epsilon, which we scale to the |
| * magnitude of the input data and multiply with 3 because we have a sum with |
| * three terms. |
| * |
| * @note |
| * This aggregate must be used as an ordered aggregate |
| * (<tt>ORDER BY abs(\em value</tt>)) and will raise an exception if the |
| * absolute values are not ordered. |
| */ |
| m4_ifdef(<!__HAS_ORDERED_AGGREGATES__!>,<! |
| CREATE |
| m4_ifdef(<!__POSTGRESQL__!>, <!!>, <!ORDERED!>) |
| AGGREGATE MADLIB_SCHEMA.wsr_test( |
| /*+ "value" */ DOUBLE PRECISION, |
| /*+ "precision" */ DOUBLE PRECISION /*+ DEFAULT -1 */ |
| ) ( |
| SFUNC=MADLIB_SCHEMA.wsr_test_transition, |
| STYPE=DOUBLE PRECISION[], |
| FINALFUNC=MADLIB_SCHEMA.wsr_test_final, |
| INITCOND='{0,0,0,0,0,0,0,0,0}' |
| ); |
| !>) |
| |
| m4_ifdef(<!__HAS_ORDERED_AGGREGATES__!>,<! |
| CREATE |
| m4_ifdef(<!__POSTGRESQL__!>, <!!>, <!ORDERED!>) |
| AGGREGATE MADLIB_SCHEMA.wsr_test( |
| /*+ value */ DOUBLE PRECISION |
| ) ( |
| SFUNC=MADLIB_SCHEMA.wsr_test_transition, |
| STYPE=DOUBLE PRECISION[], |
| FINALFUNC=MADLIB_SCHEMA.wsr_test_final, |
| INITCOND='{0,0,0,0,0,0,0,0,0}' |
| ); |
| !>) |
| |
| DROP TYPE IF EXISTS MADLIB_SCHEMA.one_way_anova_result CASCADE; |
| CREATE TYPE MADLIB_SCHEMA.one_way_anova_result AS ( |
| sum_squares_between DOUBLE PRECISION, |
| sum_squares_within DOUBLE PRECISION, |
| df_between BIGINT, |
| df_within BIGINT, |
| mean_squares_between DOUBLE PRECISION, |
| mean_squares_within DOUBLE PRECISION, |
| statistic DOUBLE PRECISION, |
| p_value DOUBLE PRECISION |
| ); |
| |
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.one_way_anova_transition( |
| state DOUBLE PRECISION[], |
| "group" INTEGER, |
| value DOUBLE PRECISION) |
| RETURNS DOUBLE PRECISION[] |
| AS 'MODULE_PATHNAME' |
| LANGUAGE C IMMUTABLE STRICT |
| m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>); |
| |
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.one_way_anova_merge_states( |
| state1 DOUBLE PRECISION[], |
| state2 DOUBLE PRECISION[]) |
| RETURNS DOUBLE PRECISION[] |
| AS 'MODULE_PATHNAME' |
| LANGUAGE C IMMUTABLE STRICT |
| m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>); |
| |
| CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.one_way_anova_final( |
| state DOUBLE PRECISION[]) |
| RETURNS MADLIB_SCHEMA.one_way_anova_result |
| AS 'MODULE_PATHNAME' |
| LANGUAGE C IMMUTABLE STRICT |
| m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>); |
| /** |
| * @brief Perform one-way analysis of variance |
| * |
| * Given realizations |
| * \f$ x_{1,1}, \dots, x_{1, n_1}, x_{2,1}, \dots, x_{2,n_2}, \dots, x_{k,n_k} \f$ |
| * of i.i.d. random variables \f$ X_{i,j} \sim N(\mu_i, \sigma^2) \f$ with |
| * unknown parameters \f$ \mu_1, \dots, \mu_k \f$ and \f$ \sigma^2 \f$, test the |
| * null hypotheses \f$ H_0 : \mu_1 = \dots = \mu_k \f$. |
| * |
| * @param group Group which \c value is from. Note that \c group can assume |
| * arbitary value not limited to a continguous range of integers. |
| * @param value Value of random variate \f$ x_{i,j} \f$ |
| * |
| * @return A composite value as follows. Let \f$ n := \sum_{i=1}^k n_i \f$ be |
| * the total size of all samples. Denote by \f$ \bar x \f$ the grand |
| * mean, by \f$ \overline{x_i} \f$ the group |
| * sample means, and by \f$ s_i^2 \f$ the group |
| * sample variances. |
| * - <tt>sum_squares_between DOUBLE PRECISION</tt> - sum of squares between the |
| * group means, i.e., |
| * \f$ |
| * \mathit{SS}_b = \sum_{i=1}^k n_i (\overline{x_i} - \bar x)^2. |
| * \f$ |
| * - <tt>sum_squares_within DOUBLE PRECISION</tt> - sum of squares within the |
| * groups, i.e., |
| * \f$ |
| * \mathit{SS}_w = \sum_{i=1}^k (n_i - 1) s_i^2. |
| * \f$ |
| * - <tt>df_between BIGINT</tt> - degree of freedom for between-group variation \f$ (k-1) \f$ |
| * - <tt>df_within BIGINT</tt> - degree of freedom for within-group variation \f$ (n-k) \f$ |
| * - <tt>mean_squares_between DOUBLE PRECISION</tt> - mean square between |
| * groups, i.e., |
| * \f$ |
| * s_b^2 := \frac{\mathit{SS}_b}{k-1} |
| * \f$ |
| * - <tt>mean_squares_within DOUBLE PRECISION</tt> - mean square within |
| * groups, i.e., |
| * \f$ |
| * s_w^2 := \frac{\mathit{SS}_w}{n-k} |
| * \f$ |
| * - <tt>statistic DOUBLE PRECISION</tt> - Statistic computed as |
| * \f[ |
| * f = \frac{s_b^2}{s_w^2}. |
| * \f] |
| * This statistic is Fisher F-distributed with \f$ (k-1) \f$ degrees of |
| * freedom in the numerator and \f$ (n-k) \f$ degrees of freedom in the |
| * denominator. |
| * - <tt>p_value DOUBLE PRECISION</tt> - p-value, i.e., |
| * \f$ \Pr[ F \geq f \mid H_0] \f$. |
| * |
| * @usage |
| * - Test null hypothesis that the mean of the all samples is equal: |
| * <pre>SELECT (one_way_anova(<em>group</em>, <em>value</em>)).* FROM <em>source</em></pre> |
| */ |
| CREATE AGGREGATE MADLIB_SCHEMA.one_way_anova( |
| /*+ group */ INTEGER, |
| /*+ value */ DOUBLE PRECISION) ( |
| |
| SFUNC=MADLIB_SCHEMA.one_way_anova_transition, |
| STYPE=DOUBLE PRECISION[], |
| FINALFUNC=MADLIB_SCHEMA.one_way_anova_final, |
| m4_ifdef(<!__POSTGRESQL__!>, <!!>, <!PREFUNC=MADLIB_SCHEMA.one_way_anova_merge_states,!>) |
| INITCOND='{0,0}' |
| ); |
| |
| m4_changequote(<!`!>,<!'!>) |