blob: 61dbb89bff5f07e72553d26c672f05d64fd3730e [file] [log] [blame]
/* ----------------------------------------------------------------------- */
/**
*
* @file hypothesis_tests.sql_in
*
* @brief SQL functions for statistical hypothesis tests
*
* @sa For an overview of hypthesis-test functions, see the module
* description \ref grp_stats_tests.
*
*/
/* ----------------------------------------------------------------------- */
m4_include(`SQLCommon.m4')
m4_changequote(<!,!>)
/**
@addtogroup grp_stats_tests
<div class="toc"><b>Contents</b>
<ul>
<li><a href="#input">Input</a></li>
<li><a href="#usage">Usage</a></li>
<li><a href="#examples">Examples</a></li>
<li><a href="#literature">Literature</a></li>
<li><a href="#related">Related Topics</a></li>
</ul>
</div>
@brief Provides functions to perform statistical hypothesis tests.
Hypothesis tests are used to confirm or reject a <em>null hypothesis</em>
\f$ H_0 \f$ about the distribution of random variables, given realizations of
these random variables. Since in general it is not possible to make statements
with certainty, one is interested in the probability \f$ p \f$ of seeing random
variates at least as extreme as the ones observed, assuming that \f$ H_0 \f$ is
true. If this probability \f$ p \f$ is small, \f$ H_0 \f$ will be rejected by
the test with <em>significance level</em> \f$ p \f$. Falsifying \f$ H_0 \f$ is
the canonic goal when employing a hypothesis test. That is, hypothesis tests are
typically used in order to substantiate that instead the <em>alternative
hypothesis</em> \f$ H_1 \f$ is true.
Hypothesis tests may be divided into parametric and non-parametric tests. A
parametric test assumes certain distributions and makes inferences about
parameters of the distributions (e.g., the mean of a normal distribution).
Formally, there is a given domain of possible parameters \f$ \Gamma \f$ and the
null hypothesis \f$ H_0 \f$ is the event that the true parameter
\f$ \gamma_0 \in \Gamma_0 \f$, where \f$ \Gamma_0 \subsetneq \Gamma \f$.
Non-parametric tests, on the other hand, do not assume any particular
distribution of the sample (e.g., a non-parametric test may simply test if two
distributions are similar).
The first step of a hypothesis test is to compute a <em>test statistic</em>,
which is a function of the random variates, i.e., a random variate itself.
A hypothesis test relies on the distribution of the test statistic being
(approximately) known. Now, the \f$ p \f$-value is the probability of seeing a
test statistic at least as extreme as the one observed, assuming that
\f$ H_0 \f$ is true. In a case where the null hypothesis corresponds to a family
of distributions (e.g., in a parametric test where \f$ \Gamma_0 \f$ is not a
singleton set), the \f$ p \f$-value is the supremum, over all possible
distributions according to the null hypothesis, of these probabilities.
@note Please refer to \ref hypothesis_tests.sql_in for additional technical
information on the MADlib implementation of hypothesis tests, and for
detailed function signatures for all tests.
@anchor input
@input
Input data is assumed to be normalized with all values stored row-wise. In
general, the following inputs are expected.
<b>One-sample tests</b> expect the following form:
<pre>{TABLE|VIEW} <em>source</em> (
...
<em>value</em> DOUBLE PRECISION
...
)</pre>
<b>Two-sample tests</b> expect the following form:
<pre>{TABLE|VIEW} <em>source</em> (
...
<em>first</em> BOOLEAN,
<em>value</em> DOUBLE PRECISION
...
)</pre>
The \c first column indicates whether a value is from the first sample (if \c TRUE) or the
second sample (if \c FALSE).
<b>Many-sample tests</b> expect the following form:
<pre>{TABLE|VIEW} <em>source</em> (
...
<em>group</em> INTEGER,
<em>value</em> DOUBLE PRECISION
...
)</pre>
@anchor usage
@usage
All tests are implemented as aggregate functions. The non-parametric
(rank-based) tests are implemented as ordered aggregate functions and thus
necessitate an <tt>ORDER BY</tt> clause. In the following, the most simple
forms of usage are given. Specific function signatures, as described in
\ref hypothesis_tests.sql_in, may require more arguments or a different
<tt>ORDER BY</tt> clause.
- Run a parametric one-sample test:
<pre>SELECT <em>test</em>(<em>value</em>) FROM <em>source</em></pre>
where '<em>test</em>' can be one of
- <tt>t_test_one</tt> (one-sample or dependent paired Student's t-test)
- <tt>chi2_gof_test</tt> (Pearson's chi-squared goodness of fit test, also used for chi-squared independence test as shown in example section below)
- Run a parametric two-sample/multi-sample test:
<pre>SELECT <em>test</em>(<em>first/group</em>, <em>value</em>) FROM <em>source</em></pre>
where '<em>test</em>' can be one of
- <tt>f_test</tt> (Fisher F-test)
- <tt>t_test_two_pooled</tt> (two-sample pooled Student’s t-test, i.e. equal variances)
- <tt>t_test_two_unpooled</tt> (two-sample unpooled t-test, i.e., unequal variances, also known as Welch's t-test)
- <tt>one_way_anova</tt> (one-way analysis of variance, multi-sample)
- Run a non-parametric two-sample/multi-sample test:
<pre>SELECT <em>test</em>(<em>first/group</em>, <em>value</em> ORDER BY <em>value</em>) FROM <em>source</em></pre>
where '<em>test</em>' can be one of
- <tt>ks_test</tt> (Kolmogorov-Smirnov test)
- <tt>mw_test</tt> (Mann-Whitney test)
- <tt>wsr_test</tt> (Wilcoxon signed-rank test, multi-sample)
<b>Note on non-parametric tests:</b> Kolomogov-Smirnov two-sample test is based on the asymptotic theory.
The p-value is given by comparing the test statistics with the Kolomogov distribution.
The p-value is also adjusted for data with heavy tail distribution, which may give
different results than those given by R function's ks.test. See [3] for a detailed explanation.
The literature is not unanimous about the definitions of the Wilcoxon rank sum
and Mann-Whitney tests. There are two possible definitions for the statistic;
MADlib outputs the minimum of the two and uses it for significance testing. This
might give different results for both mw_test and wsr_test compared to statistical
functions in other popular packages (like R's wilcox.test function). See [4] for
a detailed explanation.
@anchor examples
@examp
- <b>One-sample and two-sample t-test</b> (data is subset of mpg data from
<a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda352.htm">NIST/SEMATECH</a>)
<pre class="example">
-- Load data
DROP TABLE IF EXISTS auto83b;
CREATE TABLE auto83b (
id SERIAL,
mpg_us DOUBLE PRECISION,
mpg_j DOUBLE PRECISION
);
COPY auto83b (mpg_us, mpg_j) FROM stdin DELIMITER '|';
18|24
15|27
18|27
16|25
17|31
15|35
14|24
14|19
21|31
10|32
10|24
11|26
9| 9
\\N|32
\\N|37
\\N|38
\\N|34
\\N|34
\\N|32
\\N|33
\\N|32
\\N|25
\\N|24
\\N|37
13|\\N
12|\\N
18|\\N
21|\\N
19|\\N
21|\\N
15|\\N
16|\\N
15|\\N
11|\\N
20|\\N
21|\\N
19|\\N
15|\\N
\\.
</pre>
<pre class="example">
-- Create table for one sample tests
DROP TABLE IF EXISTS auto83b_one_sample;
CREATE TABLE auto83b_one_sample AS
SELECT mpg_us AS mpg
FROM auto83b
WHERE mpg_us is not NULL;
-- Print table
SELECT * FROM auto83b_one_sample;
</pre>
<pre class="result">
mpg
18
15
18
16
17
15
14
14
21
10
10
11
9
13
12
18
21
19
21
15
16
15
11
20
21
19
15
(27 rows)
</pre>
<pre class="example">
-- Create table for two sample tests
DROP TABLE IF EXISTS auto83b_two_sample;
CREATE TABLE auto83b_two_sample AS
SELECT TRUE AS is_us, mpg_us AS mpg
FROM auto83b
WHERE mpg_us is not NULL
UNION ALL
SELECT FALSE, mpg_j
FROM auto83b
WHERE mpg_j is not NULL;
-- Print table
SELECT * FROM auto83b_two_sample;
</pre>
<pre class="result">
is_us | mpg
-------+-----
t | 18
t | 15
t | 18
t | 16
t | 17
t | 15
t | 14
t | 14
t | 21
t | 10
t | 10
t | 11
t | 9
t | 13
t | 12
t | 18
t | 21
t | 19
t | 21
t | 15
t | 16
t | 15
t | 11
t | 20
t | 21
t | 19
t | 15
f | 24
f | 27
f | 27
f | 25
f | 31
f | 35
f | 24
f | 19
f | 31
f | 32
f | 24
f | 26
f | 9
f | 32
f | 37
f | 38
f | 34
f | 34
f | 32
f | 33
f | 32
f | 25
f | 24
f | 37
(51 rows)
</pre>
<pre class="example">
-- One sample tests
SELECT (madlib.t_test_one(mpg - 20)).* FROM auto83b_one_sample; -- test rejected for mean = 20
</pre>
<pre class="result">
statistic | df | p_value_one_sided | p_value_two_sided
------------------+----+-------------------+----------------------
-6.0532478722666 | 26 | 0.999998926789141 | 2.14642171769697e-06
</pre>
<pre class="example">
SELECT (madlib.t_test_one(mpg - 15.7)).* FROM auto83b_one_sample; -- test not rejected
</pre>
<pre class="result">
statistic | df | p_value_one_sided | p_value_two_sided
---------------------+----+-------------------+-------------------
0.00521831713126531 | 26 | 0.497938118950661 | 0.995876237901321
</pre>
<pre class="example">
-- Two sample tests
SELECT (madlib.t_test_two_pooled(is_us, mpg)).* FROM auto83b_two_sample;
</pre>
<pre class="result">
statistic | df | p_value_one_sided | p_value_two_sided
-------------------+----+-------------------+----------------------
-8.89342267075968 | 49 | 0.999999999995748 | 8.50408632402377e-12
</pre>
<pre class="example">
SELECT (madlib.t_test_two_unpooled(is_us, mpg)).* FROM auto83b_two_sample;
</pre>
<pre class="result">
statistic | df | p_value_one_sided | p_value_two_sided
-------------------+------------------+-------------------+----------------------
-8.61746388524314 | 35.1283818346179 | 0.999999999821218 | 3.57563867403599e-10
</pre>
- <b>F-Test</b> (Uses same data as above t-test)
<pre class="example">
SELECT (madlib.f_test(is_us, mpg)).* FROM auto83b_two_sample;
-- Test result indicates that the two distributions have different variances
</pre>
<pre class="result">
statistic | df1 | df2 | p_value_one_sided | p_value_two_sided
-------------------+-----+-----+-------------------+---------------------
0.311786921089247 | 26 | 23 | 0.997559863672441 | 0.00488027265511803
</pre>
- <b>Chi-squared goodness-of-fit test</b> (<a href="http://www.statsdirect.com/help/default.htm#nonparametric_methods/chisq_goodness_fit.htm">Data source</a>)
<pre class="example">
CREATE TABLE chi2_test_blood_group (
id SERIAL,
blood_group VARCHAR,
observed BIGINT,
expected DOUBLE PRECISION
);
INSERT INTO chi2_test_blood_group(blood_group, observed, expected) VALUES
('O', 67, 82.28),
('A', 83, 84.15),
('B', 29, 14.96),
('AB', 8, 5.61);
SELECT (madlib.chi2_gof_test(observed, expected)).* FROM chi2_test_blood_group;
</pre>
<pre class="result">
statistic | p_value | df | phi | contingency_coef
------------------+----------------------+----+------------------+-------------------
17.0481013341976 | 0.000690824622923826 | 3 | 2.06446732440826 | 0.899977280680593
</pre>
- <b>Chi-squared independence test</b> (<a href=http://itl.nist.gov/div898/software/dataplot/refman1/auxillar/chistest.htm>Data source</a>)
The Chi-squared independence test uses the Chi-squared goodness-of-fit function,
as shown in the example below. The expected value needs to be computed and passed
to the goodness-of-fit function. The expected value for MADlib is computed as
<em>sum of rows * sum of columns</em>, for each element of the input matrix.
For e.g., expected value for element (2,1) would be <em>sum of row 2 * sum of column 1</em>.
<pre class="example">
CREATE TABLE chi2_test_friendly (
id_x SERIAL,
values INTEGER[]
);
INSERT INTO chi2_test_friendly(values) VALUES
(array[5, 29, 14, 16]),
(array[15, 54, 14, 10]),
(array[20, 84, 17, 94]),
(array[68, 119, 26, 7]);
-- Input table is expected to be unpivoted, so need to pivot it
CREATE TABLE chi2_test_friendly_unpivoted AS
SELECT id_x, id_y, values[id_y] AS observed
FROM
chi2_test_friendly,
generate_series(1,4) AS id_y;
-- Compute Chi-squared independence statistic, by calculating expected value in the SQL and calling the goodness-of-fit function
SELECT (madlib.chi2_gof_test(observed, expected, deg_freedom)).*
FROM (
-- Compute expected values and degrees of freedom
SELECT
observed,
sum(observed) OVER (PARTITION BY id_x)::DOUBLE PRECISION *
sum(observed) OVER (PARTITION BY id_y) AS expected
FROM chi2_test_friendly_unpivoted
) p, (
SELECT
(count(DISTINCT id_x) - 1) * (count(DISTINCT id_y) - 1) AS deg_freedom
FROM chi2_test_friendly_unpivoted
) q;
</pre>
<pre class="result">
statistic | p_value | df | phi | contingency_coef
------------------+----------------------+----+------------------+-------------------
138.289841626008 | 2.32528678709871e-25 | 9 | 2.93991753313346 | 0.946730727519112
</pre>
- <b>ANOVA test</b> (<a href="http://www.itl.nist.gov/div898/handbook/prc/section4/prc433.htm">Data source</a>)
<pre class="example">
CREATE TABLE nist_anova_test (
id SERIAL,
resistance FLOAT8[]
);
INSERT INTO nist_anova_test(resistance) VALUES
(array[6.9,8.3,8.0]),
(array[5.4,6.8,10.5]),
(array[5.8,7.8,8.1]),
(array[4.6,9.2,6.9]),
(array[4.0,6.5,9.3]);
SELECT (madlib.one_way_anova(level, value)).* FROM (
SELECT level, resistance[level] AS value
FROM
nist_anova_test, (SELECT * FROM generate_series(1,3) level) q1
) q2;
</pre>
<pre class="result">
sum_squares_between | sum_squares_within | df_between | df_within | mean_squares_between | mean_squares_within | statistic | p_value
---------------------+--------------------+------------+-----------+----------------------+---------------------+------------------+--------------------
27.8973333333333 | 17.452 | 2 | 12 | 13.9486666666667 | 1.45433333333333 | 9.59110703644281 | 0.0032482226008593
</pre>
- <b>Kolmogorov-Smirnov test</b> (<a href="http://www.physics.csbsju.edu/stats/KS-test.html">Data source</a>)
<pre class="example">
CREATE TABLE ks_sample_1 AS
SELECT
TRUE AS first,
unnest(ARRAY[0.22, -0.87, -2.39, -1.79, 0.37, -1.54, 1.28, -0.31, -0.74, 1.72, 0.38, -0.17, -0.62, -1.10, 0.30, 0.15, 2.30, 0.19, -0.50, -0.09]) AS value
UNION ALL
SELECT
FALSE,
unnest(ARRAY[-5.13, -2.19, -2.43, -3.83, 0.50, -3.25, 4.32, 1.63, 5.18, -0.43, 7.11, 4.87, -3.10, -5.81, 3.76, 6.31, 2.58, 0.07, 5.76, 3.50]);
SELECT (madlib.ks_test(first, value,
(SELECT count(value) FROM ks_sample_1 WHERE first),
(SELECT count(value) FROM ks_sample_1 WHERE NOT first)
ORDER BY value)).*
FROM ks_sample_1;
</pre>
<pre class="result">
statistic | k_statistic | p_value
-----------+-----------------+--------------------
0.45 | 1.4926782214936 | 0.0232132758544496
</pre>
- <b>Mann-Whitney test</b> (use same data as t-test)
<pre class="example">
SELECT (madlib.mw_test(is_us, mpg ORDER BY mpg)).* from auto83b_two_sample;
-- Note first parameter above is BOOLEAN
</pre>
<pre class="result">
statistic | u_statistic | p_value_one_sided | p_value_two_sided
-------------------+-------------+-------------------+----------------------
-5.50097925755249 | 32.5 | 0.999999981115618 | 3.77687645883758e-08
</pre>
- <b>Wilcoxon signed-rank test</b>
<pre class="example">
DROP TABLE IF EXISTS test_wsr;
CREATE TABLE test_wsr (
x DOUBLE PRECISION,
y DOUBLE PRECISION
);
COPY test_wsr (x, y) FROM stdin DELIMITER '|';
0.32|0.39
0.4|0.47
0.11|0.11
0.47|0.43
0.32|0.42
0.35|0.3
0.32|0.43
0.63|0.98
0.5|0.86
0.6|0.79
0.38|0.33
0.46|0.45
0.2|0.22
0.31|0.3
0.62|0.6
0.52|0.53
0.77|0.85
0.23|0.21
0.3|0.33
0.7|0.57
0.41|0.43
0.53|0.49
0.19|0.2
0.31|0.35
0.48|0.4
\\.
\s
SELECT (madlib.wsr_test(
x - y,
2 * 2^(-52) * greatest(x,y)
ORDER BY abs(x - y)
)).*
FROM test_wsr;
</pre>
<pre class="result">
statistic | rank_sum_pos | rank_sum_neg | num | z_statistic | p_value_one_sided | p_value_two_sided
-----------+--------------+--------------+-----+-------------------+-------------------+-------------------
105.5 | 105.5 | 194.5 | 24 | -1.27318365656729 | 0.898523560667509 | 0.202952878664983
</pre>
@anchor literature
@literature
[1] M. Hollander, D. Wolfe: <em>Nonparametric Statistical Methods</em>,
2nd edition, Wiley, 1999
[2] E. Lehmann, J. Romano: <em>Testing Statistical Hypotheses</em>, 3rd edition,
Springer, 2005
[3] M. Stephens: <em>Use of the Kolmogorov-Smirnov, Cramer-Von Mises and related statistics without extensive tables</em>, Journal of the Royal Statistical Society. Series B (Methodological) (1970): 115-122.
[4] Wikipedia: Mann–Whitney U test calculation, http://en.wikipedia.org/wiki/Mann-Whitney_test#Calculations
@anchor related
@par Related Topics
File hypothesis_tests.sql_in documenting the SQL functions.
*/
DROP TYPE IF EXISTS MADLIB_SCHEMA.t_test_result CASCADE;
CREATE TYPE MADLIB_SCHEMA.t_test_result AS (
statistic DOUBLE PRECISION,
df DOUBLE PRECISION,
p_value_one_sided DOUBLE PRECISION,
p_value_two_sided DOUBLE PRECISION
);
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_one_transition(
state DOUBLE PRECISION[],
value DOUBLE PRECISION
) RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT
m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>);
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_merge_states(
state1 DOUBLE PRECISION[],
state2 DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT
m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>);
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_one_final(
state DOUBLE PRECISION[])
RETURNS MADLIB_SCHEMA.t_test_result
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT
m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>);
DROP TYPE IF EXISTS MADLIB_SCHEMA.f_test_result CASCADE;
CREATE TYPE MADLIB_SCHEMA.f_test_result AS (
statistic DOUBLE PRECISION,
df1 DOUBLE PRECISION,
df2 DOUBLE PRECISION,
p_value_one_sided DOUBLE PRECISION,
p_value_two_sided DOUBLE PRECISION
);
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.f_test_final(
state DOUBLE PRECISION[])
RETURNS MADLIB_SCHEMA.f_test_result
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT
m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>);
/**
* @brief Perform one-sample or dependent paired Student t-test
*
* Given realizations \f$ x_1, \dots, x_n \f$ of i.i.d. random variables
* \f$ X_1, \dots, X_n \sim N(\mu, \sigma^2) \f$ with unknown parameters \f$ \mu \f$ and
* \f$ \sigma^2 \f$, test the null hypotheses \f$ H_0 : \mu \leq 0 \f$ and
* \f$ H_0 : \mu = 0 \f$.
*
* @param value Value of random variate \f$ x_i \f$
*
* @return A composite value as follows. We denote by \f$ \bar x \f$ the
* sample mean and by \f$ s^2 \f$ the
* sample variance.
* - <tt>statistic FLOAT8</tt> - Statistic
* \f[
* t = \frac{\sqrt n \cdot \bar x}{s}
* \f]
* The corresponding random
* variable is Student-t distributed with
* \f$ (n - 1) \f$ degrees of freedom.
* - <tt>df FLOAT8</tt> - Degrees of freedom \f$ (n - 1) \f$
* - <tt>p_value_one_sided FLOAT8</tt> - Lower bound on one-sided p-value.
* In detail, the result is \f$ \Pr[\bar X \geq \bar x \mid \mu = 0] \f$,
* which is a lower bound on
* \f$ \Pr[\bar X \geq \bar x \mid \mu \leq 0] \f$. Computed as
* <tt>(1.0 - \ref students_t_cdf "students_t_cdf"(statistic))</tt>.
* - <tt>p_value_two_sided FLOAT8</tt> - Two-sided p-value, i.e.,
* \f$ \Pr[ |\bar X| \geq |\bar x| \mid \mu = 0] \f$. Computed as
* <tt>(2 * \ref students_t_cdf "students_t_cdf"(-abs(statistic)))</tt>.
*
* @usage
* - One-sample t-test: Test null hypothesis that the mean of a sample is at
* most (or equal to, respectively) \f$ \mu_0 \f$:
* <pre>SELECT (t_test_one(<em>value</em> - <em>mu_0</em>)).* FROM <em>source</em></pre>
* - Dependent paired t-test: Test null hypothesis that the mean difference
* between the first and second value in each pair is at most (or equal to,
* respectively) \f$ \mu_0 \f$:
* <pre>SELECT (t_test_one(<em>first</em> - <em>second</em> - <em>mu_0</em>)).*
* FROM <em>source</em></pre>
*/
CREATE AGGREGATE MADLIB_SCHEMA.t_test_one(
/*+ value */ DOUBLE PRECISION) (
SFUNC=MADLIB_SCHEMA.t_test_one_transition,
STYPE=DOUBLE PRECISION[],
FINALFUNC=MADLIB_SCHEMA.t_test_one_final,
m4_ifdef(<!__POSTGRESQL__!>, <!!>, <!PREFUNC=MADLIB_SCHEMA.t_test_merge_states,!>)
INITCOND='{0,0,0,0,0,0,0}'
);
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_two_transition(
state DOUBLE PRECISION[],
"first" BOOLEAN,
"value" DOUBLE PRECISION)
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT
m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>);
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_two_pooled_final(
state DOUBLE PRECISION[])
RETURNS MADLIB_SCHEMA.t_test_result
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT
m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>);
/**
* @brief Perform two-sample pooled (i.e., equal variances) Student t-test
*
* Given realizations \f$ x_1, \dots, x_n \f$ and \f$ y_1, \dots, y_m \f$ of
* i.i.d. random variables \f$ X_1, \dots, X_n \sim N(\mu_X, \sigma^2) \f$ and
* \f$ Y_1, \dots, Y_m \sim N(\mu_Y, \sigma^2) \f$ with unknown parameters
* \f$ \mu_X, \mu_Y, \f$ and \f$ \sigma^2 \f$, test the null hypotheses
* \f$ H_0 : \mu_X \leq \mu_Y \f$ and \f$ H_0 : \mu_X = \mu_Y \f$.
*
* @param first Indicator whether \c value is from first sample
* \f$ x_1, \dots, x_n \f$ (if \c TRUE) or from second sample
* \f$ y_1, \dots, y_m \f$ (if \c FALSE)
* @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$
*
* @return A composite value as follows. We denote by \f$ \bar x, \bar y \f$
* the sample means and by \f$ s_X^2, s_Y^2 \f$ the
* sample variances.
* - <tt>statistic FLOAT8</tt> - Statistic
* \f[
* t = \frac{\bar x - \bar y}{s_p \sqrt{1/n + 1/m}}
* \f]
* where
* \f[
* s_p^2 = \frac{\sum_{i=1}^n (x_i - \bar x)^2
* + \sum_{i=1}^m (y_i - \bar y)^2}
* {n + m - 2}
* \f]
* is the <em>pooled variance</em>.
* The corresponding random
* variable is Student-t distributed with
* \f$ (n + m - 2) \f$ degrees of freedom.
* - <tt>df FLOAT8</tt> - Degrees of freedom \f$ (n + m - 2) \f$
* - <tt>p_value_one_sided FLOAT8</tt> - Lower bound on one-sided p-value.
* In detail, the result is \f$ \Pr[\bar X - \bar Y \geq \bar x - \bar y \mid \mu_X = \mu_Y] \f$,
* which is a lower bound on
* \f$ \Pr[\bar X - \bar Y \geq \bar x - \bar y \mid \mu_X \leq \mu_Y] \f$.
* Computed as
* <tt>(1.0 - \ref students_t_cdf "students_t_cdf"(statistic))</tt>.
* - <tt>p_value_two_sided FLOAT8</tt> - Two-sided p-value, i.e.,
* \f$ \Pr[ |\bar X - \bar Y| \geq |\bar x - \bar y| \mid \mu_X = \mu_Y] \f$.
* Computed as
* <tt>(2 * \ref students_t_cdf "students_t_cdf"(-abs(statistic)))</tt>.
*
* @usage
* - Two-sample pooled t-test: Test null hypothesis that the mean of the first
* sample is at most (or equal to, respectively) the mean of the second
* sample:
* <pre>SELECT (t_test_pooled(<em>first</em>, <em>value</em>)).* FROM <em>source</em></pre>
*/
CREATE AGGREGATE MADLIB_SCHEMA.t_test_two_pooled(
/*+ "first" */ BOOLEAN,
/*+ "value" */ DOUBLE PRECISION) (
SFUNC=MADLIB_SCHEMA.t_test_two_transition,
STYPE=DOUBLE PRECISION[],
FINALFUNC=MADLIB_SCHEMA.t_test_two_pooled_final,
m4_ifdef(<!__POSTGRESQL__!>, <!!>, <!PREFUNC=MADLIB_SCHEMA.t_test_merge_states,!>)
INITCOND='{0,0,0,0,0,0,0}'
);
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_two_unpooled_final(
state DOUBLE PRECISION[])
RETURNS MADLIB_SCHEMA.t_test_result
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT
m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>);
/**
* @brief Perform unpooled (i.e., unequal variances) t-test (also known as
* Welch's t-test)
*
* Given realizations \f$ x_1, \dots, x_n \f$ and \f$ y_1, \dots, y_m \f$ of
* i.i.d. random variables \f$ X_1, \dots, X_n \sim N(\mu_X, \sigma_X^2) \f$ and
* \f$ Y_1, \dots, Y_m \sim N(\mu_Y, \sigma_Y^2) \f$ with unknown parameters
* \f$ \mu_X, \mu_Y, \sigma_X^2, \f$ and \f$ \sigma_Y^2 \f$, test the null
* hypotheses \f$ H_0 : \mu_X \leq \mu_Y \f$ and \f$ H_0 : \mu_X = \mu_Y \f$.
*
* @param first Indicator whether \c value is from first sample
* \f$ x_1, \dots, x_n \f$ (if \c TRUE) or from second sample
* \f$ y_1, \dots, y_m \f$ (if \c FALSE)
* @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$
*
* @return A composite value as follows. We denote by \f$ \bar x, \bar y \f$
* the sample means and by \f$ s_X^2, s_Y^2 \f$ the
* sample variances.
* - <tt>statistic FLOAT8</tt> - Statistic
* \f[
* t = \frac{\bar x - \bar y}{\sqrt{s_X^2/n + s_Y^2/m}}
* \f]
* The corresponding random variable is approximately Student-t distributed
* with
* \f[
* \frac{(s_X^2 / n + s_Y^2 / m)^2}{(s_X^2 / n)^2/(n-1) + (s_Y^2 / m)^2/(m-1)}
* \f]
* degrees of freedom (Welch–Satterthwaite formula).
* - <tt>df FLOAT8</tt> - Degrees of freedom (as above)
* - <tt>p_value_one_sided FLOAT8</tt> - Lower bound on one-sided p-value.
* In detail, the result is \f$ \Pr[\bar X - \bar Y \geq \bar x - \bar y \mid \mu_X = \mu_Y] \f$,
* which is a lower bound on
* \f$ \Pr[\bar X - \bar Y \geq \bar x - \bar y \mid \mu_X \leq \mu_Y] \f$.
* Computed as
* <tt>(1.0 - \ref students_t_cdf "students_t_cdf"(statistic))</tt>.
* - <tt>p_value_two_sided FLOAT8</tt> - Two-sided p-value, i.e.,
* \f$ \Pr[ |\bar X - \bar Y| \geq |\bar x - \bar y| \mid \mu_X = \mu_Y] \f$.
* Computed as
* <tt>(2 * \ref students_t_cdf "students_t_cdf"(-abs(statistic)))</tt>.
*
* @usage
* - Two-sample unpooled t-test: Test null hypothesis that the mean of the
* first sample is at most (or equal to, respectively) the mean of the second
* sample:
* <pre>SELECT (t_test_unpooled(<em>first</em>, <em>value</em>)).* FROM <em>source</em></pre>
*/
CREATE AGGREGATE MADLIB_SCHEMA.t_test_two_unpooled(
/*+ "first" */ BOOLEAN,
/*+ "value" */ DOUBLE PRECISION) (
SFUNC=MADLIB_SCHEMA.t_test_two_transition,
STYPE=DOUBLE PRECISION[],
FINALFUNC=MADLIB_SCHEMA.t_test_two_unpooled_final,
m4_ifdef(<!__POSTGRESQL__!>, <!!>, <!PREFUNC=MADLIB_SCHEMA.t_test_merge_states,!>)
INITCOND='{0,0,0,0,0,0,0}'
);
/**
* @brief Perform Fisher F-test
*
* Given realizations \f$ x_1, \dots, x_m \f$ and \f$ y_1, \dots, y_n \f$ of
* i.i.d. random variables \f$ X_1, \dots, X_m \sim N(\mu_X, \sigma^2) \f$ and
* \f$ Y_1, \dots, Y_n \sim N(\mu_Y, \sigma^2) \f$ with unknown parameters
* \f$ \mu_X, \mu_Y, \f$ and \f$ \sigma^2 \f$, test the null hypotheses
* \f$ H_0 : \sigma_X < \sigma_Y \f$ and \f$ H_0 : \sigma_X = \sigma_Y \f$.
*
* @param first Indicator whether \c value is from first sample
* \f$ x_1, \dots, x_m \f$ (if \c TRUE) or from second sample
* \f$ y_1, \dots, y_n \f$ (if \c FALSE)
* @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$
*
* @return A composite value as follows. We denote by \f$ \bar x, \bar y \f$
* the sample means and by \f$ s_X^2, s_Y^2 \f$ the
* sample variances.
* - <tt>statistic FLOAT8</tt> - Statistic
* \f[
* f = \frac{s_Y^2}{s_X^2}
* \f]
* The corresponding random
* variable is F-distributed with
* \f$ (n - 1) \f$ degrees of freedom in the numerator and
* \f$ (m - 1) \f$ degrees of freedom in the denominator.
* - <tt>df1 BIGINT</tt> - Degrees of freedom in the numerator \f$ (n - 1) \f$
* - <tt>df2 BIGINT</tt> - Degrees of freedom in the denominator \f$ (m - 1) \f$
* - <tt>p_value_one_sided FLOAT8</tt> - Lower bound on one-sided p-value.
* In detail, the result is \f$ \Pr[F \geq f \mid \sigma_X = \sigma_Y] \f$,
* which is a lower bound on
* \f$ \Pr[F \geq f \mid \sigma_X \leq \sigma_Y] \f$. Computed as
* <tt>(1.0 - \ref fisher_f_cdf "fisher_f_cdf"(statistic))</tt>.
* - <tt>p_value_two_sided FLOAT8</tt> - Two-sided p-value, i.e.,
* \f$ 2 \cdot \min \{ p, 1 - p \} \f$ where
* \f$ p = \Pr[ F \geq f \mid \sigma_X = \sigma_Y] \f$. Computed as
* <tt>(min(p_value_one_sided, 1. - p_value_one_sided))</tt>.
*
* @usage
* - Test null hypothesis that the variance of the first sample is at most (or
* equal to, respectively) the variance of the second sample:
* <pre>SELECT (f_test(<em>first</em>, <em>value</em>)).* FROM <em>source</em></pre>
*
* @internal We reuse the two-sample t-test transition and merge functions.
*/
CREATE AGGREGATE MADLIB_SCHEMA.f_test(
/*+ "first" */ BOOLEAN,
/*+ "value" */ DOUBLE PRECISION) (
SFUNC=MADLIB_SCHEMA.t_test_two_transition,
STYPE=DOUBLE PRECISION[],
FINALFUNC=MADLIB_SCHEMA.f_test_final,
m4_ifdef(<!__POSTGRESQL__!>, <!!>, <!PREFUNC=MADLIB_SCHEMA.t_test_merge_states,!>)
INITCOND='{0,0,0,0,0,0,0}'
);
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.chi2_gof_test_transition(
state DOUBLE PRECISION[],
observed BIGINT,
expected DOUBLE PRECISION,
df BIGINT
) RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT
m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>);
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.chi2_gof_test_transition(
state DOUBLE PRECISION[],
observed BIGINT,
expected DOUBLE PRECISION
) RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT
m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>);
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.chi2_gof_test_transition(
state DOUBLE PRECISION[],
observed BIGINT
) RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT
m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>);
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.chi2_gof_test_merge_states(
state1 DOUBLE PRECISION[],
state2 DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT
m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>);
DROP TYPE IF EXISTS MADLIB_SCHEMA.chi2_test_result CASCADE;
CREATE TYPE MADLIB_SCHEMA.chi2_test_result AS (
statistic DOUBLE PRECISION,
p_value DOUBLE PRECISION,
df BIGINT,
phi DOUBLE PRECISION,
contingency_coef DOUBLE PRECISION
);
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.chi2_gof_test_final(
state DOUBLE PRECISION[]
) RETURNS MADLIB_SCHEMA.chi2_test_result
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT
m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>);
/**
* @brief Perform Pearson's chi-squared goodness-of-fit test
*
* Let \f$ n_1, \dots, n_k \f$ be a realization of a (vector) random variable
* \f$ N = (N_1, \dots, N_k) \f$ that follows the multinomial distribution with
* parameters \f$ k \f$ and \f$ p = (p_1, \dots, p_k) \f$. Test the null
* hypothesis \f$ H_0 : p = p^0 \f$.
*
* @param observed Number \f$ n_i \f$ of observations of the current event/row
* @param expected Expected number of observations of current event/row. This
* number is not required to be normalized. That is, \f$ p^0_i \f$ will be
* taken as \c expected divided by <tt>sum(expected)</tt>. Hence, if this
* parameter is not specified, chi2_test() will by default use
* \f$ p^0 = (\frac 1k, \dots, \frac 1k) \f$, i.e., test that \f$ p \f$ is a
* discrete uniform distribution.
* @param df Degrees of freedom. This is the number of events reduced by the
* degree of freedom lost by using the observed numbers for defining the
* expected number of observations. If this parameter is 0, the degree
* of freedom is taken as \f$ (k - 1) \f$.
*
* @return A composite value as follows. Let \f$ n = \sum_{i=1}^n n_i \f$.
* - <tt>statistic FLOAT8</tt> - Statistic
* \f[
* \chi^2 = \sum_{i=1}^k \frac{(n_i - np_i)^2}{np_i}
* \f]
* The corresponding random
* variable is approximately chi-squared distributed with
* \c df degrees of freedom.
* - <tt>df BIGINT</tt> - Degrees of freedom
* - <tt>p_value FLOAT8</tt> - Approximate p-value, i.e.,
* \f$ \Pr[X^2 \geq \chi^2 \mid p = p^0] \f$. Computed as
* <tt>(1.0 - \ref chi_squared_cdf "chi_squared_cdf"(statistic))</tt>.
* - <tt>phi FLOAT8</tt> - Phi coefficient, i.e.,
* \f$ \phi = \sqrt{\frac{\chi^2}{n}} \f$
* - <tt>contingency_coef FLOAT8</tt> - Contingency coefficient, i.e.,
* \f$ \sqrt{\frac{\chi^2}{n + \chi^2}} \f$
*
* @usage
* - Test null hypothesis that all possible outcomes of a categorical variable
* are equally likely:
* <pre>SELECT (chi2_gof_test(<em>observed</em>, 1, NULL)).* FROM <em>source</em></pre>
* - Test null hypothesis that two categorical variables are independent.
* Such data is often shown in a <em>contingency table</em> (also known as
* \em crosstab). A crosstab is a matrix where possible values for the first
* variable correspond to rows and values for the second variable to
* columns. The matrix elements are the observation frequencies of the
* joint occurrence of the respective values.
* chi2_gof_test() assumes that the crosstab is stored in normalized form,
* i.e., there are three columns <tt><em>var1</em></tt>,
* <tt><em>var2</em></tt>, <tt><em>observed</em></tt>.
* <pre>SELECT (chi2_gof_test(<em>observed</em>, expected, deg_freedom)).*
*FROM (
* SELECT
* <em>observed</em>,
* sum(<em>observed</em>) OVER (PARTITION BY var1)::DOUBLE PRECISION
* * sum(<em>observed</em>) OVER (PARTITION BY var2) AS expected
* FROM <em>source</em>
*) p, (
* SELECT
* (count(DISTINCT <em>var1</em>) - 1) * (count(DISTINCT <em>var2</em>) - 1) AS deg_freedom
* FROM <em>source</em>
*) q;</pre>
*/
CREATE AGGREGATE MADLIB_SCHEMA.chi2_gof_test(
/*+ observed */ BIGINT,
/*+ expected */ DOUBLE PRECISION /*+ DEFAULT 1 */,
/*+ df */ BIGINT /*+ DEFAULT 0 */
) (
SFUNC=MADLIB_SCHEMA.chi2_gof_test_transition,
STYPE=DOUBLE PRECISION[],
FINALFUNC=MADLIB_SCHEMA.chi2_gof_test_final,
m4_ifdef(<!__POSTGRESQL__!>, <!!>, <!PREFUNC=MADLIB_SCHEMA.chi2_gof_test_merge_states,!>)
INITCOND='{0,0,0,0,0,0}'
);
CREATE AGGREGATE MADLIB_SCHEMA.chi2_gof_test(
/*+ observed */ BIGINT,
/*+ expected */ DOUBLE PRECISION
) (
SFUNC=MADLIB_SCHEMA.chi2_gof_test_transition,
STYPE=DOUBLE PRECISION[],
FINALFUNC=MADLIB_SCHEMA.chi2_gof_test_final,
m4_ifdef(<!__POSTGRESQL__!>, <!!>, <!PREFUNC=MADLIB_SCHEMA.chi2_gof_test_merge_states,!>)
INITCOND='{0,0,0,0,0,0,0}'
);
CREATE AGGREGATE MADLIB_SCHEMA.chi2_gof_test(
/*+ observed */ BIGINT
) (
SFUNC=MADLIB_SCHEMA.chi2_gof_test_transition,
STYPE=DOUBLE PRECISION[],
FINALFUNC=MADLIB_SCHEMA.chi2_gof_test_final,
m4_ifdef(<!__POSTGRESQL__!>, <!!>, <!PREFUNC=MADLIB_SCHEMA.chi2_gof_test_merge_states,!>)
INITCOND='{0,0,0,0,0,0,0}'
);
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.ks_test_transition(
state DOUBLE PRECISION[],
"first" BOOLEAN,
"value" DOUBLE PRECISION,
"numFirst" BIGINT,
"numSecond" BIGINT
) RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT
m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>);
DROP TYPE IF EXISTS MADLIB_SCHEMA.ks_test_result CASCADE;
CREATE TYPE MADLIB_SCHEMA.ks_test_result AS (
statistic DOUBLE PRECISION,
k_statistic DOUBLE PRECISION,
p_value DOUBLE PRECISION
);
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.ks_test_final(
state DOUBLE PRECISION[])
RETURNS MADLIB_SCHEMA.ks_test_result
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT
m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>);
/**
* @brief Perform Kolmogorov-Smirnov test
*
* Given realizations \f$ x_1, \dots, x_m \f$ and \f$ y_1, \dots, y_m \f$ of
* i.i.d. random variables \f$ X_1, \dots, X_m \f$ and i.i.d.
* \f$ Y_1, \dots, Y_n \f$, respectively, test the null hypothesis that the
* underlying distributions function \f$ F_X, F_Y \f$ are identical, i.e.,
* \f$ H_0 : F_X = F_Y \f$.
*
* @param first Determines whether the value belongs to the first
* (if \c TRUE) or the second sample (if \c FALSE)
* @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$
* @param m Size \f$ m \f$ of the first sample. See usage instructions below.
* @param n Size of the second sample. See usage instructions below.
*
* @return A composite value.
* - <tt>statistic FLOAT8</tt> - Kolmogorov–Smirnov statistic
* \f[
* d = \max_{t \in \mathbb R} |F_x(t) - F_y(t)|
* \f]
* where \f$ F_x(t) := \frac 1m |\{ i \mid x_i \leq t \}| \f$ and
* \f$ F_y \f$ (defined likewise) are the empirical distribution functions.
* - <tt>k_statistic FLOAT8</tt> - Kolmogorov statistic
* \f$
* k = (r + 0.12 + \frac{0.11}{r}) \cdot d
* \f$
* where
* \f$
* r = \sqrt{\frac{m n}{m+n}}.
* \f$
* and \f$ d \f$ is the statistic.
* Then \f$ k \f$ is approximately Kolmogorov distributed.
* - <tt>p_value FLOAT8</tt> - Approximate p-value, i.e., an approximate value
* for \f$ \Pr[D \geq d \mid F_X = F_Y] \f$. Computed as
* <tt>(1.0 - \ref kolmogorov_cdf "kolmogorov_cdf"(k_statistic))</tt>.
*
* @usage
* - Test null hypothesis that two samples stem from the same distribution:
* <pre>SELECT (ks_test(<em>first</em>, <em>value</em>,
* (SELECT count(<em>value</em>) FROM <em>source</em> WHERE <em>first</em>),
* (SELECT count(<em>value</em>) FROM <em>source</em> WHERE NOT <em>first</em>)
* ORDER BY <em>value</em>
*)).* FROM <em>source</em></pre>
*
* @note
* This aggregate must be used as an ordered aggregate
* (<tt>ORDER BY \em value</tt>) and will raise an exception if values are
* not ordered.
*/
m4_ifdef(<!__HAS_ORDERED_AGGREGATES__!>,<!
CREATE
m4_ifdef(<!__POSTGRESQL__!>, <!!>, <!ORDERED!>)
AGGREGATE MADLIB_SCHEMA.ks_test(
/*+ "first" */ BOOLEAN,
/*+ "value" */ DOUBLE PRECISION,
/*+ m */ BIGINT,
/*+ n */ BIGINT
) (
SFUNC=MADLIB_SCHEMA.ks_test_transition,
STYPE=DOUBLE PRECISION[],
FINALFUNC=MADLIB_SCHEMA.ks_test_final,
INITCOND='{0,0,0,0,0,0,0}'
);
!>)
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.mw_test_transition(
state DOUBLE PRECISION[],
"first" BOOLEAN,
"value" DOUBLE PRECISION
) RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT
m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>);
DROP TYPE IF EXISTS MADLIB_SCHEMA.mw_test_result CASCADE;
CREATE TYPE MADLIB_SCHEMA.mw_test_result AS (
statistic DOUBLE PRECISION,
u_statistic DOUBLE PRECISION,
p_value_one_sided DOUBLE PRECISION,
p_value_two_sided DOUBLE PRECISION
);
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.mw_test_final(
state DOUBLE PRECISION[])
RETURNS MADLIB_SCHEMA.mw_test_result
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT
m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>);
/**
* @brief Perform Mann-Whitney test
*
* Given realizations \f$ x_1, \dots, x_m \f$ and \f$ y_1, \dots, y_m \f$ of
* i.i.d. random variables \f$ X_1, \dots, X_m \f$ and i.i.d.
* \f$ Y_1, \dots, Y_n \f$, respectively, test the null hypothesis that the
* underlying distributions are equal, i.e.,
* \f$ H_0 : \forall i,j: \Pr[X_i > Y_j] + \frac{\Pr[X_i = Y_j]}{2} = \frac 12 \f$.
*
* @param first Determines whether the value belongs to the first
* (if \c TRUE) or the second sample (if \c FALSE)
* @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$
*
* @return A composite value.
* - <tt>statistic FLOAT8</tt> - Statistic
* \f[
* z = \frac{u - \bar x}{\sqrt{\frac{mn(m+n+1)}{12}}}
* \f]
* where \f$ u \f$ is the u-statistic computed as follows. The z-statistic
* is approximately standard normally distributed.
* - <tt>u_statistic FLOAT8</tt> - Statistic
* \f$ u = \min \{ u_x, u_y \} \f$ where
* \f[
* u_x = mn + \binom{m+1}{2} - \sum_{i=1}^m r_{x,i}
* \f]
* where
* \f[
* r_{x,i}
* = \{ j \mid x_j < x_i \} + \{ j \mid y_j < x_i \} +
* \frac{\{ j \mid x_j = x_i \} + \{ j \mid y_j = x_i \} + 1}{2}
* \f]
* is defined as the rank of \f$ x_i \f$ in the combined list of all
* \f$ m+n \f$ observations. For ties, the average rank of all equal values
* is used.
* - <tt>p_value_one_sided FLOAT8</tt> - Approximate one-sided p-value, i.e.,
* an approximate value for \f$ \Pr[Z \geq z \mid H_0] \f$. Computed as
* <tt>(1.0 - \ref normal_cdf "normal_cdf"(z_statistic))</tt>.
* - <tt>p_value_two_sided FLOAT8</tt> - Approximate two-sided p-value, i.e.,
* an approximate value for \f$ \Pr[|Z| \geq |z| \mid H_0] \f$. Computed as
* <tt>(2 * \ref normal_cdf "normal_cdf"(-abs(z_statistic)))</tt>.
*
* @usage
* - Test null hypothesis that two samples stem from the same distribution:
* <pre>SELECT (mw_test(<em>first</em>, <em>value</em> ORDER BY <em>value</em>)).* FROM <em>source</em></pre>
*
* @note
* This aggregate must be used as an ordered aggregate
* (<tt>ORDER BY \em value</tt>) and will raise an exception if values are
* not ordered.
*/
m4_ifdef(<!__HAS_ORDERED_AGGREGATES__!>,<!
CREATE
m4_ifdef(<!__POSTGRESQL__!>, <!!>, <!ORDERED!>)
AGGREGATE MADLIB_SCHEMA.mw_test(
/*+ "first" */ BOOLEAN,
/*+ "value" */ DOUBLE PRECISION
) (
SFUNC=MADLIB_SCHEMA.mw_test_transition,
STYPE=DOUBLE PRECISION[],
FINALFUNC=MADLIB_SCHEMA.mw_test_final,
INITCOND='{0,0,0,0,0,0,0}'
);
!>)
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.wsr_test_transition(
state DOUBLE PRECISION[],
value DOUBLE PRECISION,
"precision" DOUBLE PRECISION
) RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT
m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>);
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.wsr_test_transition(
state DOUBLE PRECISION[],
value DOUBLE PRECISION
) RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT
m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>);
DROP TYPE IF EXISTS MADLIB_SCHEMA.wsr_test_result CASCADE;
CREATE TYPE MADLIB_SCHEMA.wsr_test_result AS (
statistic DOUBLE PRECISION,
rank_sum_pos FLOAT8,
rank_sum_neg FLOAT8,
num BIGINT,
z_statistic DOUBLE PRECISION,
p_value_one_sided DOUBLE PRECISION,
p_value_two_sided DOUBLE PRECISION
);
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.wsr_test_final(
state DOUBLE PRECISION[])
RETURNS MADLIB_SCHEMA.wsr_test_result
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT
m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>);
/**
* @brief Perform Wilcoxon-Signed-Rank test
*
* Given realizations \f$ x_1, \dots, x_n \f$ of i.i.d. random variables
* \f$ X_1, \dots, X_n \f$ with unknown mean \f$ \mu \f$, test the null
* hypotheses \f$ H_0 : \mu \leq 0 \f$ and \f$ H_0 : \mu = 0 \f$.
*
* @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$. Values of 0
* are ignored (i.e., they do not count towards \f$ n \f$).
* @param precision The precision \f$ \epsilon_i \f$ with which value is known.
* The precision determines the handling of ties. The current value
* \f$ v_i \f$ is regarded a tie with the previous value \f$ v_{i-1} \f$ if
* \f$ v_i - \epsilon_i \leq \max_{j=1, \dots, i-1} v_j + \epsilon_j \f$.
* If \c precision is negative, then it will be treated as
* <tt>value * 2^(-52)</tt>. (Note that \f$ 2^{-52} \f$ is the machine
* epsilon for type <tt>DOUBLE PRECISION</tt>.)
*
* @return A composite value:
* - <tt>statistic FLOAT8</tt> - statistic computed as follows. Let
* \f$
* w^+ = \sum_{i \mid x_i > 0} r_i
* \f$
* and
* \f$
* w^- = \sum_{i \mid x_i < 0} r_i
* \f$
* be the <em>signed rank sums</em> where
* \f[
* r_i
* = \{ j \mid |x_j| < |x_i| \}
* + \frac{\{ j \mid |x_j| = |x_i| \} + 1}{2}.
* \f]
* The Wilcoxon signed-rank statistic is \f$ w = \min \{ w^+, w^- \} \f$.
* - <tt>rank_sum_pos FLOAT8</tt> - rank sum of all positive values, i.e., \f$ w^+ \f$
* - <tt>rank_sum_neg FLOAT8</tt> - rank sum of all negative values, i.e., \f$ w^- \f$
* - <tt>num BIGINT</tt> - number \f$ n \f$ of non-zero values
* - <tt>z_statistic FLOAT8</tt> - z-statistic
* \f[
* z = \frac{w^+ - \frac{n(n+1)}{4}}
* {\sqrt{\frac{n(n+1)(2n+1)}{24}
* - \sum_{i=1}^n \frac{t_i^2 - 1}{48}}}
* \f]
* where \f$ t_i \f$ is the number of
* values with absolute value equal to \f$ |x_i| \f$. The corresponding
* random variable is approximately standard normally distributed.
* - <tt>p_value_one_sided FLOAT8</tt> - One-sided p-value i.e.,
* \f$ \Pr[Z \geq z \mid \mu \leq 0] \f$. Computed as
* <tt>(1.0 - \ref normal_cdf "normal_cdf"(z_statistic))</tt>.
* - <tt>p_value_two_sided FLOAT8</tt> - Two-sided p-value, i.e.,
* \f$ \Pr[ |Z| \geq |z| \mid \mu = 0] \f$. Computed as
* <tt>(2 * \ref normal_cdf "normal_cdf"(-abs(z_statistic)))</tt>.
*
* @usage
* - One-sample test: Test null hypothesis that the mean of a sample is at
* most (or equal to, respectively) \f$ \mu_0 \f$:
* <pre>SELECT (wsr_test(<em>value</em> - <em>mu_0</em> ORDER BY abs(<em>value</em>))).* FROM <em>source</em></pre>
* - Dependent paired test: Test null hypothesis that the mean difference
* between the first and second value in a pair is at most (or equal to,
* respectively) \f$ \mu_0 \f$:
* <pre>SELECT (wsr_test(<em>first</em> - <em>second</em> - <em>mu_0</em> ORDER BY abs(<em>first</em> - <em>second</em>))).* FROM <em>source</em></pre>
* If correctly determining ties is important (e.g., you may want to do so
* when comparing to software products that take \c first, \c second,
* and \c mu_0 as individual parameters), supply the precision parameter.
* This can be done as follows:
* <pre>SELECT (wsr_test(
<em>first</em> - <em>second</em> - <em>mu_0</em>,
3 * 2^(-52) * greatest(first, second, mu_0)
ORDER BY abs(<em>first</em> - <em>second</em>)
)).* FROM <em>source</em></pre>
* Here \f$ 2^{-52} \f$ is the machine epsilon, which we scale to the
* magnitude of the input data and multiply with 3 because we have a sum with
* three terms.
*
* @note
* This aggregate must be used as an ordered aggregate
* (<tt>ORDER BY abs(\em value</tt>)) and will raise an exception if the
* absolute values are not ordered.
*/
m4_ifdef(<!__HAS_ORDERED_AGGREGATES__!>,<!
CREATE
m4_ifdef(<!__POSTGRESQL__!>, <!!>, <!ORDERED!>)
AGGREGATE MADLIB_SCHEMA.wsr_test(
/*+ "value" */ DOUBLE PRECISION,
/*+ "precision" */ DOUBLE PRECISION /*+ DEFAULT -1 */
) (
SFUNC=MADLIB_SCHEMA.wsr_test_transition,
STYPE=DOUBLE PRECISION[],
FINALFUNC=MADLIB_SCHEMA.wsr_test_final,
INITCOND='{0,0,0,0,0,0,0,0,0}'
);
!>)
m4_ifdef(<!__HAS_ORDERED_AGGREGATES__!>,<!
CREATE
m4_ifdef(<!__POSTGRESQL__!>, <!!>, <!ORDERED!>)
AGGREGATE MADLIB_SCHEMA.wsr_test(
/*+ value */ DOUBLE PRECISION
) (
SFUNC=MADLIB_SCHEMA.wsr_test_transition,
STYPE=DOUBLE PRECISION[],
FINALFUNC=MADLIB_SCHEMA.wsr_test_final,
INITCOND='{0,0,0,0,0,0,0,0,0}'
);
!>)
DROP TYPE IF EXISTS MADLIB_SCHEMA.one_way_anova_result CASCADE;
CREATE TYPE MADLIB_SCHEMA.one_way_anova_result AS (
sum_squares_between DOUBLE PRECISION,
sum_squares_within DOUBLE PRECISION,
df_between BIGINT,
df_within BIGINT,
mean_squares_between DOUBLE PRECISION,
mean_squares_within DOUBLE PRECISION,
statistic DOUBLE PRECISION,
p_value DOUBLE PRECISION
);
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.one_way_anova_transition(
state DOUBLE PRECISION[],
"group" INTEGER,
value DOUBLE PRECISION)
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT
m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>);
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.one_way_anova_merge_states(
state1 DOUBLE PRECISION[],
state2 DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT
m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>);
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.one_way_anova_final(
state DOUBLE PRECISION[])
RETURNS MADLIB_SCHEMA.one_way_anova_result
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT
m4_ifdef(<!__HAS_FUNCTION_PROPERTIES__!>, <!NO SQL!>, <!!>);
/**
* @brief Perform one-way analysis of variance
*
* Given realizations
* \f$ x_{1,1}, \dots, x_{1, n_1}, x_{2,1}, \dots, x_{2,n_2}, \dots, x_{k,n_k} \f$
* of i.i.d. random variables \f$ X_{i,j} \sim N(\mu_i, \sigma^2) \f$ with
* unknown parameters \f$ \mu_1, \dots, \mu_k \f$ and \f$ \sigma^2 \f$, test the
* null hypotheses \f$ H_0 : \mu_1 = \dots = \mu_k \f$.
*
* @param group Group which \c value is from. Note that \c group can assume
* arbitary value not limited to a continguous range of integers.
* @param value Value of random variate \f$ x_{i,j} \f$
*
* @return A composite value as follows. Let \f$ n := \sum_{i=1}^k n_i \f$ be
* the total size of all samples. Denote by \f$ \bar x \f$ the grand
* mean, by \f$ \overline{x_i} \f$ the group
* sample means, and by \f$ s_i^2 \f$ the group
* sample variances.
* - <tt>sum_squares_between DOUBLE PRECISION</tt> - sum of squares between the
* group means, i.e.,
* \f$
* \mathit{SS}_b = \sum_{i=1}^k n_i (\overline{x_i} - \bar x)^2.
* \f$
* - <tt>sum_squares_within DOUBLE PRECISION</tt> - sum of squares within the
* groups, i.e.,
* \f$
* \mathit{SS}_w = \sum_{i=1}^k (n_i - 1) s_i^2.
* \f$
* - <tt>df_between BIGINT</tt> - degree of freedom for between-group variation \f$ (k-1) \f$
* - <tt>df_within BIGINT</tt> - degree of freedom for within-group variation \f$ (n-k) \f$
* - <tt>mean_squares_between DOUBLE PRECISION</tt> - mean square between
* groups, i.e.,
* \f$
* s_b^2 := \frac{\mathit{SS}_b}{k-1}
* \f$
* - <tt>mean_squares_within DOUBLE PRECISION</tt> - mean square within
* groups, i.e.,
* \f$
* s_w^2 := \frac{\mathit{SS}_w}{n-k}
* \f$
* - <tt>statistic DOUBLE PRECISION</tt> - Statistic computed as
* \f[
* f = \frac{s_b^2}{s_w^2}.
* \f]
* This statistic is Fisher F-distributed with \f$ (k-1) \f$ degrees of
* freedom in the numerator and \f$ (n-k) \f$ degrees of freedom in the
* denominator.
* - <tt>p_value DOUBLE PRECISION</tt> - p-value, i.e.,
* \f$ \Pr[ F \geq f \mid H_0] \f$.
*
* @usage
* - Test null hypothesis that the mean of the all samples is equal:
* <pre>SELECT (one_way_anova(<em>group</em>, <em>value</em>)).* FROM <em>source</em></pre>
*/
CREATE AGGREGATE MADLIB_SCHEMA.one_way_anova(
/*+ group */ INTEGER,
/*+ value */ DOUBLE PRECISION) (
SFUNC=MADLIB_SCHEMA.one_way_anova_transition,
STYPE=DOUBLE PRECISION[],
FINALFUNC=MADLIB_SCHEMA.one_way_anova_final,
m4_ifdef(<!__POSTGRESQL__!>, <!!>, <!PREFUNC=MADLIB_SCHEMA.one_way_anova_merge_states,!>)
INITCOND='{0,0}'
);
m4_changequote(<!`!>,<!'!>)