Reformat doc in Early Stage Development modules to new template.
diff --git a/methods/sketch/src/pg_gp/sketch.sql_in b/methods/sketch/src/pg_gp/sketch.sql_in
index bfe99f8..ab7667e 100644
--- a/methods/sketch/src/pg_gp/sketch.sql_in
+++ b/methods/sketch/src/pg_gp/sketch.sql_in
@@ -14,208 +14,336 @@
/**
@addtogroup grp_sketches
+<div class="toc"><b>Contents</b>
+<ul>
+<li>\ref grp_countmin</li>
+<li>\ref grp_fmsketch</li>
+<li>\ref grp_mfvsketch</li>
+</ul>
+</div>
+
\warning <em> This MADlib method is still in early stage development. There may be some
issues that will be addressed in a future version. Interface and implementation
is subject to change. </em>
-@about
-
Sketches (sometimes called "synopsis data structures") are small randomized
in-memory data structures that capture statistical properties of a large set
-of values (e.g. a column of a table). Sketches can be formed in a single
+of values (e.g., a column of a table). Sketches can be formed in a single
pass of the data, and used to approximate a variety of descriptive statistics.
-We implement sketches as SQL User-Defined Aggregates (UDAs). Because they
+We implement sketches as SQL User-Defined Aggregates (UDAs). Because they
are single-pass, small-space and parallelized, a single query can
use many sketches to gather summary statistics on many columns of a table efficiently.
This module currently implements user-defined aggregates based on three main sketch methods:
- - <i>Flajolet-Martin (FM)</i> sketches for approximating <c>COUNT(DISTINCT)</c>.
- <i>Count-Min (CM)</i> sketches, which can be used to approximate a number of descriptive statistics including
- <c>COUNT(*)</c> of rows whose column value matches a given value in a set
- <c>COUNT(*)</c> of rows whose column value falls in a range (*)
- order statistics including <i>median</i> and <i>centiles</i> (*)
- <i>histograms</i>: both <i>equi-width</i> and <i>equi-depth</i> (*)
+ - <i>Flajolet-Martin (FM)</i> sketches for approximating <c>COUNT(DISTINCT)</c>.
- <i>Most Frequent Value (MFV)</i> sketches, which output the most
frequently-occurring values in a column, along with their associated counts.
<i>Note:</i> Features marked with a star (*) only work for discrete types that
can be cast to int8.
-@implementation
-The sketch methods consists of a number of SQL UDAs (user defined aggregates)
+The sketch methods consist of a number of SQL UDAs (user defined aggregates)
and UDFs (user defined functions), to be used directly in SQL queries.
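+
+For example, because each sketch is an ordinary aggregate, several sketches can
+be combined in a single pass over a table (a minimal sketch; it assumes a table
+<em>data</em> with integer columns <em>class</em> and <em>a1</em>, as used in
+the method examples below):
+<pre class="example">
+SELECT fmsketch_dcount( class ),
+       cmsketch_count( cmsketch( a1 ), 2 )
+FROM data;
+</pre>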
*/
/**
@addtogroup grp_fmsketch
+<div class="toc"><b>Contents</b>
+<ul>
+<li><a href="#syntax">Syntax</a></li>
+<li><a href="#examples">Examples</a></li>
+<li><a href="#literature">Literature</a></li>
+<li><a href="#related">Related Topics</a></li>
+</ul>
+</div>
+
\warning <em> This MADlib method is still in early stage development. There may be some
issues that will be addressed in a future version. Interface and implementation
is subject to change. </em>
-@about
Flajolet-Martin's distinct count estimation
implemented as a user-defined aggregate.
-@usage
-- Get the number of distinct values in a designated column.
- <pre>SELECT \ref fmsketch_dcount(<em>col_name</em>) FROM table_name;</pre>
-
-@implementation
\ref fmsketch_dcount can be run on a column of any type.
It returns an approximation to the number of distinct values
(a la <c>COUNT(DISTINCT x)</c>), but faster and approximate.
Like any aggregate, it can be combined with a GROUP BY clause to do distinct
counts per group.
+@anchor syntax
+@par Syntax
+
+Get the number of distinct values in a designated column.
+<pre class="syntax">
+fmsketch_dcount( col_name )
+</pre>
+
+@anchor examples
@examp
--# Generate some data:
-\verbatim
-sql> CREATE TABLE data(class INT, a1 INT);
-sql> INSERT INTO data SELECT 1,1 FROM generate_series(1,10000);
-sql> INSERT INTO data SELECT 1,2 FROM generate_series(1,15000);
-sql> INSERT INTO data SELECT 1,3 FROM generate_series(1,10000);
-sql> INSERT INTO data SELECT 2,5 FROM generate_series(1,1000);
-sql> INSERT INTO data SELECT 2,6 FROM generate_series(1,1000);
-\endverbatim
--# Find distinct number of values for each class
-\verbatim
-sql> SELECT class,fmsketch_dcount(a1) FROM data GROUP BY data.class;
+-# Generate some data.
+<pre class="example">
+CREATE TABLE data(class INT, a1 INT);
+INSERT INTO data SELECT 1,1 FROM generate_series(1,10000);
+INSERT INTO data SELECT 1,2 FROM generate_series(1,15000);
+INSERT INTO data SELECT 1,3 FROM generate_series(1,10000);
+INSERT INTO data SELECT 2,5 FROM generate_series(1,1000);
+INSERT INTO data SELECT 2,6 FROM generate_series(1,1000);
+</pre>
+
+-# Find the distinct number of values for each class.
+<pre class="example">
+SELECT class, fmsketch_dcount(a1)
+FROM data
+GROUP BY data.class;
+</pre>
+Result:
+<pre class="result">
class | fmsketch_dcount
--------+-----------------
+ ------+-----------------
2 | 2
1 | 3
(2 rows)
-\endverbatim
+</pre>
+@anchor literature
@literature
[1] P. Flajolet and N.G. Martin. Probabilistic counting algorithms for data base applications, Journal of Computer and System Sciences 31(2), pp 182-209, 1985. http://algo.inria.fr/flajolet/Publications/FlMa85.pdf
-@sa File sketch.sql_in documenting the SQL function.
+@anchor related
+@par Related Topics
+File sketch.sql_in documenting the SQL function.
*/
/**
@addtogroup grp_countmin
+<div class="toc"><b>Contents</b>
+<ul>
+<li><a href="#syntax">Syntax</a></li>
+<li><a href="#examples">Examples</a></li>
+<li><a href="#literature">Literature</a></li>
+<li><a href="#related">Related Topics</a></li>
+</ul>
+</div>
+
\warning <em> This MADlib method is still in early stage development. There may be some
issues that will be addressed in a future version. Interface and implementation
is subject to change. </em>
-@about
This module implements Cormode-Muthukrishnan <i>CountMin</i> sketches
on integer values, implemented as a user-defined aggregate. It also provides
scalar functions over the sketches to produce approximate counts, order
statistics, and histograms.
-@usage
-
+@anchor syntax
+@par Syntax
- Get a sketch of a selected column specified by <em>col_name</em>.
- <pre>SELECT \ref cmsketch(<em>col_name</em>) FROM table_name;</pre>
+<pre class="syntax">
+cmsketch( col_name )
+</pre>
- Get the number of rows where <em>col_name = p</em>, computed from the sketch
obtained from <tt>cmsketch</tt>.
- <pre>SELECT \ref cmsketch_count(<em>cmsketch</em>,<em>p</em>) FROM table_name;</pre>
+<pre class="syntax">
+cmsketch_count( cmsketch,
+ p
+ )
+</pre>
- Get the number of rows where <em>col_name</em> is between <em>m</em> and <em>n</em> inclusive.
- <pre>SELECT \ref cmsketch_rangecount(<em>cmsketch</em>,<em>m</em>,<em>n</em>) FROM table_name;</pre>
+<pre class="syntax"
+cmsketch_rangecount( cmsketch,
+ m,
+ n
+ )
+</pre>
- Get the <em>k</em>th percentile of <em>col_name</em> where <em>count</em> specifies the number of rows. <em>k</em> should be an integer between 1 and 99.
- <pre>SELECT \ref cmsketch_centile(<em>cmsketch</em>,<em>k</em>,<em>count</em>) FROM table_name;</pre>
+<pre class="syntax">
+cmsketch_centile( cmsketch,
+ k,
+ count
+ )
+</pre>
-- Get the median of <em>col_name</em> where <em>count</em> specifies number of rows. This is equivalent to <tt>\ref cmsketch_centile(<em>cmsketch</em>,50,<em>count</em>)</tt>.
- <pre>SELECT \ref cmsketch_median(<em>cmsketch</em>,<em>count</em>) FROM table_name;</pre>
+- Get the median of <em>col_name</em> where <em>count</em> specifies the number of rows. This is equivalent to <tt>\ref cmsketch_centile(<em>cmsketch</em>,50,<em>count</em>)</tt>.
+<pre class="syntax">
+cmsketch_median( cmsketch,
+ count
+ )
+</pre>
- Get an n-bucket histogram for values between min and max for the column where each bucket has approximately the same width. The output is a text string containing triples {lo, hi, count} representing the buckets; counts are approximate.
- <pre>SELECT \ref cmsketch_width_histogram(<em>cmsketch</em>,<em>min</em>,<em>max</em>,<em>n</em>) FROM table_name;</pre>
+<pre class="syntax">
+cmsketch_width_histogram( cmsketch,
+ min,
+ max,
+ n
+ )
+</pre>
- Get an n-bucket histogram for the column where each bucket has approximately the same count. The output is a text string containing triples {lo, hi, count} representing the buckets; counts are approximate. Note that an equi-depth histogram is equivalent to a spanning set of equi-spaced centiles.
- <pre>SELECT \ref cmsketch_depth_histogram(<em>cmsketch</em>,<em>n</em>) FROM table_name;</pre>
+<pre class="syntax">
+cmsketch_depth_histogram( cmsketch,
+ n
+ )
+</pre>
+@anchor examples
@examp
--# Generate some data
-\verbatim
-sql> CREATE TABLE data(class INT, a1 INT);
-sql> INSERT INTO data SELECT 1,1 FROM generate_series(1,10000);
-sql> INSERT INTO data SELECT 1,2 FROM generate_series(1,15000);
-sql> INSERT INTO data SELECT 1,3 FROM generate_series(1,10000);
-sql> INSERT INTO data SELECT 2,5 FROM generate_series(1,1000);
-sql> INSERT INTO data SELECT 2,6 FROM generate_series(1,1000);
-\endverbatim
--# Count number of rows where a1 = 2 in each class
-\verbatim
-sql> SELECT class,cmsketch_count(cmsketch(a1),2) FROM data GROUP BY data.class;
+-# Generate some data.
+<pre class="example">
+CREATE TABLE data(class INT, a1 INT);
+INSERT INTO data SELECT 1,1 FROM generate_series(1,10000);
+INSERT INTO data SELECT 1,2 FROM generate_series(1,15000);
+INSERT INTO data SELECT 1,3 FROM generate_series(1,10000);
+INSERT INTO data SELECT 2,5 FROM generate_series(1,1000);
+INSERT INTO data SELECT 2,6 FROM generate_series(1,1000);
+</pre>
+
+-# Count number of rows where a1 = 2 in each class.
+<pre class="example">
+SELECT class,
+ cmsketch_count(
+ cmsketch( a1 ),
+ 2
+ )
+FROM data GROUP BY data.class;
+</pre>
+Result:
+<pre class="result">
class | cmsketch_count
--------+----------------
+ ------+----------------
2 | 0
1 | 15000
(2 rows)
-\endverbatim
--# Count number of rows where a1 is between 3 and 6
-\verbatim
-sql> SELECT class,cmsketch_rangecount(cmsketch(a1),3,6) FROM data GROUP BY data.class;
+</pre>
+
+-# Count number of rows where a1 is between 3 and 6.
+<pre class="example"
+SELECT class,
+ cmsketch_rangecount(
+ cmsketch(a1),
+ 3,
+ 6
+ )
+FROM data GROUP BY data.class;
+</pre>
+Result:
+<pre class="result">
+
class | cmsketch_rangecount
--------+---------------------
+ ------+---------------------
2 | 2000
1 | 10000
(2 rows)
-\endverbatim
--# Compute the 90th percentile of all of a1
-\verbatim
-sql> SELECT cmsketch_centile(cmsketch(a1),90,count(*)) FROM data;
+</pre>
+
+-# Compute the 90th percentile of all of a1.
+<pre class="example">
+SELECT cmsketch_centile(
+ cmsketch( a1 ),
+ 90,
+ count(*)
+ )
+FROM data;
+</pre>
+Result:
+<pre class="result">
cmsketch_centile
-------------------
+ -----------------
3
(1 row)
-\endverbatim
--# Produce an equi-width histogram with 2 bins between 0 and 10
-\verbatim
-sql> SELECT cmsketch_width_histogram(cmsketch(a1),0,10,2) FROM data;
+</pre>
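+
+-# Compute the median of all of a1 (a minimal sketch of the
+<em>cmsketch_median</em> call described above; it is equivalent to
+<em>cmsketch_centile</em> with <em>k</em>=50).
+<pre class="example">
+SELECT cmsketch_median(
+           cmsketch( a1 ),
+           count(*)
+       )
+FROM data;
+</pre>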
+
+-# Produce an equi-width histogram with 2 bins between 0 and 10.
+<pre class="example">
+SELECT cmsketch_width_histogram(
+ cmsketch( a1 ),
+ 0,
+ 10,
+ 2
+ )
+FROM data;
+</pre>
+Result:
+<pre class="result">
cmsketch_width_histogram
-------------------------------------
+ -----------------------------------
[[0L, 4L, 35000], [5L, 10L, 2000]]
(1 row)
-\endverbatim
--# Produce an equi-depth histogram of a1 with 2 bins of approximately equal depth
-\verbatim
-sql> SELECT cmsketch_depth_histogram(cmsketch(a1),2) FROM data;
+</pre>
+
+-# Produce an equi-depth histogram of a1 with 2 bins of approximately equal depth.
+<pre class="example">
+SELECT cmsketch_depth_histogram(
+ cmsketch( a1 ),
+ 2
+ )
+FROM data;
+</pre>
+Result:
+<pre class="result">
cmsketch_depth_histogram
------------------------------------------------------------------------
+ ----------------------------------------------------------------------
[[-9223372036854775807L, 1, 10000], [2, 9223372036854775807L, 27000]]
(1 row)
-\endverbatim
+</pre>
+@anchor literature
@literature
[1] G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. LATIN 2004, J. Algorithm 55(1): 58-75 (2005) . http://dimacs.rutgers.edu/~graham/pubs/html/CormodeMuthukrishnan04CMLatin.html
[2] G. Cormode. Encyclopedia entry on 'Count-Min Sketch'. In L. Liu and M. T. Ozsu, editors, Encyclopedia of Database Systems, pages 511-516. Springer, 2009. http://dimacs.rutgers.edu/~graham/pubs/html/Cormode09b.html
-@sa File sketch.sql_in documenting the SQL functions.
-\n\n Module grp_quantile for a different implementation of quantile function.
+@anchor related
+@par Related Topics
+File sketch.sql_in documenting the SQL functions.
+
+Module \ref grp_quantile for a different implementation of quantile function.
*/
/**
@addtogroup grp_mfvsketch
+<div class="toc"><b>Contents</b>
+<ul>
+<li><a href="#syntax">Syntax</a></li>
+<li><a href="#examples">Examples</a></li>
+<li><a href="#literature">Literature</a></li>
+<li><a href="#related">Related Topics</a></li>
+</ul>
+</div>
+
\warning <em> This MADlib method is still in early stage development. There may be some
issues that will be addressed in a future version. Interface and implementation
is subject to change. </em>
-@about
MFVSketch: Most Frequent Values variant of CountMin sketch, implemented
as a UDA.
-@usage
Produces an n-bucket histogram for a column where each bucket counts one of the
most frequent values in the column. The output is an array of doubles {value, count}
in descending order of frequency; counts are approximated via CountMin sketches.
Ties are handled arbitrarily.
-<pre>SELECT \ref mfvsketch_top_histogram(<em>col_name</em>,n) FROM table_name;</pre>
-<pre>SELECT \ref mfvsketch_top_histogram(<em>col_name</em>,n) FROM table_name;</pre>
+
+@anchor syntax
+@par Syntax
+
+<pre class="syntax">
+mfvsketch_top_histogram( col_name,
+ n
+ )
+</pre>
The MFV frequent-value UDA comes in two different versions:
- a faithful implementation that preserves the approximation guarantees
@@ -227,34 +355,46 @@
produce good results unless the number of values requested is very small,
or the distribution is very flat.
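+
+For example, the quick variant can be called with the same arguments as the
+faithful variant (a sketch; it assumes <em>mfvsketch_quick_histogram</em>
+shares the signature of <em>mfvsketch_top_histogram</em>):
+<pre class="example">
+SELECT mfvsketch_quick_histogram( a1,
+                                  5
+                                )
+FROM data;
+</pre>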
+@anchor examples
@examp
--# Generate some data
-\verbatim
-sql> CREATE TABLE data(class INT, a1 INT);
-sql> INSERT INTO data SELECT 1,1 FROM generate_series(1,10000);
-sql> INSERT INTO data SELECT 1,2 FROM generate_series(1,15000);
-sql> INSERT INTO data SELECT 1,3 FROM generate_series(1,10000);
-sql> INSERT INTO data SELECT 2,5 FROM generate_series(1,1000);
-sql> INSERT INTO data SELECT 2,6 FROM generate_series(1,1000);
-\endverbatim
--# Produce histogram of 5 bins and return the most frequent value and associated
-count in each bin:
-\verbatim
-sql> SELECT mfvsketch_top_histogram(a1,5) FROM data;
+-# Generate some data.
+<pre class="example">
+CREATE TABLE data(class INT, a1 INT);
+INSERT INTO data SELECT 1,1 FROM generate_series(1,10000);
+INSERT INTO data SELECT 1,2 FROM generate_series(1,15000);
+INSERT INTO data SELECT 1,3 FROM generate_series(1,10000);
+INSERT INTO data SELECT 2,5 FROM generate_series(1,1000);
+INSERT INTO data SELECT 2,6 FROM generate_series(1,1000);
+</pre>
+-# Produce a histogram of 5 bins and return the most frequent value and associated
+count in each bin.
+<pre class="example">
+SELECT mfvsketch_top_histogram( a1,
+ 5
+ )
+FROM data;
+</pre>
+Result:
+<pre class="result">
mfvsketch_top_histogram
---------------------------------------------------------------
+ -------------------------------------------------------------
[0:4][0:1]={{2,15000},{1,10000},{3,10000},{5,1000},{6,1000}}
(1 row)
-\endverbatim
+</pre>
+@anchor literature
@literature
This method is not usually called an MFV sketch in the literature; it
is a natural extension of the CountMin sketch.
-@sa File sketch.sql_in documenting the SQL functions.
-\n\n Module grp_countmin.
+@anchor related
+@par Related Topics
+
+File sketch.sql_in documenting the SQL functions.
+
+Module \ref grp_countmin.
*/
-- FM Sketch Functions
diff --git a/src/ports/postgres/modules/bayes/bayes.sql_in b/src/ports/postgres/modules/bayes/bayes.sql_in
index 9585360..c4eab0d 100644
--- a/src/ports/postgres/modules/bayes/bayes.sql_in
+++ b/src/ports/postgres/modules/bayes/bayes.sql_in
@@ -15,16 +15,287 @@
/**
@addtogroup grp_bayes
+<div class="toc"><b>Contents</b>
+<ul>
+<li><a href="#train">Training Function</a></li>
+<li><a href="#classify">Classify Function</a></li>
+<li><a href="#probabilities">Probabilities Function</a></li>
+<li><a href="#adhoc">Ad Hoc Computation</a></li>
+<li><a href="#notes">Implementation Notes</a></li>
+<li><a href="#examples">Examples</a></li>
+<li><a href="#background">Technical Background</a></li>
+<li><a href="#related">Related Topics</a></li>
+</ul>
+</div>
+
\warning <em> This MADlib method is still in early stage development. There may be some
issues that will be addressed in a future version. Interface and implementation
is subject to change. </em>
-@about
-
Naive Bayes refers to a stochastic model where all independent variables
\f$ a_1, \dots, a_n \f$ (often referred to as attributes in this context)
independently contribute to the probability that a data point belongs to a
-certain class \f$ c \f$. In detail, \b Bayes' theorem states that
+certain class \f$ c \f$.
+
+Naive Bayes classification estimates feature probabilities and class priors
+using maximum likelihood or Laplacian smoothing. These parameters are then
+used to classify new data.
+
+
+@anchor train
+@par Training Function
+
+
+Precompute feature probabilities and class priors:
+
+<pre class="syntax">
+create_nb_prepared_data_tables ( trainingSource,
+ trainingClassColumn,
+ trainingAttrColumn,
+ numAttrs,
+ featureProbsName,
+ classPriorsName
+ )
+</pre>
+
+The \e trainingSource is expected to be of the following form:
+<pre>{TABLE|VIEW} <em>trainingSource</em> (
+ ...
+ <em>trainingClassColumn</em> INTEGER,
+ <em>trainingAttrColumn</em> INTEGER[],
+ ...
+)</pre>
+
+
+The two output tables are:
+- \e featureProbsName – stores feature probabilities
+- \e classPriorsName – stores the class priors
+
+@anchor classify
+@par Classify Function
+
+Perform Naive Bayes classification:
+<pre class="syntax">
+create_nb_classify_view ( featureProbsName,
+ classPriorsName,
+ classifySource,
+ classifyKeyColumn,
+ classifyAttrColumn,
+ numAttrs,
+ destName
+ )
+</pre>
+
+The <b>data to classify</b> is expected to be of the following form:
+<pre>{TABLE|VIEW} <em>classifySource</em> (
+ ...
+ <em>classifyKeyColumn</em> ANYTYPE,
+ <em>classifyAttrColumn</em> INTEGER[],
+ ...
+)</pre>
+
+
+This function creates the view <tt><em>destName</em></tt> mapping
+<em>classifyKeyColumn</em> to the Naive Bayes classification.
+<pre class="result">
+key | nb_classification
+ ---+------------------
+...
+</pre>
+
+@anchor probabilities
+@par Probabilities Function
+
+Compute Naive Bayes probabilities.
+<pre class="syntax">
+create_nb_probs_view( featureProbsName,
+ classPriorsName,
+ classifySource,
+ classifyKeyColumn,
+ classifyAttrColumn,
+ numAttrs,
+ destName
+ )
+</pre>
+
+This creates the view <tt><em>destName</em></tt> mapping
+<em>classifyKeyColumn</em> and every single class to the Naive Bayes
+probability:
+<pre class="result">
+key | class | nb_prob
+ ---+-------+--------
+...
+</pre>
+
+@anchor adhoc
+@par Ad Hoc Computation Function
+
+The functions create_nb_classify_view() and create_nb_probs_view() can be
+used in an ad hoc fashion without the precomputation step. In this case,
+replace the function arguments
+
+<pre>'<em>featureProbsName</em>', '<em>classPriorsName</em>'</pre>
+with
+<pre>'<em>trainingSource</em>', '<em>trainingClassColumn</em>', '<em>trainingAttrColumn</em>'</pre>
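+
+For example, an ad hoc classification over the tables from the examples below
+might look as follows (a sketch; <em>nb_classify_view_adhoc</em> is a
+hypothetical output view name):
+<pre class="example">
+SELECT madlib.create_nb_classify_view( 'training',
+                                       'class',
+                                       'attributes',
+                                       'toclassify',
+                                       'id',
+                                       'attributes',
+                                       3,
+                                       'nb_classify_view_adhoc'
+                                     );
+</pre>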
+
+@anchor notes
+@par Implementation Notes
+- The probabilities computed on PostgreSQL and on Greenplum Database can
+differ slightly due to the nature of floating-point computation. Usually this
+is not important. However, if a data point has
+\f[
+P(C=c_i \mid A) \approx P(C=c_j \mid A)
+\f]
+for two classes, this data point might be classified into different classes on
+PostgreSQL and Greenplum. The classification of some data sets can therefore
+differ between PostgreSQL and Greenplum, but this should not affect the
+quality of the results.
+
+- When two classes have equal and highest probability among all classes,
+the classification result is an array of these two classes, but the order
+of the two classes is random.
+
+- The current implementation of Naive Bayes classification is only suitable
+for discontinuous (categorical) attributes.\n
+For continuous data, a typical assumption, usually used for small datasets,
+is that the continuous values associated with each class are distributed
+according to a Gaussian distribution,
+and then the probabilities \f$ P(A_i = a \mid C=c) \f$ can be estimated.
+Another common technique for handling continuous values, which is better for
+large data sets, is to use binning to discretize the values and convert the
+continuous data into categorical bins (see the sketch after this list).
+These approaches are not currently implemented but are planned for future
+releases.
+
+- One can still provide floating point data to the naive Bayes
+classification function. Floating point numbers can be used as symbolic
+substitutions for categorical data. The classification works best if
+there are sufficient data points for each floating point attribute. However,
+if floating point numbers are used as continuous data, no warning is raised and
+the result may not be as expected.
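+
+As an illustration of the binning approach mentioned above (not part of this
+module), continuous values can be discretized with standard SQL; here the
+<em>measurements</em> table and <em>temperature</em> column are hypothetical:
+<pre class="example">
+SELECT id,
+       width_bucket( temperature, 0.0, 100.0, 10 ) AS temp_bin
+FROM measurements;
+</pre>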
+
+@anchor examples
+@examp
+
+The following is an extremely simplified example, using the precomputed
+tables, that can be verified by hand.
+
+-# The training and the classification data.
+<pre class="example">
+SELECT * FROM training;
+</pre>
+Result:
+<pre class="result">
+ id | class | attributes
+ ---+-------+------------
+ 1 | 1 | {1,2,3}
+ 2 | 1 | {1,2,1}
+ 3 | 1 | {1,4,3}
+ 4 | 2 | {1,2,2}
+ 5 | 2 | {0,2,2}
+ 6 | 2 | {0,1,3}
+(6 rows)
+</pre>
+<pre class="example">
+SELECT * FROM toclassify;
+</pre>
+Result:
+<pre class="result">
+ id | attributes
+ ---+------------
+ 1 | {0,2,1}
+ 2 | {1,2,3}
+(2 rows)
+</pre>
+
+-# Precompute feature probabilities and class priors.
+<pre class="example">
+SELECT madlib.create_nb_prepared_data_tables( 'training',
+ 'class',
+ 'attributes',
+ 3,
+ 'nb_feature_probs',
+ 'nb_class_priors'
+ );
+</pre>
+
+-# Optionally check the contents of the precomputed tables.
+<pre class="example">
+SELECT * FROM nb_class_priors;
+</pre>
+Result:
+<pre class="result">
+ class | class_cnt | all_cnt
+ ------+-----------+---------
+ 1 | 3 | 6
+ 2 | 3 | 6
+(2 rows)
+</pre>
+<pre class="example">
+SELECT * FROM nb_feature_probs;
+</pre>
+Result:
+<pre class="result">
+ class | attr | value | cnt | attr_cnt
+ ------+------+-------+-----+----------
+ 1 | 1 | 0 | 0 | 2
+ 1 | 1 | 1 | 3 | 2
+ 1 | 2 | 1 | 0 | 3
+ 1 | 2 | 2 | 2 | 3
+...
+</pre>
+
+-# Create the view with Naive Bayes classification and check the results.
+<pre class="example">
+SELECT madlib.create_nb_classify_view( 'nb_feature_probs',
+ 'nb_class_priors',
+ 'toclassify',
+ 'id',
+ 'attributes',
+ 3,
+ 'nb_classify_view_fast'
+ );
+
+SELECT * FROM nb_classify_view_fast;
+</pre>
+Result:
+<pre class="result">
+ key | nb_classification
+ ----+-------------------
+ 1 | {2}
+ 2 | {1}
+(2 rows)
+</pre>
+
+-# Look at the probabilities for each class (note that we use "Laplacian smoothing").
+<pre class="example">
+SELECT madlib.create_nb_probs_view( 'nb_feature_probs',
+ 'nb_class_priors',
+ 'toclassify',
+ 'id',
+ 'attributes',
+ 3,
+ 'nb_probs_view_fast'
+ );
+
+SELECT * FROM nb_probs_view_fast;
+</pre>
+Result:
+<pre class="result">
+ key | class | nb_prob
+ ----+-------+---------
+ 1 | 1 | 0.4
+ 1 | 2 | 0.6
+ 2 | 1 | 0.75
+ 2 | 2 | 0.25
+(4 rows)
+</pre>
+
+
+@anchor background
+@par Technical Background
+
+\b Bayes' theorem states that
\f[
\Pr(C = c \mid A_1 = a_1, \dots, A_n = a_n)
= \frac{\Pr(C = c) \cdot \Pr(A_1 = a_1, \dots, A_n = a_n \mid C = c)}
@@ -78,174 +349,8 @@
The case \f$ s = 1 \f$ is known as "Laplace smoothing". The case \f$ s = 0 \f$
trivially reduces to maximum-likelihood estimates.
-\b Note:
-(1) The probabilities computed on the platforms of PostgreSQL and Greenplum
-database have a small difference due to the nature of floating point
-computation. Usually this is not important. However, if a data point has
-\f[
-P(C=c_i \mid A) \approx P(C=c_j \mid A)
-\f]
-for two classes, this data point might be classified into diferent classes on
-PostgreSQL and Greenplum. This leads to the differences in classifications
-on PostgreSQL and Greenplum for some data sets, but this should not
-affect the quality of the results.
-(2) When two classes have equal and highest probability among all classes,
-the classification result is an array of these two classes, but the order
-of the two classes is random.
-
-(3) The current implementation of Naive Bayes classification is only suitable
-for discontinuous (categorial) attributes.
-
-For continuous data, a typical assumption, usually used for small datasets,
-is that the continuous values associated with each class are distributed
-according to a Gaussian distribution,
-and then the probabilities \f$ P(A_i = a \mid C=c) \f$ can be estimated.
-Another common technique for handling continuous values, which is better for
-large data sets, is to use binning to discretize the values, and convert the
-continuous data into categorical bins. These approaches are currently not
-implemented and planned for future releases.
-
-(4) One can still provide floating point data to the naive Bayes
-classification function. Floating point numbers can be used as symbolic
-substitutions for categorial data. The classification would work best if
-there are sufficient data points for each floating point attribute. However,
-if floating point numbers are used as continuous data, no warning is raised and
-the result may not be as expected.
-
-@input
-
-The <b>training data</b> is expected to be of the following form:
-<pre>{TABLE|VIEW} <em>trainingSource</em> (
- ...
- <em>trainingClassColumn</em> INTEGER,
- <em>trainingAttrColumn</em> INTEGER[],
- ...
-)</pre>
-
-The <b>data to classify</b> is expected to be of the following form:
-<pre>{TABLE|VIEW} <em>classifySource</em> (
- ...
- <em>classifyKeyColumn</em> ANYTYPE,
- <em>classifyAttrColumn</em> INTEGER[],
- ...
-)</pre>
-
-@usage
-
-- Precompute feature probabilities and class priors:
- <pre>SELECT \ref create_nb_prepared_data_tables(
- '<em>trainingSource</em>', '<em>trainingClassColumn</em>', '<em>trainingAttrColumn</em>',
- <em>numAttrs</em>, '<em>featureProbsName</em>', '<em>classPriorsName</em>'
- );</pre>
- This creates table <em>featureProbsName</em> for storing feature
- probabilities and table <em>classPriorsName</em> for storing the class priors.
-- Perform Naive Bayes classification:
- <pre>SELECT \ref create_nb_classify_view(
- '<em>featureProbsName</em>', '<em>classPriorsName</em>',
- '<em>classifySource</em>', '<em>classifyKeyColumn</em>', '<em>classifyAttrColumn</em>',
- <em>numAttrs</em>, '<em>destName</em>'
- );</pre>
- This creates the view <tt><em>destName</em></tt> mapping
- <em>classifyKeyColumn</em> to the Naive Bayes classification:
- <pre>key | nb_classification
-----+------------------
-...</pre>
-- Compute Naive Bayes probabilities:
- <pre>SELECT \ref create_nb_probs_view(
- '<em>featureProbsName</em>', '<em>classPriorsName</em>',
- '<em>classifySource</em>', '<em>classifyKeyColumn</em>', '<em>classifyAttrColumn</em>',
- <em>numAttrs</em>, '<em>destName</em>'
-);</pre>
- This creates the view <tt><em>destName</em></tt> mapping
- <em>classifyKeyColumn</em> and every single class to the Naive Bayes
- probability:
- <pre>key | class | nb_prob
-----+-------+--------
-...</pre>
-- Ad-hoc execution (no precomputation):
- Functions \ref create_nb_classify_view and
- \ref create_nb_probs_view can be used in an ad-hoc fashion without the above
- precomputation step. In this case, replace the function arguments
- <pre>'<em>featureProbsName</em>', '<em>classPriorsName</em>'</pre>
- with
- <pre>'<em>trainingSource</em>', '<em>trainingClassColumn</em>', '<em>trainingAttrColumn</em>'</pre>
-
-@examp
-
-The following is an extremely simplified example of the above option #1 which
-can by verified by hand.
-
--# The training and the classification data:
-\verbatim
-sql> SELECT * FROM training;
- id | class | attributes
-----+-------+------------
- 1 | 1 | {1,2,3}
- 2 | 1 | {1,2,1}
- 3 | 1 | {1,4,3}
- 4 | 2 | {1,2,2}
- 5 | 2 | {0,2,2}
- 6 | 2 | {0,1,3}
-(6 rows)
-
-sql> select * from toclassify;
- id | attributes
-----+------------
- 1 | {0,2,1}
- 2 | {1,2,3}
-(2 rows)
-\endverbatim
--# Precompute feature probabilities and class priors
-\verbatim
-sql> SELECT madlib.create_nb_prepared_data_tables(
-'training', 'class', 'attributes', 3, 'nb_feature_probs', 'nb_class_priors');
-\endverbatim
--# Optionally check the contents of the precomputed tables:
-\verbatim
-sql> SELECT * FROM nb_class_priors;
- class | class_cnt | all_cnt
--------+-----------+---------
- 1 | 3 | 6
- 2 | 3 | 6
-(2 rows)
-
-sql> SELECT * FROM nb_feature_probs;
- class | attr | value | cnt | attr_cnt
--------+------+-------+-----+----------
- 1 | 1 | 0 | 0 | 2
- 1 | 1 | 1 | 3 | 2
- 1 | 2 | 1 | 0 | 3
- 1 | 2 | 2 | 2 | 3
-...
-\endverbatim
--# Create the view with Naive Bayes classification and check the results:
-\verbatim
-sql> SELECT madlib.create_nb_classify_view (
-'nb_feature_probs', 'nb_class_priors', 'toclassify', 'id', 'attributes', 3, 'nb_classify_view_fast');
-
-sql> SELECT * FROM nb_classify_view_fast;
- key | nb_classification
------+-------------------
- 1 | {2}
- 2 | {1}
-(2 rows)
-\endverbatim
--# Look at the probabilities for each class (note that we use "Laplacian smoothing"):
-\verbatim
-sql> SELECT madlib.create_nb_probs_view (
-'nb_feature_probs', 'nb_class_priors', 'toclassify', 'id', 'attributes', 3, 'nb_probs_view_fast');
-
-sql> SELECT * FROM nb_probs_view_fast;
- key | class | nb_prob
------+-------+---------
- 1 | 1 | 0.4
- 1 | 2 | 0.6
- 2 | 1 | 0.75
- 2 | 2 | 0.25
-(4 rows)
-\endverbatim
-
+@anchor literature
@literature
[1] Tom Mitchell: Machine Learning, McGraw Hill, 1997. Book chapter
@@ -255,7 +360,9 @@
[2] Wikipedia, Naive Bayes classifier,
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
-@sa File bayes.sql_in documenting the SQL functions.
+@anchor related
+@par Related Topics
+File bayes.sql_in documenting the SQL functions.
@internal
@sa namespace bayes (documenting the implementation in Python)
diff --git a/src/ports/postgres/modules/conjugate_gradient/conjugate_gradient.sql_in b/src/ports/postgres/modules/conjugate_gradient/conjugate_gradient.sql_in
index 3c47cde..d11dbcc 100644
--- a/src/ports/postgres/modules/conjugate_gradient/conjugate_gradient.sql_in
+++ b/src/ports/postgres/modules/conjugate_gradient/conjugate_gradient.sql_in
@@ -11,15 +11,35 @@
/**
@addtogroup grp_cg
+<div class="toc"><b>Contents</b>
+<ul>
+<li><a href="#syntax">Function Syntax</a></li>
+<li><a href="#examples">Examples</a></li>
+<li><a href="#related">Related Topics</a></li>
+</ul>
+</div>
+
+
\warning <em> This MADlib method is still in early stage development. There may be some
issues that will be addressed in a future version. Interface and implementation
is subject to change. </em>
-@about
This function uses the iterative conjugate gradient method [1] to find a solution to the linear system \f[ \boldsymbol Ax = \boldsymbol b \f]
where \f$ \boldsymbol A \f$ is a symmetric, positive definite matrix and \f$x\f$ and \f$ \boldsymbol b \f$ are vectors.
-@input
+@anchor syntax
+@par Function Syntax
+The conjugate gradient function returns \f$ x \f$ as an array, and has the
+following syntax:
+
+<pre class="syntax">
+conjugate_gradient( table_name,
+                    name_of_row_values_col,
+                    name_of_row_number_col,
+                    array_of_b_values,
+                    desired_precision
+                    )
+</pre>
+
Matrix \f$ \boldsymbol A \f$ is assumed to be stored in a table where each row consists of at least two columns: an array containing the values of the row, and the row number:
<pre>{TABLE|VIEW} <em>matrix_A</em> (
<em>row_number</em> FLOAT,
@@ -29,40 +49,50 @@
\f$ \boldsymbol b \f$ is passed as a FLOAT[] to the function.
-@usage
-Conjugate gradient can be called as follows:
-<pre>SELECT \ref conjugate_gradient('<em>table_name</em>',
- '<em>name_of_row_values_col</em>', '<em>name_of_row_number_col</em>', '<em>aray_of_b_values</em>',
- '<em>desired_precision</em>');</pre>
-Function returns x as an array.
+
+@anchor examples
@examp
--# Construct matrix A according to structure:
-\code
-sql> SELECT * FROM data;
+-# Construct matrix A according to the structure described above.
+<pre class="example">
+SELECT * FROM data;
+</pre>
+Result:
+<pre class="result">
row_num | row_val
----------+---------
+ --------+---------
1 | {2,1}
2 | {1,4}
(2 rows)
-\endcode
--# Call conjugate gradient function:
-\code
-sql> SELECT conjugate_gradient('data','row_val','row_num','{2,1}',1E-6,1);
+</pre>
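+A table of this form can be created as follows (a minimal sketch that matches
+the matrix format described above):
+<pre class="example">
+CREATE TABLE data(row_num FLOAT, row_val FLOAT8[]);
+INSERT INTO data VALUES (1, '{2,1}'),
+                        (2, '{1,4}');
+</pre>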
+
+-# Call the conjugate gradient function.
+<pre class="example">
+SELECT conjugate_gradient( 'data',
+ 'row_val',
+ 'row_num',
+ '{2,1}',
+                           1E-6,
+                           1
+                           );
+</pre>
+<pre class="result">
INFO: COMPUTE RESIDUAL ERROR 14.5655661859659
INFO: ERROR 0.144934004246004
INFO: ERROR 3.12963615962926e-31
INFO: TEST FINAL ERROR 2.90029642185163e-29
conjugate_gradient
----------------------------
+ --------------------------
{1,-1.31838984174237e-15}
(1 row)
-\endcode
+</pre>
+@anchor literature
@literature
[1] "Conjugate gradient method" Wikipedia - http://en.wikipedia.org/wiki/Conjugate_gradient_method
-@sa File conjugate_gradient.sql_in documenting the SQL function.
+@anchor related
+@par Related Topics
+File conjugate_gradient.sql_in documenting the SQL function.
*/
/**
diff --git a/src/ports/postgres/modules/crf/crf.sql_in b/src/ports/postgres/modules/crf/crf.sql_in
index 84624d6..9eb5dbd 100644
--- a/src/ports/postgres/modules/crf/crf.sql_in
+++ b/src/ports/postgres/modules/crf/crf.sql_in
@@ -15,84 +15,30 @@
/**
@addtogroup grp_crf
+<div class="toc"><b>Contents</b>
+<ul>
+<li><a href="#train">Training Function</a></li>
+<li><a href="#usage">Using CRF</a></li>
+<li><a href="#input">Input</a></li>
+<li><a href="#examples">Examples</a></li>
+<li><a href="#background">Technical Background</a></li>
+<li><a href="#literature">Literature</a></li>
+<li><a href="#related">Related Topics</a></li>
+</ul>
+</div>
+
\warning <em> This MADlib method is still in early stage development. There may be some
issues that will be addressed in a future version. Interface and implementation
is subject to change. </em>
-@about
-A conditional random field (CRF) is a type of discriminative, undirected probabilistic graphical model. A linear-chain CRF is a special
-type of CRF that assumes the current state depends only on the previous state.
+A conditional random field (CRF) is a type of discriminative, undirected
+probabilistic graphical model. A linear-chain CRF is a special type of CRF
+that assumes the current state depends only on the previous state.
-Specifically, a linear-chain CRF is a distribution defined by
-\f[
- p_\lambda(\boldsymbol y | \boldsymbol x) =
- \frac{\exp{\sum_{m=1}^M \lambda_m F_m(\boldsymbol x, \boldsymbol y)}}{Z_\lambda(\boldsymbol x)}
- \,.
-\f]
+Feature extraction modules are provided for text-analysis tasks such as part-of-speech
+(POS) tagging and named-entity resolution (NER). Currently, six
+feature types are implemented:
-where
-- \f$ F_m(\boldsymbol x, \boldsymbol y) = \sum_{i=1}^n f_m(y_i,y_{i-1},x_i) \f$ is a global feature function that is a sum along a sequence
- \f$ \boldsymbol x \f$ of length \f$ n \f$
-- \f$ f_m(y_i,y_{i-1},x_i) \f$ is a local feature function dependent on the current token label \f$ y_i \f$, the previous token label \f$ y_{i-1} \f$,
- and the observation \f$ x_i \f$
-- \f$ \lambda_m \f$ is the corresponding feature weight
-- \f$ Z_\lambda(\boldsymbol x) \f$ is an instance-specific normalizer
-\f[
-Z_\lambda(\boldsymbol x) = \sum_{\boldsymbol y'} \exp{\sum_{m=1}^M \lambda_m F_m(\boldsymbol x, \boldsymbol y')}
-\f]
-
-A linear-chain CRF estimates the weights \f$ \lambda_m \f$ by maximizing the log-likelihood
-of a given training set \f$ T=\{(x_k,y_k)\}_{k=1}^N \f$.
-
-The log-likelihood is defined as
-\f[
- \ell_{\lambda}=\sum_k \log p_\lambda(y_k|x_k) =\sum_k[\sum_{m=1}^M \lambda_m F_m(x_k,y_k) - \log Z_\lambda(x_k)]
-\f]
-
-and the zero of its gradient
-\f[
- \nabla \ell_{\lambda}=\sum_k[F(x_k,y_k)-E_{p_\lambda(Y|x_k)}[F(x_k,Y)]]
-\f]
-
-is found since the maximum likelihood is reached when the empirical average of the global feature vector equals its model expectation. The MADlib implementation uses limited-memory BFGS (L-BFGS), a limited-memory variation of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) update, a quasi-Newton method for unconstrained optimization.
-
-\f$E_{p_\lambda(Y|x)}[F(x,Y)]\f$ is found by using a variant of the forward-backward algorithm:
-\f[
- E_{p_\lambda(Y|x)}[F(x,Y)] = \sum_y p_\lambda(y|x)F(x,y)
- = \sum_i\frac{\alpha_{i-1}(f_i*M_i)\beta_i^T}{Z_\lambda(x)}
-\f]
-\f[
- Z_\lambda(x) = \alpha_n.1^T
-\f]
- where \f$\alpha_i\f$ and \f$ \beta_i\f$ are the forward and backward state cost vectors defined by
-\f[
- \alpha_i =
- \begin{cases}
- \alpha_{i-1}M_i, & 0<i<=n\\
- 1, & i=0
- \end{cases}\\
-\f]
-\f[
- \beta_i^T =
- \begin{cases}
- M_{i+1}\beta_{i+1}^T, & 1<=i<n\\
- 1, & i=n
- \end{cases}
-\f]
-
-To avoid overfitting, we penalize the likelihood with a spherical Gaussian weight prior:
-\f[
- \ell_{\lambda}^\prime=\sum_k[\sum_{m=1}^M \lambda_m F_m(x_k,y_k) - \log Z_\lambda(x_k)] - \frac{\lVert \lambda \rVert^2}{2\sigma ^2}
-\f]
-
-\f[
- \nabla \ell_{\lambda}^\prime=\sum_k[F(x_k,y_k) - E_{p_\lambda(Y|x_k)}[F(x_k,Y)]] - \frac{\lambda}{\sigma ^2}
-\f]
-
-
-
-Feature extraction modules are provided for text-analysis
-tasks such as part-of-speech (POS) tagging and named-entity resolution (NER). Currently, six feature types are implemented:
- Edge Feature: transition feature that encodes the transition feature
weight from current label to next label.
- Start Feature: fired when the current token is the first token in a sequence.
@@ -108,8 +54,95 @@
to get the best label sequence and the conditional probability
\f$ \Pr( \text{best label sequence} \mid \text{sequence}) \f$.
-For a full example of how to use the MADlib CRF modules for a text analytics application, see the "Example" section below.
+@anchor train
+@par Training Function
+Get the number of iterations and the weights for features:\n
+
+<pre class="syntax">
+lincrf( source,
+ sparse_R,
+ dense_M,
+ sparse_M,
+ featureSize,
+ tagSize,
+ featureset,
+ crf_feature,
+ maxNumIterations
+ )
+</pre>
+\b Arguments
+<dl class="arglist">
+<dt>source</dt> <dd>Name of the source relation containing the training data</dd>
+<dt>sparse_R</dt> <dd>Name of the sparse single state feature column (of type DOUBLE PRECISION[])</dd>
+<dt>dense_M</dt> <dd>Name of the dense two state feature column (of type DOUBLE PRECISION[])</dd>
+<dt>sparse_M</dt> <dd>Name of the sparse two state feature column (of type DOUBLE PRECISION[])</dd>
+<dt>featureSize</dt> <dd>Name of feature size column (of type DOUBLE PRECISION)</dd>
+<dt>tagSize</dt> <dd>The number of tags in the tag set</dd>
+<dt>featureset</dt> <dd>The unique feature set</dd>
+<dt>crf_feature</dt> <dd>Name of the output feature table</dd>
+<dt>maxNumIterations</dt> <dd>The maximum number of iterations</dd>
+</dl>
+The features and weights are stored in the table named by \e crf_feature.
+This function returns a composite value containing the following columns:
+<table class="output">
+ <tr><th>coef</th> <td>FLOAT8[]. Array of coefficients</td></tr>
+<tr><th>log_likelihood</th> <td>FLOAT8. Log-likelihood </td></tr>
+<tr><th>num_iterations</th> <td>INTEGER. The number of iterations before the algorithm terminated </td></tr>
+</table>
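+
+Assuming the composite return type described above, the output columns can be
+unpacked as follows (a sketch; the argument values are taken from the example
+below):
+<pre class="example">
+SELECT (r).coef, (r).log_likelihood, (r).num_iterations
+FROM ( SELECT lincrf( 'train_featuretbl', 'sparse_r', 'dense_m', 'sparse_m',
+                      'f_size', 45, 'train_featureset', 'train_crf_feature',
+                      20 ) AS r
+     ) q;
+</pre>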
+
+@anchor usage
+@par Using CRF
+
+Generate text features, calculate their weights, and output the best label sequence for test data:\n
+
+ -# Create tables to store the input data, intermediate data, and output data.
+ Also import the training data to the database.
+ <pre>
+ SELECT madlib.crf_train_data( '<em>/path/to/data</em>');
+ </pre>
+ -# Generate text analytics features for the training data.
+ <pre>SELECT madlib.crf_train_fgen(
+ '<em>segmenttbl</em>',
+ '<em>regextbl</em>',
+ '<em>dictionary</em>',
+ '<em>featuretbl</em>',
+ '<em>featureset</em>');</pre>
+ -# Use linear-chain CRF for training.
+ <pre>SELECT madlib.lincrf(
+ '<em>source</em>',
+ '<em>sparse_r</em>',
+ '<em>dense_m</em>',
+ '<em>sparse_m</em>',
+ '<em>f_size</em>',
+ <em>tag_size</em>,
+ '<em>feature_set</em>',
+ '<em>featureWeights</em>',
+ '<em>maxNumIterations</em>');</pre>
+ -# Import CRF model to the database.
+ Also load the CRF testing data to the database.
+ <pre>SELECT madlib.crf_test_data(
+ '<em>/path/to/data</em>');</pre>
+ -# Generate text analytics features for the testing data.
+ <pre>SELECT madlib.crf_test_fgen(
+ '<em>segmenttbl</em>',
+ '<em>dictionary</em>',
+ '<em>labeltbl</em>',
+ '<em>regextbl</em>',
+ '<em>featuretbl</em>',
+ '<em>viterbi_mtbl</em>',
+ '<em>viterbi_rtbl</em>');</pre>
+ 'viterbi_mtbl' and 'viterbi_rtbl' are text values naming the tables created by the feature generation module (i.e., they are NOT empty tables).
+ -# Run the Viterbi function to get the best label sequence and the conditional
+ probability \f$ \Pr( \text{best label sequence} \mid \text{sequence}) \f$.
+ <pre>SELECT madlib.vcrf_label(
+ '<em>segmenttbl</em>',
+ '<em>viterbi_mtbl</em>',
+ '<em>viterbi_rtbl</em>',
+ '<em>labeltbl</em>',
+ '<em>resulttbl</em>');</pre>
+
+@anchor input
@input
- User-provided input:\n
The user is expected to at least provide the label table, the regular expression table, and the segment table:
@@ -181,76 +214,19 @@
...
)</pre>
-@usage
-- Get number of iterations and weights for features:\n
- <pre>SELECT * FROM \ref lincrf(
- '<em>featureTableName</em>', '<em>sparse_r</em>', '<em>dense_m</em>','<em>sparse_m</em>', '<em>f_size</em>', <em>tag_size</em>, '<em>feature_set</em>', '<em>featureWeightsName</em>'
- [, <em>maxNumberOfIterations</em> ] ]
-);</pre>
- where tag_size is the total number of labels.
- Output:
-<pre> lincrf
------------------
- [number of iterations]</pre>
-
- <em>featureWeightsName</em>:
-<pre> id | name | prev_label_id | label_id | weight
-----+----------------+---------------+----------+-------------------
-</pre>
-
-- Generate text features, calculate their weights, and output the best label sequence for test data:\n
- -# Create tables to store the input data, intermediate data, and output data.
- Also import the training data to the database.
- <pre>SELECT madlib.crf_train_data(
- '<em>/path/to/data</em>');</pre>
- -# Generate text analytics features for the training data.
- <pre>SELECT madlib.crf_train_fgen(
- '<em>segmenttbl</em>',
- '<em>regextbl</em>',
- '<em>dictionary</em>',
- '<em>featuretbl</em>',
- '<em>featureset</em>');</pre>
- -# Use linear-chain CRF for training.
- <pre>SELECT madlib.lincrf(
- '<em>source</em>',
- '<em>sparse_r</em>',
- '<em>dense_m</em>',
- '<em>sparse_m</em>',
- '<em>f_size</em>',
- <em>tag_size</em>,
- '<em>feature_set</em>',
- '<em>featureWeights</em>',
- '<em>maxNumIterations</em>');</pre>
- -# Import CRF model to the database.
- Also load the CRF testing data to the database.
- <pre>SELECT madlib.crf_test_data(
- '<em>/path/to/data</em>');</pre>
- -# Generate text analytics features for the testing data.
- <pre>SELECT madlib.crf_test_fgen(
- '<em>segmenttbl</em>',
- '<em>dictionary</em>',
- '<em>labeltbl</em>',
- '<em>regextbl</em>',
- '<em>featuretbl</em>',
- '<em>viterbi_mtbl</em>',
- '<em>viterbi_rtbl</em>');</pre>
- 'viterbi_mtbl' and 'viterbi_rtbl' are simply text representing names for tables created in the feature generation module (i.e. they are NOT empty tables).
- -# Run the Viterbi function to get the best label sequence and the conditional
- probability \f$ \Pr( \text{best label sequence} \mid \text{sequence}) \f$.
- <pre>SELECT madlib.vcrf_label(
- '<em>segmenttbl</em>',
- '<em>viterbi_mtbl</em>',
- '<em>viterbi_rtbl</em>',
- '<em>labeltbl</em>',
- '<em>resulttbl</em>');</pre>
-
+@anchor examples
@examp
+This example uses a trivial training and test data set.
+
-# Load the label table, the regular expressions table, and the training segment table:
-@verbatim
-sql> SELECT * FROM crf_label;
+<pre class="example">
+SELECT * FROM crf_label;
+</pre>
+Result:
+<pre class="result">
id | label
-----+-------
+ ---+-------
1 | CD
13 | NNP
15 | PDT
@@ -260,19 +236,27 @@
33 | WP
35 | WRB
...
-
-sql> SELECT * from crf_regex;
+</pre>
+The regular expressions table:
+<pre class="example">
+SELECT * from crf_regex;
+</pre>
+<pre class="result">
pattern | name
----------------+----------------------
+ --------------+----------------------
^.+ing$ | endsWithIng
^[A-Z][a-z]+$ | InitCapital
^[A-Z]+$ | isAllCapital
^.*[0-9]+.*$ | containsDigit
...
-
-sql> SELECT * from train_segmenttbl;
+</pre>
+The training segment table:
+<pre class="example">
+SELECT * from train_segmenttbl;
+</pre>
+<pre class="result">
start_pos | doc_id | seg_text | label | max_pos
------------+--------+------------+-------+---------
+ ----------+--------+------------+-------+---------
8 | 1 | alliance | 11 | 26
10 | 1 | Ford | 13 | 26
12 | 1 | that | 5 | 26
@@ -286,20 +270,29 @@
25 | 1 | return | 11 | 26
9 | 2 | later | 19 | 10
...
-@endverbatim
--# Create the (empty) dictionary table, feature table, and feature set:
-@verbatim
-sql> CREATE TABLE crf_dictionary(token text,total integer);
-sql> CREATE TABLE train_featuretbl(doc_id integer,f_size FLOAT8,sparse_r FLOAT8[],dense_m FLOAT8[],sparse_m FLOAT8[]);
-sql> CREATE TABLE train_featureset(f_index integer, f_name text, feature integer[]);
-@endverbatim
--# Generate the training features:
-@verbatim
-sql> SELECT crf_train_fgen('train_segmenttbl', 'crf_regex', 'crf_dictionary', 'train_featuretbl','train_featureset');
+</pre>
-sql> SELECT * from crf_dictionary;
+-# Create the (empty) dictionary table, feature table, and feature set:
+<pre class="example">
+CREATE TABLE crf_dictionary(token text,total integer);
+CREATE TABLE train_featuretbl(doc_id integer,f_size FLOAT8,sparse_r FLOAT8[],dense_m FLOAT8[],sparse_m FLOAT8[]);
+CREATE TABLE train_featureset(f_index integer, f_name text, feature integer[]);
+</pre>
+
+-# Generate the training features:
+<pre class="example">
+SELECT crf_train_fgen( 'train_segmenttbl',
+ 'crf_regex',
+ 'crf_dictionary',
+ 'train_featuretbl',
+ 'train_featureset'
+ );
+SELECT * from crf_dictionary;
+</pre>
+Result:
+<pre class="result">
token | total
-------------+-------
+ -----------+-------
talks | 1
that | 1
would | 1
@@ -309,16 +302,23 @@
after | 1
operations | 1
...
-
-sql> SELECT * from train_featuretbl;
+</pre>
+<pre class="example">
+SELECT * from train_featuretbl;
+</pre>
+Result:
+<pre class="result">
doc_id | f_size | sparse_r | dense_m | sparse_m
---------+--------+-------------------------------+---------------------------------+-----------------------
+ -------+--------+-------------------------------+---------------------------------+-----------------------
2 | 87 | {-1,13,12,0,1,-1,13,9,0,1,..} | {13,31,79,1,1,31,29,70,2,1,...} | {51,26,2,69,29,17,...}
1 | 87 | {-1,13,0,0,1,-1,13,9,0,1,...} | {13,0,62,1,1,0,13,54,2,1,13,..} | {51,26,2,69,29,17,...}
-
-sql> SELECT * from train_featureset;
+</pre>
+<pre class="example">
+SELECT * from train_featureset;
+</pre>
+<pre class="result">
f_index | f_name | feature
----------+---------------+---------
+ --------+---------------+---------
1 | R_endsWithED | {-1,29}
13 | W_outweigh | {-1,26}
29 | U | {-1,5}
@@ -337,22 +337,38 @@
85 | E. | {16,11}
4 | W_return | {-1,11}
...
+</pre>
-@endverbatim
-# Create the (empty) feature weight table:
-@verbatim
-sql> CREATE TABLE train_crf_feature (id integer,name text,prev_label_id integer,label_id integer,weight float);
-@endverbatim
--# Train using linear CRF:
-@verbatim
-sql> SELECT lincrf('train_featuretbl','sparse_r','dense_m','sparse_m','f_size',45, 'train_featureset','train_crf_feature', 20);
- lincrf
---------
- 20
+<pre class="example">
+CREATE TABLE train_crf_feature (id integer,name text,prev_label_id integer,label_id integer,weight float);
+</pre>
-sql> SELECT * from train_crf_feature;
+-# Train using linear CRF:
+<pre class="example">
+SELECT lincrf( 'train_featuretbl',
+ 'sparse_r',
+ 'dense_m',
+ 'sparse_m',
+               'f_size',
+               45,
+ 'train_featureset',
+ 'train_crf_feature',
+ 20
+ );
+</pre>
+<pre class="result">
+ lincrf
+ -------
+ 20
+</pre>
+View the feature weight table.
+<pre class="example">
+SELECT * from train_crf_feature;
+</pre>
+Result:
+<pre class="result">
id | name | prev_label_id | label_id | weight
-----+---------------+---------------+----------+-------------------
+ ---+---------------+---------------+----------+-------------------
1 | R_endsWithED | -1 | 29 | 1.54128249293937
13 | W_outweigh | -1 | 26 | 1.70691232223653
29 | U | -1 | 5 | 1.40708515869008
@@ -369,13 +385,16 @@
71 | E. | 2 | 11 | 3.00970493772732
83 | W_the | -1 | 2 | 2.58742315259326
...
+</pre>
-@endverbatim
-# To find the best labels for a test set using the trained linear CRF model, repeat steps #1-2 and generate the test features, except instead of creating a new dictionary, use the dictionary generated from the training set.
-@verbatim
-sql> SELECT * from test_segmenttbl;
+<pre class="example">
+SELECT * from test_segmenttbl;
+</pre>
+Result:
+<pre class="result">
start_pos | doc_id | seg_text | max_pos
------------+--------+-------------+---------
+ ----------+--------+-------------+---------
1 | 1 | collapse | 22
13 | 1 | , | 22
15 | 1 | is | 22
@@ -385,16 +404,35 @@
18 | 1 | defensive | 22
20 | 1 | with | 22
...
+</pre>
+<pre class="example">
+SELECT crf_test_fgen( 'test_segmenttbl',
+ 'crf_dictionary',
+ 'crf_label',
+ 'crf_regex',
+ 'train_crf_feature',
+ 'viterbi_mtbl',
+ 'viterbi_rtbl'
+ );
+</pre>
-sql> SELECT crf_test_fgen('test_segmenttbl','crf_dictionary','crf_label','crf_regex','train_crf_feature','viterbi_mtbl','viterbi_rtbl');
-@endverbatim
--# Calculate the best label sequence:
-@verbatim
-sql> SELECT vcrf_label('test_segmenttbl','viterbi_mtbl','viterbi_rtbl','crf_label','extracted_best_labels');
-
-sql> SELECT * FROM extracted_best_labels;
+-# Calculate the best label sequence and save it in the table \c extracted_best_labels.
+<pre class="example">
+SELECT vcrf_label( 'test_segmenttbl',
+ 'viterbi_mtbl',
+ 'viterbi_rtbl',
+ 'crf_label',
+ 'extracted_best_labels'
+ );
+</pre>
+View the best labels.
+<pre class="example">
+SELECT * FROM extracted_best_labels;
+</pre>
+Result:
+<pre class="result">
doc_id | start_pos | seg_text | label | id | prob
---------+-----------+-------------+-------+----+-------
+ -------+-----------+-------------+-------+----+-------
1 | 2 | Friday | NNP | 14 | 9e-06
1 | 6 | Ford | NNP | 14 | 9e-06
1 | 12 | Jaguar | NNP | 14 | 9e-06
@@ -407,8 +445,79 @@
1 | 1 | collapse | CC | 1 | 9e-06
1 | 7 | would | POS | 17 | 9e-06
...
-@endverbatim
-(Note that this example was done on a trivial training and test data set.)
+</pre>
+
+
+@anchor background
+@par Technical Background
+
+Specifically, a linear-chain CRF is a distribution defined by
+\f[
+ p_\lambda(\boldsymbol y | \boldsymbol x) =
+ \frac{\exp{\sum_{m=1}^M \lambda_m F_m(\boldsymbol x, \boldsymbol y)}}{Z_\lambda(\boldsymbol x)}
+ \,.
+\f]
+
+where
+- \f$ F_m(\boldsymbol x, \boldsymbol y) = \sum_{i=1}^n f_m(y_i,y_{i-1},x_i) \f$ is a global feature function that is a sum along a sequence
+ \f$ \boldsymbol x \f$ of length \f$ n \f$
+- \f$ f_m(y_i,y_{i-1},x_i) \f$ is a local feature function dependent on the current token label \f$ y_i \f$, the previous token label \f$ y_{i-1} \f$,
+ and the observation \f$ x_i \f$
+- \f$ \lambda_m \f$ is the corresponding feature weight
+- \f$ Z_\lambda(\boldsymbol x) \f$ is an instance-specific normalizer
+\f[
+Z_\lambda(\boldsymbol x) = \sum_{\boldsymbol y'} \exp{\sum_{m=1}^M \lambda_m F_m(\boldsymbol x, \boldsymbol y')}
+\f]
+
+A linear-chain CRF estimates the weights \f$ \lambda_m \f$ by maximizing the log-likelihood
+of a given training set \f$ T=\{(x_k,y_k)\}_{k=1}^N \f$.
+
+The log-likelihood is defined as
+\f[
+ \ell_{\lambda}=\sum_k \log p_\lambda(y_k|x_k) =\sum_k[\sum_{m=1}^M \lambda_m F_m(x_k,y_k) - \log Z_\lambda(x_k)]
+\f]
+
+and the zero of its gradient
+\f[
+ \nabla \ell_{\lambda}=\sum_k[F(x_k,y_k)-E_{p_\lambda(Y|x_k)}[F(x_k,Y)]]
+\f]
+
+is found since the maximum likelihood is reached when the empirical average of the global feature vector equals its model expectation. The MADlib implementation uses limited-memory BFGS (L-BFGS), a limited-memory variation of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) update, a quasi-Newton method for unconstrained optimization.
+
+\f$E_{p_\lambda(Y|x)}[F(x,Y)]\f$ is found by using a variant of the forward-backward algorithm:
+\f[
+ E_{p_\lambda(Y|x)}[F(x,Y)] = \sum_y p_\lambda(y|x)F(x,y)
+ = \sum_i\frac{\alpha_{i-1}(f_i*M_i)\beta_i^T}{Z_\lambda(x)}
+\f]
+\f[
+ Z_\lambda(x) = \alpha_n.1^T
+\f]
+ where \f$\alpha_i\f$ and \f$ \beta_i\f$ are the forward and backward state cost vectors defined by
+\f[
+ \alpha_i =
+ \begin{cases}
+ \alpha_{i-1}M_i, & 0<i<=n\\
+ 1, & i=0
+ \end{cases}\\
+\f]
+\f[
+ \beta_i^T =
+ \begin{cases}
+ M_{i+1}\beta_{i+1}^T, & 1<=i<n\\
+ 1, & i=n
+ \end{cases}
+\f]
+
+To avoid overfitting, we penalize the likelihood with a spherical Gaussian weight prior:
+\f[
+ \ell_{\lambda}^\prime=\sum_k[\sum_{m=1}^M \lambda_m F_m(x_k,y_k) - \log Z_\lambda(x_k)] - \frac{\lVert \lambda \rVert^2}{2\sigma ^2}
+\f]
+
+\f[
+ \nabla \ell_{\lambda}^\prime=\sum_k[F(x_k,y_k) - E_{p_\lambda(Y|x_k)}[F(x_k,Y)]] - \frac{\lambda}{\sigma ^2}
+\f]
+
+
+@anchor literature
@literature
[1] F. Sha, F. Pereira. Shallow Parsing with Conditional Random Fields, http://www-bcf.usc.edu/~feisha/pubs/shallow03.pdf
@@ -425,7 +534,10 @@
[7] J. Nocedal, Software for Large-scale Unconstrained Optimization, http://users.eecs.northwestern.edu/~nocedal/lbfgs.html
-@sa File crf.sql_in crf_feature_gen.sql_in viterbi.sql_in (documenting the SQL functions)
+@anchor related
+@par Related Topics
+
+File crf.sql_in, crf_feature_gen.sql_in, and viterbi.sql_in (documenting the SQL functions)
*/
diff --git a/src/ports/postgres/modules/data_profile/profile.sql_in b/src/ports/postgres/modules/data_profile/profile.sql_in
index 14b71a2..bd447d0 100644
--- a/src/ports/postgres/modules/data_profile/profile.sql_in
+++ b/src/ports/postgres/modules/data_profile/profile.sql_in
@@ -15,46 +15,65 @@
/**
@addtogroup grp_profile
+<div class="toc"><b>Contents</b>
+<ul>
+<li><a href="#syntax">Function Syntax</a></li>
+<li><a href="#examples">Examples</a></li>
+<li><a href="#notes">Implementation Notes</a></li>
+<li><a href="#related">Related Topics</a></li>
+</ul>
+</div>
+
\warning <em> This MADlib method is still in early stage development. There may be some
issues that will be addressed in a future version. Interface and implementation
is subject to change. </em>
-@about
This module computes a "profile" of a table or view: a predefined set of
aggregates to be run on each column of a table.
-The following aggregates will be called on every integer column:
+The following aggregates are called on every integer column:
- min(), max(), avg()
- madlib.cmsketch_median()
- madlib.cmsketch_depth_histogram()
- madlib.cmsketch_width_histogram()
-And these on non-integer columns:
+The following aggregates are called on non-integer columns:
- madlib.fmsketch_dcount()
- madlib.mfvsketch_quick_histogram()
- madlib.mfvsketch_top_histogram()
-Because the input schema of the table or view is unknown, we need to synthesize
-SQL to suit. This is done either via the <c>profile</c> or <c>profile_full</c>
+Because the input schema of the table or view is unknown, SQL is synthesized
+to suit the input. This is done either via the <c>profile</c> or <c>profile_full</c>
user defined function.
-@usage
+@anchor syntax
+@par Function Syntax
-- Generate a basic profile information (subset of predefined aggregate functions)
- for all columns of the input table.
- <pre>SELECT * FROM \ref profile( '<em>input_table</em>');</pre>
-- Generate a full profile information (all predefined aggregate functions)
- for all columns of the input table.
- <pre>SELECT * FROM \ref profile_full( '<em>input_table</em>', <em>buckets</em>);</pre>
+Generate basic profile information (subset of predefined aggregate functions)
+for all columns of the input table.
+<pre class="syntax">
+profile( input_table )
+</pre>
+Generate full profile information (all predefined aggregate functions)
+for all columns of the input table.
+<pre class="syntax">
+profile_full( input_table,
+              buckets
+            )
+</pre>
+
+@anchor examples
@examp
-- For basic profile run:
-\verbatim
-sql> SELECT * FROM profile( 'pg_catalog.pg_tables');
-
+-# Generate basic profile information.
+<pre class="example">
+SELECT * FROM profile( 'pg_catalog.pg_tables');
+</pre>
+Result:
+<pre class="result">
schema_name | table_name | column_name | function | value
--------------+------------+-------------+-------------------+-------
+ ------------+------------+-------------+-------------------+-------
pg_catalog | pg_tables | * | COUNT() | 105
pg_catalog | pg_tables | schemaname | fmsketch_dcount() | 6
pg_catalog | pg_tables | tablename | fmsketch_dcount() | 104
@@ -64,14 +83,19 @@
pg_catalog | pg_tables | hasrules | fmsketch_dcount() | 1
pg_catalog | pg_tables | hastriggers | fmsketch_dcount() | 2
(8 rows)
-\endverbatim
+</pre>
-- For full profile run:
-\verbatim
-sql> SELECT * FROM profile_full( 'pg_catalog.pg_tables', 5);
+-# Generate full profile information.
+<pre class="example">
+SELECT * FROM profile_full( 'pg_catalog.pg_tables',
+ 5
+ );
+</pre>
+Result:
+<pre class="result">
schema_name | table_name | column_name | function | value
--------------+------------+-------------+-------------------------------------------------+----------------------------------------------------------------------------------------------------
+ ------------+------------+-------------+-------------------------------------------------+----------------------------------------------------------------------------------------------------
pg_catalog | pg_tables | * | COUNT() | 105
pg_catalog | pg_tables | schemaname | fmsketch_dcount() | 6
pg_catalog | pg_tables | schemaname | array_collapse(mfvsketch_quick_histogram((),5)) | [0:4]={pg_catalog:68,public:19,information_schema:7,gp_toolkit:5,maddy:5}
@@ -95,16 +119,19 @@
pg_catalog | pg_tables | hastriggers | array_collapse(mfvsketch_quick_histogram((),5)) | [0:1]={f:102,t:3}
pg_catalog | pg_tables | hastriggers | array_collapse(mfvsketch_top_histogram((),5)) | [0:1]={f:102,t:3}
(22 rows)
-\endverbatim
+</pre>
-@implementation
+@anchor notes
+@par Implementation Notes
Because some of the aggregate functions used in profile return multi-dimensional
arrays, which are not easily handled in pl/python, we use the
<c>array_collapse</c> function to collapse the n-dimensional arrays to 1-dimensional arrays.
Values in dimensions 2 and higher are separated with the ':' character.
-@sa File profile.sql_in documenting SQL functions.
+@anchor related
+@par Related Topics
+File profile.sql_in documenting SQL functions.
*/
CREATE TYPE MADLIB_SCHEMA.profile_result AS (
diff --git a/src/ports/postgres/modules/linalg/linalg.sql_in b/src/ports/postgres/modules/linalg/linalg.sql_in
index 96631f5..5ee4a68 100644
--- a/src/ports/postgres/modules/linalg/linalg.sql_in
+++ b/src/ports/postgres/modules/linalg/linalg.sql_in
@@ -14,22 +14,77 @@
/**
@addtogroup grp_linalg
-\warning <em> This MADlib method is still in early stage development. There may be some
+<div class="toc"><b>Contents</b>
+<ul>
+<li><a href="#functions">Linear Algebra Utility Functions</a></li>
+<li><a href="#related">Related Topics</a></li>
+</ul>
+</div>
+
+
+\warning <em> This MADlib module is still in early stage development. There may be some
issues that will be addressed in a future version. Interface and implementation
is subject to change. </em>
-@about
The linalg module consists of useful utility functions for basic linear
algebra operations. Several of these functions can be used
while implementing new algorithms.
-Refer to the file for documentation on each of the utlity functions.
+Refer to the linalg.sql_in file for documentation on each of the utility functions.
+
+@anchor functions
+@par Linear Algebra Utility Functions
+<table class="output">
+
+<tr><th>norm1()</th>
+<td>1-norm of a vector.</td></tr>
+
+<tr><th>norm2()</th>
+<td>2-norm of a vector. </td></tr>
+
+<tr><th>dist_norm1()</th>
+<td>1-norm of the difference between two vectors. </td></tr>
+
+<tr><th>dist_norm2()</th>
+<td>2-norm of the difference between two vectors. </td></tr>
+
+<tr><th>squared_dist_norm2()</th>
+<td>Squared 2-norm of the difference between two vectors. </td></tr>
+
+<tr><th>dist_angle()</th>
+<td>Angle between two vectors. </td></tr>
+
+<tr><th>dist_tanimoto()</th>
+<td>Tanimoto distance between two vectors. </td></tr>
+
+<tr><th>closest_column()</th>
+<td>Given matrix \f$ M \f$ and vector \f$ \vec x \f$, compute the column of \f$ M \f$ that is closest to \f$ \vec x \f$.</td></tr>
-Linear-algebra functions.
+<tr><th>closest_columns()</th>
+<td>Given matrix \f$ M \f$ and vector \f$ \vec x \f$, compute the columns of \f$ M \f$ that are closest to \f$ \vec x \f$.</td></tr>
-@sa File linalg.sql_in documenting the SQL functions.
+
+<tr><th>avg()</th>
+<td>Compute the average of vectors. </td></tr>
+
+<tr><th>normalized_avg()</th>
+<td>Compute the normalized average of vectors. </td>
+</tr>
+<tr><th>matrix_agg()</th>
+<td>Combine vectors to a matrix. </td></tr>
+
+<tr><th>matrix_column()</th>
+<td>Return a specified column of a matrix. </td></tr>
+
+</table>
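+
+As a quick illustration, these vector utilities operate directly on FLOAT8[] arrays
+and can be called inline in any query. A minimal sketch, assuming MADlib is installed
+under the <c>madlib</c> schema:
+<pre class="example">
+SELECT madlib.norm2(ARRAY[3, 4]::FLOAT8[]);              -- 5
+SELECT madlib.dist_norm1(ARRAY[1, 2]::FLOAT8[],
+                         ARRAY[4, 6]::FLOAT8[]);         -- |1-4| + |2-6| = 7
+SELECT madlib.squared_dist_norm2(ARRAY[1, 2]::FLOAT8[],
+                                 ARRAY[4, 6]::FLOAT8[]); -- 9 + 16 = 25
+</pre>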
+
+@anchor related
+@par Related Topics
+File linalg.sql_in documenting the SQL functions.
*/
/**
diff --git a/src/ports/postgres/modules/quantile/quantile.sql_in b/src/ports/postgres/modules/quantile/quantile.sql_in
index 4b53882..ca3fbe2 100644
--- a/src/ports/postgres/modules/quantile/quantile.sql_in
+++ b/src/ports/postgres/modules/quantile/quantile.sql_in
@@ -13,44 +13,79 @@
/**
@addtogroup grp_quantile
+<div class="toc"><b>Contents</b>
+<ul>
+<li><a href="#syntax">Function Syntax</a></li>
+<li><a href="#examples">Examples</a></li>
+<li><a href="#related">Related Topics</a></li>
+</ul>
+</div>
+
\warning <em> This MADlib method is still in early stage development. There may be some
issues that will be addressed in a future version. Interface and implementation
is subject to change. </em>
-@about
-This function computes the specified quantile value. It reads the name of the
+This module computes the specified quantile value. It reads the name of the
table, the specific column, and computes the quantile value based on the
fraction specified as the third argument.
For an implementation of quantile using sketches, check out the cmsketch_centile()
aggregate in the \ref grp_countmin module.
-@implementation
-There are two implementations of quantile available depending on the size of the table. <tt>quantile</tt> is best used for small tables (e.g. less than 5000 rows, with 1-2 columns in total). For larger tables,
-consider using <tt>quantile_big</tt> instead.
-@usage
-<pre>SELECT * FROM quantile( '<em>table_name</em>', '<em>col_name</em>', <em>quantile</em>);</pre>
-<pre>SELECT * FROM quantile_big( '<em>table_name</em>', '<em>col_name</em>', <em>quantile</em>);</pre>
+@anchor syntax
+@par Function Syntax
+There are two implementations of quantile available depending on the size of
+the table.
+
+quantile() is best used for small tables (i.e., less than
+5000 rows, with 1-2 columns in total).
+
+<pre class="syntax">
+quantile( table_name,
+ col_name,
+ quantile
+ )
+</pre>
+
+For larger tables, consider using quantile_big() instead.
+
+<pre class="syntax">
+quantile_big( table_name,
+ col_name,
+ quantile
+ )
+</pre>
+
+@anchor examples
@examp
--# Prepare some input:
-\verbatim
-sql> CREATE TABLE tab1 AS SELECT generate_series( 1,1000) as col1;
-\endverbatim
--# Run the quantile() function:\n
-\verbatim
-sql> SELECT quantile( 'tab1', 'col1', .3);
+-# Prepare some input.
+<pre class="example">
+CREATE TABLE tab1 AS SELECT generate_series(1, 1000) AS col1;
+</pre>
+-# Run the quantile() function.
+<pre class="example">
+SELECT quantile( 'tab1',
+ 'col1',
+ .3
+ );
+</pre>
+Result:
+<pre class="result">
quantile
---------------
+ -------------
301.48046875
(1 row)
-\endverbatim
+</pre>
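+-# For a larger table, quantile_big() takes the same three arguments; a minimal
+sketch on the same input:
+<pre class="example">
+SELECT quantile_big( 'tab1',
+                     'col1',
+                     .3
+                   );
+</pre>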
-@sa File quantile.sql_in documenting the SQL function.\n\n
-Module grp_countmin for an approximate quantile implementation.
+@anchor related
+@par Related Topics
+File quantile.sql_in documenting the SQL function.
+
+Module \ref grp_countmin for an approximate quantile implementation.
*/
diff --git a/src/ports/postgres/modules/sample/sample.sql_in b/src/ports/postgres/modules/sample/sample.sql_in
index 41f28aa..8f9bd85 100644
--- a/src/ports/postgres/modules/sample/sample.sql_in
+++ b/src/ports/postgres/modules/sample/sample.sql_in
@@ -14,17 +14,48 @@
/**
@addtogroup grp_sample
+<div class="toc"><b>Contents</b>
+<ul>
+<li><a href="#func_list">Functions</a></li>
+<li><a href="#related">Related Topics</a></li>
+</ul>
+</div>
+
\warning <em> This MADlib method is still in early stage development. There may be some
issues that will be addressed in a future version. Interface and implementation
is subject to change. </em>
-@about
-
The random sampling module consists of useful utility functions for sampling
-operations. Several of these functions can be used while implementing
+operations. These functions can be used while implementing
new algorithms.
-Refer to the file for documentation on each of the utlity functions.
+@anchor func_list
+@par Functions
+
+Sample a single row according to weights.
+<pre class="syntax">
+weighted_sample( value,
+ weight
+ )
+</pre>
+
+\b Arguments
+<dl class="arglist">
+<dt>value</dt>
+<dd>BIGINT or FLOAT8[]. Value of row. Uniqueness is not enforced. If a value occurs multiple times, the probability of sampling this value is proportional to the sum of its weights. </dd>
+<dt>weight</dt>
+<dd>FLOAT8. Weight for row. A negative value here is treated as zero weight. </dd>
+</dl>
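+
+For illustration, weighted_sample() is used as an aggregate, drawing one sample
+from the rows it aggregates. A minimal sketch with a hypothetical three-row table,
+assuming MADlib is installed under the <c>madlib</c> schema:
+<pre class="example">
+CREATE TABLE items( id BIGINT, w FLOAT8 );
+INSERT INTO items VALUES (1, 1.0), (2, 1.0), (3, 8.0);
+-- returns id 3 with probability 8/10
+SELECT madlib.weighted_sample( id, w ) FROM items;
+</pre>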
+
+Refer to the sample.sql_in file for documentation on each of the utility functions.
+
+@anchor related
+@par Related Topics
+
@sa File sample.sql_in documenting the SQL functions.
*/
diff --git a/src/ports/postgres/modules/svd_mf/svdmf.sql_in b/src/ports/postgres/modules/svd_mf/svdmf.sql_in
index 5475855..9e19502 100644
--- a/src/ports/postgres/modules/svd_mf/svdmf.sql_in
+++ b/src/ports/postgres/modules/svd_mf/svdmf.sql_in
@@ -20,7 +20,18 @@
\warning <em> This is an old implementation of Singular Value Decomposition and
has been deprecated. For the latest version of SVD, please see \ref grp_svd</em>
-@about
+<div class="toc"><b>Contents</b>
+<ul>
+<li><a href="#syntax">SVD Function Syntax</a></li>
+<li><a href="#examples">Examples</a></li>
+<li><a href="#literature">Literature</a></li>
+<li><a href="#related">Related Topics</a></li>
+</ul>
+</div>
+
+\warning <em> This is an old implementation of Singular Value Decomposition and
+has been deprecated. For the latest version of SVD, please see \ref grp_svd</em>
+
This module implements "partial SVD decomposition" method for representing a sparse matrix using a low-rank approximation.
Mathematically, this algorithm seeks to find matrices U and V that, for any given A, minimizes:\n
@@ -34,12 +45,23 @@
Code is based on the write-up as appears at [1], with some modifications.
-@input
+
+@anchor syntax
+@par Function Syntax
+
+The SVD function is called as follows:
+<pre class="syntax">
+svdmf_run( input_table,
+           col_name,
+           row_name,
+           value,
+           num_features
+         )
+</pre>
+
The <b>input matrix</b> is expected to be of the following form:
<pre>{TABLE|VIEW} <em>input_table</em> (
<em>col_num</em> INTEGER,
<em>row_num</em> INTEGER,
- <em>value</em> FLOAT
+ <em>value</em> FLOAT
)</pre>
Input is contained in a table where column number and row number for each cell
@@ -47,32 +69,33 @@
actual row and column numbers and not some random identifiers. All rows and columns must be associated with a value.
There should not be any missing rows, columns, or values.
-@usage
-The SVD function is called as follows:
-<pre>SELECT \ref svdmf_run( '<em>input_table</em>', '<em>col_name</em>',
- '<em>row_name</em>', '<em>value</em>', <em>num_features</em>);</pre>
The function returns two tables \c matrix_u and \c matrix_v, which represent the matrices U and V in table format.
+@anchor examples
@examp
--# Prepare an input table/view:
-\code
-CREATE TABLE svd_test (
- col INT,
- row INT,
- val FLOAT
-);
-\endcode
--# Populate the input table with some data. e.g.:
-\code
-sql> INSERT INTO svd_test SELECT (g.a%1000)+1, g.a/1000+1, random() FROM generate_series(1,1000) AS g(a);
-\endcode
--# Call svdmf_run() stored procedure, e.g.:
-\code
-sql> select madlib.svdmf_run( 'svd_test', 'col', 'row', 'val', 3);
-\endcode
--# Sample Output:
-\code
+-# Prepare an input table/view.
+<pre class="example">
+CREATE TABLE svd_test ( col INT,
+ row INT,
+ val FLOAT
+ );
+</pre>
+-# Populate the input table with some data.
+<pre class="example">
+INSERT INTO svd_test SELECT (g.a%1000)+1, g.a/1000+1, random()
+ FROM generate_series(1,1000) AS g(a);
+</pre>
+-# Call the svdmf_run() stored procedure.
+<pre class="example">
+SELECT madlib.svdmf_run( 'svd_test',
+ 'col',
+ 'row',
+ 'val',
+ 3);
+</pre>
+Example result:
+<pre class="result">
INFO: ('Started svdmf_run() with parameters:',)
INFO: (' * input_matrix = madlib_svdsparse_test.test',)
INFO: (' * col_name = col_num',)
@@ -90,7 +113,7 @@
INFO: ('...Iteration 80: residual_error = 0.34496773222, step_size = 6.99507478893e-10, min_improvement = 1.0',)
INFO: ('Swapping residual error matrix...',)
svdmf_run
---------------------------------------------------------------------------------------------
+ -------------------------------------------------------------------------------------------
Finished SVD matrix factorisation for madlib_svdsparse_test.test (row_num, col_num, val).
Results:
@@ -100,15 +123,18 @@
* table : madlib.matrix_u
* table : madlib.matrix_v
Time elapsed: 4 minutes 47.86839 seconds.
+</pre>
-\endcode
-
+@anchor literature
@literature
[1] Simon Funk, Netflix Update: Try This at Home, December 11 2006,
http://sifter.org/~simon/journal/20061211.html
-@sa File svdmf.sql_in documenting the SQL functions.
+
+@anchor related
+@par Related Topics
+File svdmf.sql_in documenting the SQL functions.
@internal
@sa namespace svdmf (documenting the implementation in Python)
diff --git a/src/ports/postgres/modules/utilities/utilities.sql_in b/src/ports/postgres/modules/utilities/utilities.sql_in
index 3aa4164..17bb0d6 100644
--- a/src/ports/postgres/modules/utilities/utilities.sql_in
+++ b/src/ports/postgres/modules/utilities/utilities.sql_in
@@ -15,20 +15,101 @@
/**
@addtogroup grp_utilities
+<div class="toc"><b>Contents</b>
+ <ul>
+ <li><a href="#utilities">Utility Functions</a></li>
+    <li><a href="#related">Related Topics</a></li>
+ </ul>
+</div>
+
+
\warning <em> This MADlib method is still in early stage development. There may be some
issues that will be addressed in a future version. Interface and implementation
is subject to change. </em>
-@about
The utility module consists of useful utility functions to assist data
scientists in using the product. Several of these functions can be used
while implementing new algorithms.
-Refer to the file for documentation on each of the utlity functions.
+@anchor utilities
+@par Utility Functions
+
+<table class="output">
+
+ <tr>
+ <th>version()</th>
+ <td>Return MADlib build information. </td>
+ </tr>
-@sa File utilities.sql_in documenting the SQL functions.
+ <tr>
+ <th>assert()</th>
+ <td>Raise an exception if the given condition is not satisfied.</td>
+ </tr>
+
+ <tr>
+ <th>relative_error()</th>
+ <td>Compute the relative error of an approximate value.</td>
+ </tr>
+
+ <tr>
+ <th>relative_error()</th>
+    <td>Compute the relative error (w.r.t. the 2-norm) of an approximate vector.</td>
+ </tr>
+
+ <tr>
+ <th>check_if_raises_error()</th>
+ <td>Check if a SQL statement raises an error.</td>
+ </tr>
+
+ <tr>
+ <th>check_if_col_exists()</th>
+ <td>Check if a column exists in a table.</td>
+ </tr>
+
+ <tr>
+ <th>isnan()</th>
+    <td>Check if a floating-point number is NaN (not a number).</td>
+ </tr>
+
+ <tr>
+ <th>create_schema_pg_temp()</th>
+ <td>Create the temporary schema if it does not exist yet.</td>
+ </tr>
+
+ <tr>
+ <th>noop()</th>
+    <td>A volatile function that does nothing (no-op).</td>
+ </tr>
+
+ <tr>
+ <th>bytea8in()</th>
+    <td>Input function for the <c>bytea8</c> type.</td>
+ </tr>
+
+ <tr>
+ <th>bytea8out()</th>
+    <td>Output function for the <c>bytea8</c> type.</td>
+ </tr>
+
+ <tr>
+ <th>bytea8recv()</th>
+    <td>Receive function for the <c>bytea8</c> type.</td>
+ </tr>
+
+ <tr>
+ <th>bytea8send()</th>
+    <td>Send function for the <c>bytea8</c> type.</td>
+ </tr>
+
+
+</table>
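+
+As a quick illustration, most of these helpers can be called directly from SQL.
+A minimal sketch, assuming MADlib is installed under the <c>madlib</c> schema
+(the literal values are arbitrary):
+<pre class="example">
+SELECT madlib.version();                               -- build information string
+SELECT madlib.assert( 1 < 2, 'arithmetic is broken');  -- raises an exception only if the condition is false
+SELECT madlib.relative_error( 9.9, 10.0 );             -- relative error of 9.9 w.r.t. 10.0
+SELECT madlib.isnan( 'NaN'::FLOAT8 );                  -- true
+</pre>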
+
+@anchor related
+@par Related Topics
+
+File utilities.sql_in documenting the SQL functions.
*/