/** | |

@mainpage | |

Apache MADlib is an open-source library for scalable
in-database analytics. It provides data-parallel implementations of
mathematical, statistical, graph and machine learning methods for structured
and unstructured data.

Useful links: | |

<ul>
<li><a href="http://madlib.apache.org">MADlib web site</a></li>
<li><a href="https://cwiki.apache.org/confluence/display/MADLIB">MADlib wiki</a></li>
<li><a href="https://issues.apache.org/jira/browse/MADLIB/">JIRAs for reporting bugs and reviewing backlog</a></li>
<li><a href="https://mail-archives.apache.org/mod_mbox/madlib-user/">User mailing list</a></li>
<li><a href="https://mail-archives.apache.org/mod_mbox/madlib-dev/">Dev mailing list</a></li>
<li>User documentation for earlier releases:
    <a href="../v1.17.0/index.html">v1.17.0</a>,
    <a href="../v1.16/index.html">v1.16</a>,
    <a href="../v1.15.1/index.html">v1.15.1</a>,
    <a href="../v1.15/index.html">v1.15</a>,
    <a href="../v1.14/index.html">v1.14</a>
</li>
</ul>

Please refer to the
<a href="https://github.com/apache/madlib/blob/master/README.md">ReadMe</a>
file for information about incorporated third-party material. License information
regarding MADlib and included third-party libraries can be found in the
<a href="https://github.com/apache/madlib/blob/master/LICENSE">License</a> directory.

@defgroup grp_datatrans Data Types and Transformations | |

@details Data types and operations that transform and shape data. | |

@defgroup grp_arraysmatrix Arrays and Matrices | |

@ingroup grp_datatrans | |

@brief Mathematical operations for arrays and matrices. | |

@details | |

These modules provide basic mathematical operations to be run on arrays and matrices.

For a distributed system, a matrix cannot simply be represented as a 2D array of numbers in memory.

<b>We provide two forms of distributed representation of a matrix</b>: | |

- Dense: The matrix is represented as a distributed collection of 1-D arrays.
    For example, a 3x10 matrix would be stored as the following table:
    <pre>
     row_id |         row_vec
    --------+-------------------------
          1 | {9,6,5,8,5,6,6,3,10,8}
          2 | {8,2,2,6,6,10,2,1,9,9}
          3 | {3,9,9,9,8,6,3,9,5,6}
    </pre>

- Sparse: The matrix is represented using the row and column indices for each
    non-zero entry of the matrix. Example:
    <pre>
     row_id | col_id | value
    --------+--------+-------
          1 |      1 |     9
          1 |      5 |     6
          1 |      6 |     6
          2 |      1 |     8
          3 |      1 |     3
          3 |      2 |     9
          4 |      7 |     0
    (7 rows)
    </pre>


All matrix operations work with either form of representation. | |
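
As a rough illustration of how the two layouts relate, the Python sketch below
converts sparse (row_id, col_id, value) triples into dense (row_id, row_vec)
rows. It is not MADlib code: the helper name `sparse_to_dense` is hypothetical,
and the sketch assumes 1-based indices and that the matrix shape is given by the
largest row and column index present in the input. Under that assumption, the
zero-valued entry (4, 7, 0) in the example above serves only to fix the shape at 4x7.

```python
def sparse_to_dense(entries):
    """Convert (row_id, col_id, value) triples to (row_id, row_vec) rows.

    Assumes 1-based indices. The shape is taken from the largest row_id
    and col_id present in the input (an assumption of this sketch).
    """
    n_rows = max(r for r, _, _ in entries)
    n_cols = max(c for _, c, _ in entries)
    # Start from all-zero rows, then fill in the stored entries.
    rows = {r: [0] * n_cols for r in range(1, n_rows + 1)}
    for r, c, v in entries:
        rows[r][c - 1] = v
    return sorted(rows.items())


sparse = [(1, 1, 9), (1, 5, 6), (1, 6, 6), (2, 1, 8),
          (3, 1, 3), (3, 2, 9), (4, 7, 0)]
dense = sparse_to_dense(sparse)
for row_id, row_vec in dense:
    print(row_id, row_vec)
```

Every position that has no stored entry is filled with zero, which is why the
sparse form can omit those positions entirely.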

In many cases, a matrix function can be <b>decomposed into vector operations
applied independently to each row of a matrix (or to corresponding rows of two
matrices)</b>. These internal vector operations are also exposed
(\ref grp_array) for greater flexibility. Matrix operations like
<em>matrix_add</em> use the corresponding vector operation (<em>array_add</em>)
and add validation and formatting. Other functions like
<em>matrix_mult</em> are more complex and combine such vector operations
with other SQL operations.

<b>Note</b> that these array functions are only available for the
dense representation of a matrix. In general, the scope of a single
array function invocation is limited to one array (1-dimensional or
2-dimensional) that fits in memory. When such a function is executed on a table of
arrays, it is called multiple times, once for each array (or pair of
arrays). In contrast, the scope of a single matrix function invocation is the
complete matrix stored as a distributed table.
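
The row-wise decomposition described above can be sketched in Python as follows.
This only illustrates the idea and is not MADlib's implementation: `array_add`
here is a plain Python stand-in that borrows the name of the MADlib function, and
`matrix_add` is a hypothetical helper that works on dense (row_id, row_vec) pairs
held in memory rather than on a distributed table.

```python
def array_add(x, y):
    """Element-wise sum of two equal-length 1-D arrays (stand-in only)."""
    if len(x) != len(y):
        raise ValueError("arrays must have the same length")
    return [a + b for a, b in zip(x, y)]


def matrix_add(a_rows, b_rows):
    """Add two dense matrices given as lists of (row_id, row_vec) pairs.

    The matrix-level function validates its inputs, then calls the vector
    operation once for each pair of corresponding rows.
    """
    a, b = dict(a_rows), dict(b_rows)
    if a.keys() != b.keys():
        raise ValueError("matrices must have the same row ids")
    return [(row_id, array_add(a[row_id], b[row_id])) for row_id in sorted(a)]


A = [(1, [9, 6, 5]), (2, [8, 2, 2])]
B = [(1, [1, 1, 1]), (2, [0, 3, 4])]
result = matrix_add(A, B)
print(result)
```

Each call to `array_add` sees only one pair of rows, while `matrix_add` is
responsible for the matrix as a whole, which mirrors the difference in scope
between array functions and matrix functions.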

@{ | |

@defgroup grp_array Array Operations | |

@defgroup grp_matrix Matrix Operations | |

@defgroup grp_matrix_factorization Matrix Factorization | |

@brief Linear algebra methods that factorize a matrix into a product of matrices. | |

@details Linear algebra methods that factorize a matrix into a product of matrices. | |

@{ | |

@defgroup grp_lmf Low-Rank Matrix Factorization | |

@defgroup grp_svd Singular Value Decomposition | |

@} | |

@defgroup grp_linalg Norms and Distance Functions | |

@defgroup grp_svec Sparse Vectors | |

@} | |

@defgroup grp_encode_categorical Encoding Categorical Variables | |

@ingroup grp_datatrans | |

@defgroup grp_path Path | |

@ingroup grp_datatrans | |

@defgroup grp_pivot Pivot | |

@ingroup grp_datatrans | |

@defgroup grp_sessionize Sessionize | |

@ingroup grp_datatrans | |

@defgroup grp_stemmer Stemming | |

@ingroup grp_datatrans | |

@defgroup grp_dl Deep Learning | |

@brief A collection of modules for deep learning. | |

@details | |

There are two main steps in order to run deep learning workloads with MADlib:

1. <b>Preparation</b>, which includes data preprocessing and model definition.
   Data preprocessing is required to format training data for use by frameworks
   like Keras and TensorFlow that support mini-batching as an optimization option.
   Model definition involves describing model architectures (and optionally
   custom functions) and loading them into tables.
2. <b>Model training</b>, either one model at a time or multiple models in parallel.
   In the latter case, you need to define the configurations for the multiple models
   that you want to train. This can be done manually or in an automated way using AutoML methods.

The trained models can then be used for evaluation and inference. | |
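
To make the mini-batching part of the preparation step more concrete, here is a
minimal Python sketch that packs training rows into fixed-size batches. It only
illustrates the batching idea: the function name `make_minibatches` and the toy
data are hypothetical, and the MADlib preprocessing modules documented in their
own group may do considerably more than this.

```python
def make_minibatches(features, labels, batch_size):
    """Pack parallel lists of features and labels into fixed-size batches.

    The last batch may be smaller when the row count is not a multiple of
    batch_size.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    batches = []
    for start in range(0, len(features), batch_size):
        end = start + batch_size
        batches.append((features[start:end], labels[start:end]))
    return batches


X = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
y = [0, 1, 0, 1, 1]
batches = make_minibatches(X, y, batch_size=2)
for xb, yb in batches:
    print(len(xb), yb)
```

A training loop can then consume one batch at a time instead of one row at a
time, which is the optimization that mini-batching provides.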

This flowchart shows the workflow in more detail: | |

\dot | |

digraph { | |

node [fontname=Helvetica, fontsize=10 color=green]; | |

subgraph cluster_0 { | |

label="1. Model Preparation" fontname=Helvetica color=blue; | |

b [ label="Preprocess data"]; | |

c [ label="Define model architectures"]; | |

d [ label="Define custom functions (optional)"]; | |

b -> c -> d; | |

} | |

subgraph cluster_1 { | |

label="2a. Train Single Model" fontname=Helvetica color=blue; | |

e [ label="Fit"]; | |

f [ label="Inference"]; | |

d -> e -> f; | |

} | |

subgraph cluster_2 { | |

label="2b. Train Multiple Models" fontname=Helvetica color=blue; | |

g [ label="Define model configurations"]; | |

h [ label="Fit Multiple"]; | |

i [ label="Inference"]; | |

j [ label="AutoML"]; | |

d -> g -> h -> i; | |

d -> j -> i; | |

} | |

} | |

\enddot | |
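
For the "Define model configurations" box in the flowchart, defining
configurations manually amounts to enumerating combinations of model
architectures and training parameters. The Python sketch below shows one way to
build such a grid. All names in it (`build_configs`, `model_id`,
`learning_rate`, `batch_size`) are illustrative assumptions and do not
correspond to MADlib function or column names.

```python
from itertools import product


def build_configs(model_ids, param_grid):
    """Return one configuration dict per (model, parameter combination)."""
    names = sorted(param_grid)
    configs = []
    for model_id in model_ids:
        # Cartesian product over the candidate values of every parameter.
        for values in product(*(param_grid[n] for n in names)):
            config = {"model_id": model_id}
            config.update(zip(names, values))
            configs.append(config)
    return configs


grid = {"learning_rate": [0.001, 0.01], "batch_size": [32, 64]}
configs = build_configs(model_ids=[1, 2], param_grid=grid)
for config in configs:
    print(config)
```

With two models, two learning rates and two batch sizes, the grid contains eight
configurations. Because the grid grows multiplicatively with each added
parameter, automated search methods become attractive for larger search spaces.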

@{ | |

@defgroup grp_model_prep Model Preparation | |

@brief Prepare models and data for deep learning. | |

@details Prepare models and data for deep learning. | |

@{ | |

@defgroup grp_input_preprocessor_dl Preprocess Data | |

@defgroup grp_keras_model_arch Define Model Architectures | |

@defgroup grp_custom_function Define Custom Functions | |

@} | |

@defgroup grp_keras Train Single Model | |

@defgroup grp_model_selection Train Multiple Models | |

@brief Train multiple deep learning models at the same time for model architecture search and hyperparameter selection. | |

@details Train multiple deep learning models at the same time for model architecture search and hyperparameter selection. | |

@{ | |

@defgroup grp_keras_setup_model_selection Define Model Configurations | |

@defgroup grp_keras_run_model_selection Train Model Configurations | |

@defgroup grp_automl AutoML | |

@} | |

@defgroup grp_dl_utilities Utilities for Deep Learning | |

@brief Utilities specific to deep learning workflows. | |

@details Utilities specific to deep learning workflows. | |

@{ | |

@defgroup grp_gpu_configuration Show GPU Configuration | |

@} | |

@} | |

@defgroup grp_graph Graph | |

@brief Graph algorithms and measures associated with graphs. | |

@details Graph algorithms and measures associated with graphs. | |

@{ | |

@defgroup grp_apsp All Pairs Shortest Path | |

@defgroup grp_bfs Breadth-First Search | |

@defgroup grp_hits HITS | |

@defgroup grp_graph_measures Measures | |

@brief A collection of metrics computed on a graph. | |

@details A collection of metrics computed on a graph. | |

@{ | |

@defgroup grp_graph_avg_path_length Average Path Length | |

@defgroup grp_graph_closeness Closeness | |

@defgroup grp_graph_diameter Graph Diameter | |

@defgroup grp_graph_vertex_degrees In-Out Degree | |

@} | |

@defgroup grp_pagerank PageRank | |

@defgroup grp_sssp Single Source Shortest Path | |

@defgroup grp_wcc Weakly Connected Components | |

@} | |

@defgroup grp_mdl Model Selection | |

@brief Functions for model selection and model evaluation. | |

@details Functions for model selection and model evaluation. | |

@{ | |

@defgroup grp_validation Cross Validation | |

@ingroup grp_mdl | |

@defgroup grp_pred Prediction Metrics | |

@ingroup grp_mdl | |

@defgroup grp_train_test_split Train-Test Split | |

@ingroup grp_mdl | |

@} | |

@defgroup grp_sampling Sampling | |

@brief A collection of methods for sampling from a population. | |

@details A collection of methods for sampling from a population. | |

@{ | |

@defgroup grp_balance_sampling Balanced Sampling | |

@defgroup grp_strs Stratified Sampling | |

@} | |

@defgroup grp_stats Statistics | |

@brief A collection of probability and statistics modules. | |

@details A collection of probability and statistics modules. | |

@{ | |

@defgroup grp_desc_stats Descriptive Statistics | |

@brief Methods to compute descriptive statistics of a dataset. | |

@details Methods to compute descriptive statistics of a dataset. | |

@{ | |

@defgroup grp_sketches Cardinality Estimators | |

@brief Methods to estimate the number of unique values contained in data. | |

@{ | |

@defgroup grp_countmin CountMin (Cormode-Muthukrishnan) | |

@defgroup grp_fmsketch FM (Flajolet-Martin) | |

@defgroup grp_mfvsketch MFV (Most Frequent Values) | |

@} | |

@defgroup grp_correlation Covariance and Correlation | |

@defgroup grp_summary Summary | |

@} | |

@defgroup grp_inf_stats Inferential Statistics | |

@brief Methods to compute inferential statistics of a dataset. | |

@details Methods to compute inferential statistics of a dataset. | |

@{ | |

@defgroup grp_stats_tests Hypothesis Tests | |

@} | |

@defgroup grp_prob Probability Functions | |

@} | |

@defgroup grp_super Supervised Learning | |

@brief Methods to perform a variety of supervised learning tasks. | |

@details Methods to perform a variety of supervised learning tasks. | |

@{ | |

@defgroup grp_crf Conditional Random Field | |

@defgroup grp_knn k-Nearest Neighbors | |

@defgroup grp_nn Neural Network | |

@defgroup grp_regml Regression Models | |

@brief A collection of methods for modeling conditional expectation of a response variable. | |

@details A collection of methods for modeling conditional expectation of a response variable. | |

@{ | |

@defgroup grp_clustered_errors Clustered Variance | |

@defgroup grp_cox_prop_hazards Cox-Proportional Hazards Regression | |

@defgroup grp_elasticnet Elastic Net Regularization | |

@defgroup grp_glm Generalized Linear Models | |

@defgroup grp_linreg Linear Regression | |

@defgroup grp_logreg Logistic Regression | |

@defgroup grp_marginal Marginal Effects | |

@defgroup grp_multinom Multinomial Regression | |

@defgroup grp_ordinal Ordinal Regression | |

@defgroup grp_robust Robust Variance | |

@} | |

@defgroup grp_svm Support Vector Machines | |

@defgroup grp_tree Tree Methods | |

@brief A collection of recursive partitioning (tree) methods. | |

@details A collection of recursive partitioning (tree) methods. | |

@{ | |

@defgroup grp_decision_tree Decision Tree | |

@defgroup grp_random_forest Random Forest | |

@} | |

@} | |

@defgroup grp_tsa Time Series Analysis | |

@brief A collection of methods to analyze time series data. | |

@details A collection of methods to analyze time series data. | |

@{ | |

@defgroup grp_arima ARIMA | |

@} | |

@defgroup grp_unsupervised Unsupervised Learning | |

@brief A collection of methods for unsupervised learning tasks. | |

@details A collection of methods for unsupervised learning tasks. | |

@{ | |

@defgroup grp_association_rules Association Rules | |

@brief Methods used to discover patterns in transactional datasets. | |

@details Methods used to discover patterns in transactional datasets. | |

@{ | |

@defgroup grp_assoc_rules Apriori Algorithm | |

@} | |

@defgroup grp_clustering Clustering | |

@brief Methods for clustering data. | |

@details Methods for clustering data. | |

@{ | |

@defgroup grp_kmeans k-Means Clustering | |

@} | |

@defgroup grp_pca Dimensionality Reduction | |

@brief Methods for reducing the number of variables in a dataset to obtain a set of principal variables.

@details Methods for reducing the number of variables in a dataset to obtain a set of principal variables.

@{ | |

@defgroup grp_pca_train Principal Component Analysis | |

@defgroup grp_pca_project Principal Component Projection | |

@} | |

@defgroup grp_topic_modelling Topic Modelling | |

@brief A collection of methods to uncover abstract topics in a document corpus. | |

@details A collection of methods to uncover abstract topics in a document corpus. | |

@{ | |

@defgroup grp_lda Latent Dirichlet Allocation | |

@} | |

@} | |

@defgroup grp_other_functions Utilities | |

@details Useful utilities for data science workflows. | |

@{ | |

@defgroup grp_cols2vec Columns to Vector | |

@defgroup grp_utilities Database Functions

@defgroup grp_linear_solver Linear Solvers | |

@brief Methods that implement solutions for systems of consistent linear equations. | |

@details Methods that implement solutions for systems of consistent linear equations. | |

@{ | |

@defgroup grp_dense_linear_solver Dense Linear Systems | |

@defgroup grp_sparse_linear_solver Sparse Linear Systems | |

@} | |

@defgroup grp_minibatch_preprocessing Mini-Batch Preprocessor | |

@defgroup grp_pmml PMML Export | |

@defgroup grp_text_utilities Term Frequency | |

@defgroup grp_vec2cols Vector to Columns | |

@} | |

@defgroup grp_early_stage Early Stage Development | |

@details Implementations which are in an early stage of development. | |

Interface and implementation are subject to change. | |

@{ | |

@defgroup grp_cg Conjugate Gradient | |

@defgroup grp_dbscan DBSCAN | |

@defgroup grp_bayes Naive Bayes Classification | |

@defgroup grp_sample Random Sampling | |

@} | |

@defgroup grp_deprecated Deprecated Modules | |

@details Deprecated modules that will be removed in the
next major version (2.0). Newer MADlib modules
have replaced these functions.

@{ | |

@defgroup grp_indicator Create Indicator Variables | |

@defgroup grp_mlogreg Multinomial Logistic Regression | |

@} | |

*/ |