blob: ba601625a17be11deeaf52f38c72ff1dfa4c9923 [file] [log] [blame]
/**
@mainpage
Apache MADlib is an open-source library for scalable
in-database analytics. It provides data-parallel implementations of
mathematical, statistical, graph and machine learning methods for structured
and unstructured data.
Useful links:
<ul>
<li><a href="http://madlib.apache.org">MADlib web site</a></li>
<li><a href="https://cwiki.apache.org/confluence/display/MADLIB">MADlib wiki</a></li>
<li><a href="https://issues.apache.org/jira/browse/MADLIB/">JIRAs for reporting bugs and reviewing backlog</a></li>
<li><a href="https://mail-archives.apache.org/mod_mbox/madlib-user/">User mailing list</a></li>
<li><a href="https://mail-archives.apache.org/mod_mbox/madlib-dev/">Dev mailing list</a></li>
<li>User documentation for earlier releases:
<a href="../v1.17.0/index.html">v1.17.0</a>,
<a href="../v1.16/index.html">v1.16</a>,
<a href="../v1.15.1/index.html">v1.15.1</a>,
<a href="../v1.15/index.html">v1.15</a>,
<a href="../v1.14/index.html">v1.14</a>
</li>
</ul>
Please refer to the
<a href="https://github.com/apache/madlib/blob/master/README.md">ReadMe</a>
file for information about incorporated third-party material. License information
regarding MADlib and included third-party libraries can be found in the
<a href="https://github.com/apache/madlib/blob/master/LICENSE">
License</a> directory.
@defgroup grp_datatrans Data Types and Transformations
@details Data types and operations that transform and shape data.
@defgroup grp_arraysmatrix Arrays and Matrices
@ingroup grp_datatrans
@brief Mathematical operations for arrays and matrices.
@details
These modules provide basic mathematical operations to be run on array and matrices.
For a distributed system, a matrix cannot simply be represented as a 2D array of numbers in memory.
<b>We provide two forms of distributed representation of a matrix</b>:
- Dense: The matrix is represented as a distributed collection of 1-D arrays.
An example 3x10 matrix would be the below table:
<pre>
row_id | row_vec
--------+-------------------------
1 | {9,6,5,8,5,6,6,3,10,8}
2 | {8,2,2,6,6,10,2,1,9,9}
3 | {3,9,9,9,8,6,3,9,5,6}
</pre>
- Sparse: The matrix is represented using the row and column indices for each
non-zero entry of the matrix. Example:
<pre>
row_id | col_id | value
--------+--------+-------
1 | 1 | 9
1 | 5 | 6
1 | 6 | 6
2 | 1 | 8
3 | 1 | 3
3 | 2 | 9
4 | 7 | 0
(6 rows)
</pre>
&nbsp;
All matrix operations work with either form of representation.
In many cases, a matrix function can be <b>decomposed to vector operations
applied independently on each row of a matrix (or corresponding rows of two
matrices)</b>. We have also provided access to these internal vector operations
(\ref grp_array) for greater flexibility. Matrix operations like
<em>matrix_add</em> use the corresponding vector operation (<em>array_add</em>)
and also include additional validation and formating. Other functions like
<em>matrix_mult</em> are complex and use a combination of such vector operations
and other SQL operations.
<b>It's important to note</b> that these array functions are only available for the
dense format representation of the matrix. In general, the scope of a single
array function invocation is limited to only an array (1-dimensional or
2-dimensional) that fits in memory. When such function is executed on a table of
arrays, the function is called multiple times - once for each array (or pair of
arrays). On contrary, scope of a single matrix function invocation is the
complete matrix stored as a distributed table.
@{
@defgroup grp_array Array Operations
@defgroup grp_matrix Matrix Operations
@defgroup grp_matrix_factorization Matrix Factorization
@brief Linear algebra methods that factorize a matrix into a product of matrices.
@details Linear algebra methods that factorize a matrix into a product of matrices.
@{
@defgroup grp_lmf Low-Rank Matrix Factorization
@defgroup grp_svd Singular Value Decomposition
@}
@defgroup grp_linalg Norms and Distance Functions
@defgroup grp_svec Sparse Vectors
@}
@defgroup grp_encode_categorical Encoding Categorical Variables
@ingroup grp_datatrans
@defgroup grp_path Path
@ingroup grp_datatrans
@defgroup grp_pivot Pivot
@ingroup grp_datatrans
@defgroup grp_sessionize Sessionize
@ingroup grp_datatrans
@defgroup grp_stemmer Stemming
@ingroup grp_datatrans
@defgroup grp_dl Deep Learning
@brief A collection of modules for deep learning.
@details
There are three main steps in order to run deep learning workloads with MADlib:
1. <b>Preparation</b>, which includes data preprocessing and model definition.
Data preprocessing is required to format training data for use by frameworks
like Keras and TensorFlow that support mini-batching as an optimization option.
Model definition involves describing model architectures (and optionally
custom functions) and loading them into tables.
2. <b>Model training</b>, either one model at a time or multiple models in parallel.
In the latter case, you will need to define the configurations for the multiple models
that you want to train - this can be done manually or in an automated way using autoML methods.
The trained models can then be used for evaluation and inference.
This flowchart shows the workflow in more detail:
\dot
digraph {
node [fontname=Helvetica, fontsize=10 color=green];
subgraph cluster_0 {
label="1. Model Preparation" fontname=Helvetica color=blue;
b [ label="Preprocess data"];
c [ label="Define model architectures"];
d [ label="Define custom functions (optional)"];
b -> c -> d;
}
subgraph cluster_1 {
label="2a. Train Single Model" fontname=Helvetica color=blue;
e [ label="Fit"];
f [ label="Inference"];
d -> e -> f;
}
subgraph cluster_2 {
label="2b. Train Multiple Models" fontname=Helvetica color=blue;
g [ label="Define model configurations"];
h [ label="Fit Multiple"];
i [ label="Inference"];
j [ label="AutoML"];
d -> g -> h -> i;
d -> j -> i;
}
}
\enddot
@{
@defgroup grp_model_prep Model Preparation
@brief Prepare models and data for deep learning.
@details Prepare models and data for deep learning.
@{
@defgroup grp_input_preprocessor_dl Preprocess Data
@defgroup grp_keras_model_arch Define Model Architectures
@defgroup grp_custom_function Define Custom Functions
@}
@defgroup grp_keras Train Single Model
@defgroup grp_model_selection Train Multiple Models
@brief Train multiple deep learning models at the same time for model architecture search and hyperparameter selection.
@details Train multiple deep learning models at the same time for model architecture search and hyperparameter selection.
@{
@defgroup grp_keras_setup_model_selection Define Model Configurations
@defgroup grp_keras_run_model_selection Train Model Configurations
@defgroup grp_automl AutoML
@}
@defgroup grp_dl_utilities Utilities for Deep Learning
@brief Utilities specific to deep learning workflows.
@details Utilities specific to deep learning workflows.
@{
@defgroup grp_gpu_configuration Show GPU Configuration
@}
@}
@defgroup grp_graph Graph
@brief Graph algorithms and measures associated with graphs.
@details Graph algorithms and measures associated with graphs.
@{
@defgroup grp_apsp All Pairs Shortest Path
@defgroup grp_bfs Breadth-First Search
@defgroup grp_hits HITS
@defgroup grp_graph_measures Measures
@brief A collection of metrics computed on a graph.
@details A collection of metrics computed on a graph.
@{
@defgroup grp_graph_avg_path_length Average Path Length
@defgroup grp_graph_closeness Closeness
@defgroup grp_graph_diameter Graph Diameter
@defgroup grp_graph_vertex_degrees In-Out Degree
@}
@defgroup grp_pagerank PageRank
@defgroup grp_sssp Single Source Shortest Path
@defgroup grp_wcc Weakly Connected Components
@}
@defgroup grp_mdl Model Selection
@brief Functions for model selection and model evaluation.
@details Functions for model selection and model evaluation.
@{
@defgroup grp_validation Cross Validation
@ingroup grp_mdl
@defgroup grp_pred Prediction Metrics
@ingroup grp_mdl
@defgroup grp_train_test_split Train-Test Split
@ingroup grp_mdl
@}
@defgroup grp_sampling Sampling
@brief A collection of methods for sampling from a population.
@details A collection of methods for sampling from a population.
@{
@defgroup grp_balance_sampling Balanced Sampling
@defgroup grp_strs Stratified Sampling
@}
@defgroup grp_stats Statistics
@brief A collection of probability and statistics modules.
@details A collection of probability and statistics modules.
@{
@defgroup grp_desc_stats Descriptive Statistics
@brief Methods to compute descriptive statistics of a dataset.
@details Methods to compute descriptive statistics of a dataset.
@{
@defgroup grp_sketches Cardinality Estimators
@brief Methods to estimate the number of unique values contained in data.
@{
@defgroup grp_countmin CountMin (Cormode-Muthukrishnan)
@defgroup grp_fmsketch FM (Flajolet-Martin)
@defgroup grp_mfvsketch MFV (Most Frequent Values)
@}
@defgroup grp_correlation Covariance and Correlation
@defgroup grp_summary Summary
@}
@defgroup grp_inf_stats Inferential Statistics
@brief Methods to compute inferential statistics of a dataset.
@details Methods to compute inferential statistics of a dataset.
@{
@defgroup grp_stats_tests Hypothesis Tests
@}
@defgroup grp_prob Probability Functions
@}
@defgroup grp_super Supervised Learning
@brief Methods to perform a variety of supervised learning tasks.
@details Methods to perform a variety of supervised learning tasks.
@{
@defgroup grp_crf Conditional Random Field
@defgroup grp_knn k-Nearest Neighbors
@defgroup grp_nn Neural Network
@defgroup grp_regml Regression Models
@brief A collection of methods for modeling conditional expectation of a response variable.
@details A collection of methods for modeling conditional expectation of a response variable.
@{
@defgroup grp_clustered_errors Clustered Variance
@defgroup grp_cox_prop_hazards Cox-Proportional Hazards Regression
@defgroup grp_elasticnet Elastic Net Regularization
@defgroup grp_glm Generalized Linear Models
@defgroup grp_linreg Linear Regression
@defgroup grp_logreg Logistic Regression
@defgroup grp_marginal Marginal Effects
@defgroup grp_multinom Multinomial Regression
@defgroup grp_ordinal Ordinal Regression
@defgroup grp_robust Robust Variance
@}
@defgroup grp_svm Support Vector Machines
@defgroup grp_tree Tree Methods
@brief A collection of recursive partitioning (tree) methods.
@details A collection of recursive partitioning (tree) methods.
@{
@defgroup grp_decision_tree Decision Tree
@defgroup grp_random_forest Random Forest
@}
@}
@defgroup grp_tsa Time Series Analysis
@brief A collection of methods to analyze time series data.
@details A collection of methods to analyze time series data.
@{
@defgroup grp_arima ARIMA
@}
@defgroup grp_unsupervised Unsupervised Learning
@brief A collection of methods for unsupervised learning tasks.
@details A collection of methods for unsupervised learning tasks.
@{
@defgroup grp_association_rules Association Rules
@brief Methods used to discover patterns in transactional datasets.
@details Methods used to discover patterns in transactional datasets.
@{
@defgroup grp_assoc_rules Apriori Algorithm
@}
@defgroup grp_clustering Clustering
@brief Methods for clustering data.
@details Methods for clustering data.
@{
@defgroup grp_kmeans k-Means Clustering
@}
@defgroup grp_pca Dimensionality Reduction
@brief Methods for reducing the number of variables in a dataset to obtain a set of principle variables.
@details Methods for reducing the number of variables in a dataset to obtain a set of principle variables.
@{
@defgroup grp_pca_train Principal Component Analysis
@defgroup grp_pca_project Principal Component Projection
@}
@defgroup grp_topic_modelling Topic Modelling
@brief A collection of methods to uncover abstract topics in a document corpus.
@details A collection of methods to uncover abstract topics in a document corpus.
@{
@defgroup grp_lda Latent Dirichlet Allocation
@}
@}
@defgroup grp_other_functions Utilities
@details Useful utilities for data science workflows.
@{
@defgroup grp_cols2vec Columns to Vector
@defgroup @grp_utilities Database Functions
@defgroup grp_linear_solver Linear Solvers
@brief Methods that implement solutions for systems of consistent linear equations.
@details Methods that implement solutions for systems of consistent linear equations.
@{
@defgroup grp_dense_linear_solver Dense Linear Systems
@defgroup grp_sparse_linear_solver Sparse Linear Systems
@}
@defgroup grp_minibatch_preprocessing Mini-Batch Preprocessor
@defgroup grp_pmml PMML Export
@defgroup grp_text_utilities Term Frequency
@defgroup grp_vec2cols Vector to Columns
@}
@defgroup grp_early_stage Early Stage Development
@details Implementations which are in an early stage of development.
Interface and implementation are subject to change.
@{
@defgroup grp_cg Conjugate Gradient
@defgroup grp_dbscan DBSCAN
@defgroup grp_bayes Naive Bayes Classification
@defgroup grp_sample Random Sampling
@}
@defgroup grp_deprecated Deprecated Modules
Deprecated modules that will be removed in the
next major version (2.0). There are newer MADlib modules
that have replaced these functions.
@{
@defgroup grp_indicator Create Indicator Variables
@defgroup grp_mlogreg Multinomial Logistic Regression
@}
*/