MADlib Release Notes | |

-------------------- | |

These release notes contain the significant changes in each MADlib release, | |

with most recent versions listed at the top. | |

A complete list of changes for each release can be obtained by viewing the git | |

commit history located at https://github.com/apache/madlib/commits/master. | |

Current list of bugs and issues can be found at https://issues.apache.org/jira/browse/MADLIB. | |

-------------------------------------------------------------------------------- | |

MADlib v1.15.1: | |

Release Date: 2018-Oct-XX | |

New features: | |

- Add Ubuntu support for MADlib (MADLIB-1256). | |

- Elastic Net: Add grouping by non-numeric column support (MADLIB-1262). | |

- KNN: Accept expressions for point_column_name and test_column_name (MADLIB-1060). | |

- Vec2Cols: Allow arrays of different lengths (MADLIB-1270). | |

- Madpack: Add a script for automating changelist creation. | |

Bug fixes: | |

- Allocator: Remove 16-byte alignment in GPDB 6. | |

- Build: Download compatible Boost if version >= 1.65 (MADLIB-1235). | |

- Build: Remove primary key constraint in IC/DC. | |

- CMake: Fix false positive for Postgres 10+ check. | |

- Graph: Add id of nodes with 0 in-degree (MADLIB-1279). | |

- Margins: Copy summary table instead of renaming (MADLIB-1276). | |

- MLP: Simplify momentum and Nesterov updates (MADLIB-1272). | |

- Upgrade: Fix issue with upgrading RPM to 1.15.1 (MADLIB-1278). | |

- Utilities: Use plpy.quote_ident if available. | |

Others: | |

- Simplify maintenance by removing online examples from SQL functions (MADLIB-1260). | |

- Re-enable PCA and PageRank tests (MADLIB-1264). | |

- Build: Disable AppendOnly if available (MADLIB-1273). | |

- Improve documentation of various modules. | |

-------------------------------------------------------------------------------- | |

MADlib v1.15: | |

Release Date: 2018-Aug-15 | |

New features: | |

* MLP: Added momentum and Nesterov's accelerated gradient methods to gradient | |

updates (MADLIB-1210). | |

* New modules: | |

- drop_cols: Create new table from an existing table (CTAS) using an | |

expression of column names (MADLIB-1241). | |

- cols2vec: Create an array from multiple columns (similar to ARRAY[...] | |

with columns obtained using an expression) (MADLIB-1239). | |

- vec2cols: Create multiple columns from an existing array (MADLIB-1240). | |

* Statistics: Added grouping support to correlation and covariance | |

functions (MADLIB-1128). | |

* DT/RF: | |

- Added impurity importance values in DT and RF (MADLIB-1205, 1246, 1249). | |

- Added a new function (get_var_importance) to report importance values | |

in a cleaner interface (MADLIB-925). | |

* Madpack: | |

- Refactored and updated the installation scripts to ensure install, | |

reinstall, install-check are all run from a single SQL file as an atomic | |

operation (MADLIB-1242). | |

- Moved most of install-check operations to a new "dev-check", making | |

install-check smaller and faster to run. | |

- Added new option to run unit-tests (MADLIB-1251, 1252). | |
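
Illustration for the MLP momentum and Nesterov entry above: a minimal sketch of the two update rules on a toy quadratic, assuming the textbook formulation. The names momentum_step, minimize and quadratic_grad are hypothetical and are not MADlib functions; MADlib's internal implementation may differ.

```python
def momentum_step(w, v, grad_fn, lr=0.1, mu=0.9, nesterov=False):
    """One gradient update with classical or Nesterov momentum.

    w, v: current weights and velocity (lists of floats).
    grad_fn: hypothetical callable returning the gradient at a point.
    """
    if nesterov:
        # Nesterov: evaluate the gradient at the look-ahead point w + mu * v.
        point = [wi + mu * vi for wi, vi in zip(w, v)]
    else:
        # Classical momentum: evaluate the gradient at the current point.
        point = w
    g = grad_fn(point)
    v_new = [mu * vi - lr * gi for vi, gi in zip(v, g)]
    w_new = [wi + vi for wi, vi in zip(w, v_new)]
    return w_new, v_new


def minimize(grad_fn, w0, steps=300, **kwargs):
    """Run a fixed number of momentum steps starting from w0."""
    w, v = list(w0), [0.0] * len(w0)
    for _ in range(steps):
        w, v = momentum_step(w, v, grad_fn, **kwargs)
    return w


def quadratic_grad(w):
    """Gradient of the toy loss sum((w_i - 3)^2), minimized at w_i = 3."""
    return [2.0 * (wi - 3.0) for wi in w]
```

Calling minimize(quadratic_grad, [0.0, 10.0], nesterov=True) drives both weights toward 3; the only difference between the two variants is where the gradient is evaluated.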

Bug fixes: | |

- Fixed an ABI issue that prevented compiling MADlib on GCC 5+ | |

(MADLIB-1025). | |

- Decision trees: | |

- Fixed a minor bug that prevented converting a sparse vector to float8[] | |

(MADLIB-1234). | |

- Fixed a bug that led to dependent type being obtained from a NULL | |

value (MADLIB-1233). | |

- Summary table has been updated to ensure correct feature names are | |

populated (MADLIB-1236). | |

- Fixed incorrect indexing of trueChild and falseChild in surrogate | |

agreement calculation. | |

- Removed categorical variable elimination to avoid issues with varying | |

categorical variables for different groups (MADLIB-1258, 1254). | |

- Logregr: Fixed issue where an output table could be empty for grouping | |

(MADLIB-1172). | |

- Added special characters support for multiple modules | |

(MADLIB-1237, 1238, 1243). | |

- Build: Removed invalid symlinks left behind after an uninstall | |

(MADLIB-1175). | |

- Updated SVM to correctly report loss per row instead of total loss. | |

- Refactored internal CV function to fix multiple issues with cross | |

validation on SVM (MADLIB-1250). | |

- Worked around a "cache lookup" issue that prevented dropping of | |

install-check user (MADLIB-1014). | |

- Pagerank: Removed duplicate entries from grouping output | |

(MADLIB-1229, 1253). | |

- Madpack: Install-check user is dropped even after an IC failure | |

(MADLIB-1182). | |

Others: | |

- Removed HAWQ support from all modules | |

-------------------------------------------------------------------------------- | |

MADlib v1.14: | |

Release Date: 2018-April-28 | |

New features: | |

* New module - Balanced datasets: A sampling module to balance classification | |

datasets by resampling using various techniques including undersampling, | |

oversampling, uniform sampling or user-defined proportion sampling | |

(MADLIB-1168) | |

* Mini-batch: Added a mini-batch optimizer for MLP and a preprocessor function | |

necessary to create batches from the data (MADLIB-1200, MADLIB-1206, | |

MADLIB-1220, MADLIB-1224, MADLIB-1226, MADLIB-1227) | |

* k-NN: Added weighted averaging/voting by distance (MADLIB-1181) | |

* Summary: Added additional stats: number of positive, negative, zero values and | |

95% confidence intervals for the mean (MADLIB-1167) | |

* Encode categorical: Updated to produce lower-case column names when possible | |

(MADLIB-1202) | |

* MLP: Added support for already one-hot encoded categorical dependent variable | |

in a classification task (MADLIB-1222) | |

* Pagerank: Added an option for personalization vertices: a subset of vertices | |

is given a higher jump probability than the other vertices, so a random | |

surfer is more likely to jump to these personalization vertices | |

(MADLIB-1084) | |
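
Illustration for the personalized PageRank entry above: a minimal sketch of one common formulation, in which the random surfer's jump lands only on the personalization vertices. The function name and parameters are hypothetical, and the exact weighting used by MADlib may differ.

```python
def personalized_pagerank(edges, num_vertices, personalization=None,
                          damping=0.85, iterations=100):
    """Power iteration for PageRank with an optional personalization set.

    edges: list of (src, dst) pairs over vertices 0..num_vertices-1.
    personalization: optional set of vertices that receive all jump mass.
    """
    n = num_vertices
    out_links = {v: [] for v in range(n)}
    for src, dst in edges:
        out_links[src].append(dst)

    # Jump distribution: uniform over all vertices by default, otherwise
    # uniform over the personalization vertices only (an assumption).
    if personalization:
        jump = [1.0 / len(personalization) if v in personalization else 0.0
                for v in range(n)]
    else:
        jump = [1.0 / n] * n

    rank = list(jump)
    for _ in range(iterations):
        new_rank = [(1.0 - damping) * j for j in jump]
        for v in range(n):
            if out_links[v]:
                share = damping * rank[v] / len(out_links[v])
                for dst in out_links[v]:
                    new_rank[dst] += share
            else:
                # Dangling vertex: redistribute its rank via the jump vector.
                for u in range(n):
                    new_rank[u] += damping * rank[v] * jump[u]
        rank = new_rank
    return rank
```

On a four-vertex ring, the uniform variant gives every vertex the same rank, while passing personalization={0} makes vertex 0 the highest-ranked vertex.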

Bug fixes: | |

- Fixed issue with invalid calls of construct_array that led to problems | |

in PostgreSQL 10 (MADLIB-1185) | |

- Added newline between file concatenation during PGXN install (MADLIB-1194) | |

- Fixed upgrade issues in knn (MADLIB-1197) | |

- Added fix to ensure RF variable importance values are always non-negative | |

- Fixed inconsistency in LDA output and improved usability | |

(MADLIB-1160, MADLIB-1201) | |

- Fixed MLP and RF predict for models trained in earlier versions to | |

ensure missing optional parameters are given appropriate default values | |

(MADLIB-1207) | |

- Fixed a scenario in DT where the database crashed because no features | |

remained after categorical columns with a single level were dropped | |

- Fixed step size initialization in MLP based on learning rate policy | |

(MADLIB-1212) | |

- Fixed PCA issue that leads to failure when grouping column is a TEXT type | |

(MADLIB-1215) | |

- Fixed cat levels output in DT when grouping is enabled (MADLIB-1218) | |

- Fixed and simplified initialization of model coefficients in MLP | |

- Removed source table dependency for predicting regression models in MLP | |

(MADLIB-1223) | |

- Print loss of first iteration in MLP (MADLIB-1228) | |

- Fixed MLP failure on GPDB 4.3 when verbose=True (MADLIB-1209) | |

- Fixed RF issue that showed up when var_importance=True with no continuous | |

features (MADLIB-1219) | |

- Fixed DT/RF issue for null_as_category=True and grouping enabled | |

(MADLIB-1217) | |

Other: | |

- Reduced install-check runtime for PCA, DT, RF, elastic net (MADLIB-1216) | |

- Added CentOS 7 PostgreSQL 9.6/10 docker files | |

-------------------------------------------------------------------------------- | |

MADlib v1.13: | |

Release Date: 2017-December-22 | |

New features: | |

* New module: Graph - HITS (MADLIB-1124, MADLIB-1151) | |

* k-NN: | |

- Added additional distance metrics (MADLIB-1059) | |

- Added list of neighbors in output table (MADLIB-1129) | |

* MLP: Added grouping support (MADLIB-1149) | |

* Cross Validation: Improved the stats reporting in output table (MADLIB-1169) | |

* Correlation: Improved quality of results by ignoring only a NULL value and | |

not the whole row containing the NULL (MADLIB-1166) | |
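
Illustration for the Correlation entry above: a minimal sketch contrasting row-wise deletion (drop any row that contains a NULL) with pairwise handling (skip a value only for the column pairs it affects). This is an assumed reading of the entry, with None standing in for NULL; the helper names are hypothetical and MADlib's actual NULL handling may differ.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def pairwise_corr(rows, i, j):
    """Correlate columns i and j, skipping rows only if i or j is None."""
    pairs = [(r[i], r[j]) for r in rows
             if r[i] is not None and r[j] is not None]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)


def rowwise_corr(rows, i, j):
    """Correlate columns i and j after dropping every row with any None."""
    complete = [r for r in rows if all(value is not None for value in r)]
    return pairwise_corr(complete, i, j)


# Column 2 has two missing values; columns 0 and 1 are fully populated.
rows = [(1, 2, 5), (2, 4, None), (3, 6, 7), (4, 8, None), (5, 9, 1)]
```

With this data, pairwise_corr(rows, 0, 1) uses all five rows, while rowwise_corr(rows, 0, 1) uses only the three rows that have no missing value in any column.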

Bug fixes: | |

- Fixed issue with Decision Trees (DT) trained in older versions not | |

being usable in predict of v1.12 (MADLIB-1161) | |

- Fixed invalid assert statement in DT (MADLIB-1164) | |

- Improved feature array handling in DT (MADLIB-1173) | |

- Fixed install-check failures on non-default schema installation (MADLIB-1177, 1184) | |

Other: | |

- Updated PyXB from 1.2.4 to 1.2.6. (MADLIB-1103) | |

This change eliminates the need to remove part of PyXB code base as a | |

GPL-workaround. | |

- Updated the naming for gppkg (MADLIB-1183) | |

-------------------------------------------------------------------------------- | |

MADlib v1.12: | |

Release Date: 2017-August-18 | |

New features: | |

* New module: Graph - All Pairs Shortest Path (MADLIB-1072, MADLIB-1099, MADLIB-1106) | |

* New module: Graph - Weakly Connected Components (MADLIB-1071, MADLIB-1083, MADLIB-1101) | |

* New module: Graph - Breadth First Search (MADLIB-1102) | |

* New module: Graph - Measures (MADLIB-1073) | |

* New Module: Sample - Stratified Sampling (MADLIB-986) | |

* New Module: Sample - Train-test split (MADLIB-1119) | |

* New Module: Multilayer Perceptron (MADLIB-413, MADLIB-1134) | |

* DT and RF: | |

- Allow expressions in feature list (MADLIB-1087) | |

- Allow array input for features (MADLIB-965) | |

- Filter NULL dependent values in OOB (MADLIB-1097) | |

- Add option to treat NULL as category | |

* Summary: | |

- Allow user to determine the number of columns per run (MADLIB-1117) | |

- Improve efficiency of computation time by ~35% (MADLIB-1104) | |

* Sketch: | |

- Promote cardinality estimators to top level module from early stage (MADLIB-1120) | |

* Add basic code coverage support (MADLIB-1138) | |

* Updates for Apache Top Level Project readiness (MADLIB-1112, MADLIB-1130, MADLIB-1133, MADLIB-1142) | |

Bug fixes: | |

- DT and RF: | |

- Fix array to string conversion with CV | |

- Include NULL rows in count for termination check | |

- Sketch: | |

- Remove per-tuple checks for better performance | |

- PageRank: | |

- Fix multiple bugs and perf issue in grouping (MADLIB-1100, MADLIB-1107) | |

- Kmeans: | |

- Fix IC drop table statements | |

- Graph: | |

- Fix quoted output table name bug (MADLIB-1137) | |

- Fix empty string arguments bug | |

- Elastic Net: | |

- Fix the data scaling bug with normalization (MADLIB-1094) | |

- Reduce the tolerance for a faster IC test (MADLIB-1118) | |

- Control: | |

- Update 'optimizer' GUC only if editable (MADLIB-1109) | |

Other: | |

- Build: Add CDATA block to avoid invalid xml | |

- Multiple user documentation improvements | |

-------------------------------------------------------------------------------- | |

MADlib v1.11: | |

Release Date: 2017-May-05 | |

New features: | |

* New module: Graph - PageRank | |

- Implements the original PageRank algorithm that assumes a random surfer model | |

(https://en.wikipedia.org/wiki/PageRank#Damping_factor) (MADLIB-1069) | |

- Grouping support is included for PageRank (MADLIB-1082) | |

* Graph - Single Source Shortest Path (SSSP): Add grouping support (MADLIB-1081) | |

* Pivot: Add support for array and svec output types (MADLIB-1066) | |

* DT and RF: | |

- Change default values for 2 parameters (max_depth and num_splits) | |

- Reduce memory footprint: Assign memory only for reachable nodes (MADLIB-1057) | |

- Include rows with NULL features in training (MADLIB-1095) | |

- Update error message for invalid parameter specification (num_splits) | |

* Array Operations: Add function to unnest 2-D arrays by one level into rows | |

of 1-D arrays (MADLIB-1086) | |

* Build process on Apache infrastructure (MADLIB-920, MADLIB-1080) | |

* Updates for Apache Top Level Project readiness (MADLIB-1022, MADLIB-1076, | |

MADLIB-1077, MADLIB-1090) | |

* Support for GPDB 5.0 | |

Bug fixes: | |

- DT and RF: | |

- Fix accuracy issues related to integer categorical variables and tree depth | |

- Improve visualization of tree(s) | |

- Elastic Net: | |

- Fix install check on GPDB 5.0 and HAWQ 2.2 (MADLIB-1088) | |

- Fix inconsistent results with grouping (MADLIB-1092) | |

- PCA: Fix install check | |

Other: | |

- PMML: Skip install check when run without the '-t' option (MADLIB-1078) | |

- Multiple user documentation improvements | |

-------------------------------------------------------------------------------- | |

MADlib v1.10.0 | |

Release Date: 2017-February-17 | |

New features: | |

* New module: Graph - Single Source Shortest Path (SSSP) (MADLIB-992) | |

- Calculate the shortest path from a given vertex to every vertex in the graph. | |

* New module: Encode categorical variables (MADLIB-1038) | |

- Completely new version for dummy/one-hot encoding of categorical variables | |

with new name and different arguments. | |

- Previous version has been deprecated. | |

* New module (early stage): K-Nearest Neighbors (KNN) (MADLIB-927) | |

- Find the k nearest neighbors based on the squared_dist_norm2 metric. | |

* Elastic Net: Add grouping support (MADLIB-950) | |

- Elastic net train for both Gaussian and Binomial models, with FISTA | |

and IGD optimizations support grouping. | |

- Use active sets for FISTA, but active sets are used only after the | |

log-likelihood of all the groups becomes 0. | |

* Elastic Net: Add cross validation (MADLIB-996) | |

* PCA: Add grouping support (MADLIB-947) | |

* PCA: Removed column id restriction. | |

* Kmeans: Cluster variance for PivotalR support. | |

* Kmeans: Support for array input. (MADLIB-1018) | |

* DT and RF: Verbose option for the dot output format. (MADLIB-1051) | |

* Association Rules: Add rule counts and limit itemset size feature | |

(MADLIB-1044, MADLIB-1031) | |

* Boost library has been upgraded from 1.47 to 1.61 | |

* Multiple improvements to the build system (madpack, cmake etc.) to support | |

Semantic versioning and various versions of GPDB and HAWQ. | |
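
Illustration for the K-Nearest Neighbors entry above: a minimal brute-force sketch of finding the k nearest neighbors under a squared Euclidean distance. The Python function squared_dist_norm2 below only mirrors the metric name from the entry; it and k_nearest are hypothetical helpers, not the MADlib implementation.

```python
def squared_dist_norm2(x, y):
    """Squared Euclidean (L2) distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def k_nearest(points, query, k):
    """Return the indices of the k points closest to query, nearest first."""
    order = sorted(range(len(points)),
                   key=lambda idx: squared_dist_norm2(points[idx], query))
    return order[:k]


points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (2.0, 2.0)]
nearest = k_nearest(points, (0.9, 0.9), 2)
```

Because the square root is monotonic, ranking by squared distance yields the same neighbors as ranking by Euclidean distance while avoiding the root.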

Bug fixes: | |

- Pivot: Adjust the warning level to remove redundant messages. | |

- RF: Fix the online help and examples. | |

- Utilities: Fix incorrect flag for distribution. | |

- Install check: Update date format and remove hardcoded schema names. | |

- Multiple user documentation improvements. | |

-------------------------------------------------------------------------------- | |

MADlib v1.9.1 | |

Release Date: 2016-August-25 | |

New features: | |

* New function: One class SVM (MADLIB-990) | |

- Added a one-class SVM that classifies new data as similar or different to | |

the training set. | |

- This method is an unsupervised method that builds a decision boundary | |

between the data and origin in kernel space and can be used as a novelty | |

detector. | |

* SVM: Added functionality to assign weights to each class, simplifying | |

classification of unbalanced data. (MADLIB-998) | |

* New function: Prediction metrics (MADLIB-907) | |

Added a collection of summary statistics to gauge model accuracy based on | |

predicted values vs. ground-truth values. | |

* New function: Sessionization (MADLIB-909, MADLIB-1001) | |

Added a sessionize function to perform session reconstruction on a data | |

set so it can be prepared for input into other algorithms such as | |

path functions or predictive analytics algorithms. | |

* New function: Pivot (MADLIB-908, MADLIB-1004) | |

Added a function that can do basic OLAP type operations on data stored | |

in one table and output the summarized data to a second table. | |

* Path: Major performance improvement (MADLIB-984) | |

* Path: Add support for overlapping patterns (MADLIB-995) | |

* Build: Add support for PG 9.5 and 9.6 (MADLIB-944) | |

* PGXN: Update PostgreSQL Extension Network to latest release (MADLIB-959) | |

Bug fixes: | |

- Random Forest: Fix filtered feature related bug (MADLIB-928) | |

- Elastic Net: Skip arrays with NULL values in train (MADLIB-978) | |

- Matrix: Fix starting index in extract functions (MADLIB-1006) | |

- Path: Allow multiple expressions in partition expression (MADLIB-1003) | |

- DT: Fix bin computation for boolean features (MADLIB-1011) | |

- Multiple user documentation improvements (MADLIB-1001) | |

-------------------------------------------------------------------------------- | |

MADlib v1.9 | |

Release Date: 2016-April-04 | |

New features: | |

* New module: Path | |

- Perform pattern matching over a sequence of rows and extract useful | |

information about the pattern matches. | |

- Useful in a wide variety of use cases: on-line shopping, predictive | |

maintenance, cyber security, IoT, customer churn, etc. | |

- Define arbitrarily complex symbols to identify rows of interest. | |

- Perform regular pattern matching of symbols over a sequence of ordered partitions. | |

- Extract useful information about the pattern matches (counts, | |

aggregations, window functions). | |

* New module: Support Vector Machines (SVM) | |

- Complete rewrite of SVM algorithm to improve accuracy and performance. | |

- Support for classification and regression. | |

- Support for non-linear kernels (Gaussian and Polynomial). | |

- Cross validation support on parameters: lambda, epsilon, initial step size, | |

maximum iterations, and decay factor. | |

* New module: Stemmer function | |

- Compute the root of any English text input using Porter2 stemming algorithm. | |

* New matrix operations (Phase 2) | |

- Added the following operations/functions for dense and sparse matrices: | |

- Representation: get matrix dimensions | |

- Extraction/visitor methods: extract diagonal elements | |

- Reduction operations: compute matrix norm | |

- Creation methods: initialize with ones, initialize with zeros, | |

square identity matrix, diagonal matrix, sample from distribution | |

(Normal, Uniform, Bernoulli) | |

- Decomposition operations: inverse, generic inverse, eigen extraction, | |

Cholesky decomposition, QR decomposition, LU decomposition, nuclear norm, rank | |

* Pearson's correlation module: added option to return the covariance matrix | |

* PCA: added option to use proportion of variance to determine number of | |

principal components to return (MADLIB-948) | |

* PivotalR support for Latent Dirichlet Allocation (LDA) | |

* Quotation and international character support (Phase 2) | |

- All modules now support table and column names that are quoted and | |

contain international characters. This release adds support for: | |

- Cross Validation | |

- Dense Linear Systems | |

- Sparse Linear Systems | |

- Low-rank Matrix Factorization | |

- Conditional Random Field | |

- Hypothesis Tests | |

- Support Modules/Data Preparation | |

- Support Modules/PMML Export | |

- ARIMA | |

* New platform: | |

- Added support for HAWQ 2.0 | |

* Miscellaneous: | |

- Updated documentation and more examples | |

- Term frequency: added support for custom column names | |

- Updated licensing files and headers to comply with ASF regulations | |

Bug fixes: | |

- Elastic Net: Skips arrays with NULL values in predict (MADLIB-919) | |

- Hello World example: Fixed 'this' pointer errors (MADLIB-967) | |

- Hypothesis tests: Fixed docs and examples (MADLIB-895) | |

- Matrix: Fixed inconsistent type in drop statements | |

- Decision Tree: Fixed format specifier in online help (MADLIB-968) | |

- Minor: Updated volatile install-check | |

- LDA: Fixed the padding for LDA model | |

- Decision tree: Fixed to cast count(*) output to long (MADLIB-917) | |

- Validation: Fixed varchar array error in install-check | |

- Matrix: Fixed multiple input/output issues (MADLIB-932) | |

- Matrix: Fixed minor issue with sparse LU output | |

- Summary: Fixed the case for unquoted table names by moving the compare to | |

SQL (MADLIB-954) | |

- Correlation: Fixed to return columns sorted in ordinal position. (MADLIB-941) | |

- Elastic Net: Removed the enforcement of same numeric type while keeping the | |

error for non-numeric types. (MADLIB-952) | |

- K-means: Fixed the error caused by a null value in the matrix or vector. | |

(MADLIB-946) | |

-------------------------------------------------------------------------------- | |

MADlib v1.8 | |

Release Date: 2015-July-17 | |

New features: | |

* Improved Latent Dirichlet Allocation (LDA) Performance | |

- Function lda_train() is about twice as fast. | |

- Improved the scalability of the function | |

(vocabulary size x number of topics can be up to 250 million). | |

* New module: Matrix operations | |

Added the following operations/functions for dense and sparse matrices: | |

- Mathematical operations: addition, subtraction, multiplication, | |

element-wise multiplication, scalar and vector multiplication. | |

- Aggregation operations: apply various operations including | |

max, min, sum, mean along a specified dimension. | |

- Visitor methods: extract row/column from matrix. | |

- Representation: convert a matrix to either dense or sparse representation. | |

* Quotation and International Character Support | |

- Most modules now support table and column names that are quoted and | |

contain international characters, including: | |

- Regression models (GLMs, linear regression, elastic net, etc.) | |

- Decision trees and random forests | |

- Unsupervised learning models (association rules, k-means, LDA, etc.) | |

- Summary, Pearson's correlation, and PCA | |

* Array Norms and Distances | |

- Generic p-norm distance | |

- Jaccard distance | |

- Cosine similarity | |

* Text Analysis: | |

- Text utility for term frequency and vocabulary construction (prepares | |

documents for input to LDA). | |

* Miscellaneous | |

- Improved organization of User and Developer guide at doc.madlib.net/latest. | |

- Low-rank matrix factorization: added 32-bit integer support (MADLIB-903). | |

- Cross-validation: added classification support (MADLIB-908). | |

- Added a new clean-up function for removing MADlib temporary tables. | |
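
Illustration for the Array Norms and Distances entry above: a minimal sketch of the three measures using their standard definitions. The function names are hypothetical and are not the MADlib SQL function names.

```python
from math import sqrt


def p_norm_distance(x, y, p):
    """Generic p-norm (Minkowski) distance between two vectors, p >= 1."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B|, treating the inputs as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return 1.0 - len(set_a & set_b) / len(union)


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)
```

With p=1 the generic distance reduces to the Manhattan distance and with p=2 to the Euclidean distance.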

Note: | |

- LDA models that are trained using MADlib v1.7.1 or earlier need to be | |

re-trained to be used in MADlib v1.8. | |

Known issues: | |

- Performance for decision tree with cross-validation is poor on a HAWQ | |

multi-node system. | |

-------------------------------------------------------------------------------- | |

MADlib v1.7.1 | |

Release Date: 2015-March-18 | |

New features: | |

* Random Forest Performance Improvement | |

- Function forest_train() is 1.5X ~ 4X faster without variable importance, | |

and up to 100X faster with variable importance | |

- Function forest_predict() is up to 10X faster when type='response' | |

- Allow user-specified sample ratio to train with a small subsample | |

* Gaussian Naive Bayes: allow continuous variables | |

* K-Means: Allow user-specified sample ratio for K-means++ seeding | |

* Miscellaneous | |

- Array functions: array_square() for element-wise square, madlib.sum() | |

for array element-wise aggregation | |

- Madpack does not require password when not necessary (MADLIB-357) | |

- Platform support of PostgreSQL 9.4 and HAWQ 1.3 | |

- Allow views and materialized views for training functions | |

- Support quantile computation in summary functions for HAWQ and PG 9.4 | |

Bug fixes: | |

- Fixed the support of multiple parameter values and NULL in general | |

cross-validation (MADLIB-898, MADLIB-896) | |

- Fixed infinite loop when detecting recursive view-to-view dependencies for | |

upgrading (MADLIB-901) | |

- Allow user-specified column names in PCA and multinom_predict() | |

Known issues: | |

- Performance for decision tree with cross-validation is poor on a HAWQ | |

multi-node system. | |

-------------------------------------------------------------------------------- | |

MADlib v1.7 | |

Release Date: 2014-December-31 | |

New features: | |

* Generalized Linear Model: | |

- Added a new generic module for GLM functions that allow for response | |

variables that have arbitrary distributions (rather than simply | |

Gaussian distributions), and for an arbitrary function of the response | |

variable (the link function) to vary linearly with the predicted values | |

(rather than assuming that the response itself must vary linearly). | |

- Available distribution families: gaussian (link functions: identity, | |

inverse and log), binomial (link functions: probit and logit), | |

poisson (link functions: log, identity and square-root), gamma (link | |

functions: inverse, identity and log) and inverse gaussian (link functions: | |

square-inverse, inverse, identity and log). | |

- Deprecated 'mlogregr_train' in favor of 'multinom' available as part of | |

the new GLM functionality. | |

- Added a new 'ordinal' function for ordered logit and probit regression. | |

* Decision Tree: Reimplemented the decision tree module, which includes the following | |

changes: | |

- Improved usability due to a new interface. | |

- Performance enhancements up to 40 times faster than the old interface. | |

- Additional features like pruning methods, surrogate variables for | |

NULL handling, cross validation, and various new tree tuning parameters. | |

- Addition of a new display function to visualize the trained tree and new | |

prediction function for scoring of new datasets. | |

* Random Forest: Reimplemented the random forest module, which includes the following | |

changes: | |

- New random forest module based on the new decision tree module. | |

- Better variable importance metrics and ability to explore each tree | |

in the forest independently. | |

- Ability to get class probabilities of all classes and not just the max | |

class during prediction. | |

- Improved visualization with export capabilities using Graphviz dot format. | |

* PMML: | |

- Upgraded compatible PMML version to 4.1. | |

- Moved PMML export out of early stage development with new functionality | |

available to export GLM, decision tree, and random forest models. | |

* Updated Eigen from 3.1.2 to 3.2.2. | |

* Updated PyXB from 1.2.3 to 1.2.4. | |

* Added finer granularity control for running specific install-check tests. | |
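
Illustration for the Generalized Linear Model entry above: a minimal sketch of how a link function connects the linear predictor to the mean of the response. The LINKS table and predict_mean are hypothetical names chosen for this example; they cover only some of the links listed above and are not the MADlib API.

```python
from math import exp, log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()

# Each entry maps a link name to (link, inverse_link):
# link takes the mean mu to the linear predictor eta, inverse_link goes back.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (log, exp),
    "inverse": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
    "sqrt": (sqrt, lambda eta: eta * eta),
    "logit": (lambda mu: log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + exp(-eta))),
    "probit": (_STD_NORMAL.inv_cdf, _STD_NORMAL.cdf),
}


def predict_mean(coef, x, link):
    """Predicted mean response: inverse link applied to the dot product."""
    eta = sum(c * xi for c, xi in zip(coef, x))
    _, inverse_link = LINKS[link]
    return inverse_link(eta)
```

For example, predict_mean([0.0, 0.0], [1.0, 2.0], "logit") returns 0.5, because a linear predictor of 0 maps to a probability of one half under the logit link.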

Bug fixes: | |

- Fixed bug in K-means allowing use of user-defined metric functions | |

(MADLIB-874, MADLIB-875). | |

- Fixed issues related to header files included in the build system | |

(MADLIB-855, MADLIB-879, MADLIB-884). | |

Known issues: | |

- Performance for decision tree with cross-validation is poor on a HAWQ | |

multi-node system. | |

-------------------------------------------------------------------------------- | |

MADlib v1.6 | |

Release Date: 2014-June-30 | |

New features: | |

- Added a new unified 'margins' function that computes marginal effects for | |

linear, logistic, multilogistic, and cox proportional hazards regression. The | |

new function also introduces support for interaction terms in the independent | |

array. | |

- Updated convergence for 'elastic_net_train' by checking the change in the | |

loglikelihood instead of the l2-norm of the change in coefficients. This allows | |

for faster convergence in problems with multiple optimal solutions. | |

The default threshold for convergence has been reduced from 1e-4 to 1e-6. | |

- Added a new helper function to convert categorical variables to indicator | |

variables which can be used directly in regression methods. The function | |

currently only supports dummy encoding. | |

- Improved performance for cox proportional hazards: average improvement of | |

20 fold on GPDB and 2.5 fold on HAWQ. | |

- Improved performance on ARIMA by 30%. | |

- Added new functionality to export linear and logistic regression models as a | |

PMML object. The new module relies on PyXB to create PMML elements. | |

- Added a function ('array_scalar_add') to 'add' a scalar to an array. | |

- Added 'numeric' type support for all functions that take 'anyarray' as | |

argument. | |

- Made usability and aesthetic enhancements to documentation. | |
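
Illustration for the 'elastic_net_train' convergence entry above: a minimal sketch of why a log-likelihood test can stop earlier than a coefficient test when several coefficient vectors are equally good. It assumes an absolute-change test and a Gaussian log-likelihood with unit variance; the helper names are hypothetical and the exact criterion used by MADlib may differ.

```python
from math import sqrt


def gaussian_loglik(coef, rows, y):
    """Gaussian log-likelihood up to an additive constant (unit variance)."""
    total = 0.0
    for x, target in zip(rows, y):
        prediction = sum(c * xi for c, xi in zip(coef, x))
        total -= 0.5 * (target - prediction) ** 2
    return total


def converged_by_loglik(prev_ll, curr_ll, tolerance=1e-6):
    """Stop when the log-likelihood barely changes between iterations."""
    return abs(curr_ll - prev_ll) < tolerance


def converged_by_coef(prev_coef, curr_coef, tolerance=1e-6):
    """Stop when the l2-norm of the coefficient change is small."""
    change = sqrt(sum((a - b) ** 2 for a, b in zip(prev_coef, curr_coef)))
    return change < tolerance


# Two identical feature columns: many coefficient vectors fit equally well.
rows = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
y = [1.1, 1.9, 3.2]
coef_a, coef_b = (1.0, 0.0), (0.5, 0.5)
ll_a = gaussian_loglik(coef_a, rows, y)
ll_b = gaussian_loglik(coef_b, rows, y)
```

Here coef_a and coef_b produce identical predictions, so the log-likelihood test reports convergence even though the coefficient test does not.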

Bug Fixes: | |

- Prepended python module name to sys.path before executing madlib function | |

to avoid conflicts with user-defined modules. | |

- Added a check in K-Means to ensure that the dimensionality of all data points is | |

the same and also equal to the dimensionality of any provided initial centroids | |

(MADLIB-713, MADLIB-789). | |

- Added a check in multinomial regression to quit early and cleanly if model | |

size is greater than the maximum permissible memory (MADLIB-667). | |

- Fixed a minor bug with incorrect column names in the decision trees module | |

(MADLIB-763). | |

- Fixed a bug in Kmeans that resulted in incorrect number of centroids for | |

particular datasets (MADLIB-857). | |

- Fixed bug when grouping columns have same name as one of the output table | |

column names (MADLIB-833). | |

Deprecated Functions: | |

- Modules profile and quantile have been deprecated in favor of the 'summary' | |

function. | |

- Module 'svd_mf' has been deprecated in favor of the improved 'svd' function. | |

- Functions 'margins_logregr' and 'margins_mlogregr' have been deprecated in | |

favor of the 'margins' function. | |

-------------------------------------------------------------------------------- | |

MADlib v1.5 | |

Release Date: 2014-Mar-05 | |

New features: | |

- Added a new port 'HAWQ'. MADlib can now be used with the Pivotal | |

Distribution of Hadoop (PHD) through HAWQ | |

(see http://www.gopivotal.com/big-data/pivotal-hd for more details). | |

- Implemented performance improvements for linear and logistic predict functions. | |

- Moved Conditional Random Fields (CRFs) out of early stage development, and | |

updated the design and APIs to enable ease of use and better functionality. | |

API changes include lincrf replaced by lincrf_train, crf_train_fgen and | |

crf_test_fgen with updated arguments, and format of segment tables. | |

- Improved linear support vector machines (SVMs) by enabling iterations, and | |

removed lsvm_predict and svm_predict, which are not useful in GPDB and HAWQ. | |

- Added new functions, with improved performance compared to svec_sfv, for | |

document vectorization into sparse vectors. | |

- Removed the bool-to-text cast and updated all functions depending on it to | |

explicitly convert variable to text. | |

- Added function properties for all SQL functions to allow the database optimizer | |

to make better plans. | |

Bug Fixes: | |

- Set client_min_messages to 'notice' during database installation to ensure | |

that log messages don't get logged to STDERR. | |

- Fixed elastic net prediction to predict using all features instead of just | |

the selected features to avoid an error when no feature is selected as relevant | |

in the trained model. | |

- Fixed the quantile values for the corner probability values p=0 and p=1 in | |

bernoulli and binomial distributions: they are 0 and num_of_trials (=1 in the | |

case of bernoulli) respectively, independent of the probability of success. | |

- Changed install script to explicitly use /bin/bash instead of /bin/sh to avoid | |

problems in Ubuntu where /bin/sh is linked to 'dash'. | |

- Fixed issue in Elastic Net to take any array expression as input instead of | |

specifically expecting the expression 'ARRAY[...]'. | |

- Fixed wrong output in percentile of count-min (CM) sketches. | |
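
Illustration for the bernoulli and binomial quantile entry above: a minimal sketch of a binomial quantile function that handles the corner probabilities p=0 and p=1 explicitly, as the entry describes. The function name and argument order are hypothetical, not the MADlib signature.

```python
from math import comb


def binomial_quantile(p, num_of_trials, success_prob):
    """Smallest k whose cumulative binomial probability is at least p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    # Corner cases do not depend on the probability of success.
    if p == 0.0:
        return 0
    if p == 1.0:
        return num_of_trials
    cumulative = 0.0
    for k in range(num_of_trials + 1):
        cumulative += (comb(num_of_trials, k)
                       * success_prob ** k
                       * (1.0 - success_prob) ** (num_of_trials - k))
        if cumulative >= p:
            return k
    # Guard against floating-point round-off in the running sum.
    return num_of_trials
```

A bernoulli distribution is the special case num_of_trials=1, so binomial_quantile(1.0, 1, 0.9) returns 1 and binomial_quantile(0.0, 1, 0.9) returns 0.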

Known issues: | |

- Elastic net prediction wrapper function elastic_net_prediction is not | |

available in HAWQ. Instead, prediction functionality is available for both | |

families via elastic_net_gaussian_predict and elastic_net_binomial_predict. | |

- Distance metrics functions in K-Means for the HAWQ port are restricted to the | |

in-built functions, specifically squaredDistNorm2, distNorm2, distNorm1, | |

distAngle, and distTanimoto. | |

- Functions in Quantile and Profile modules of Early Stage Development are not | |

available in HAWQ. Replacement of these functions is available as built-in | |

functions (percentile_cont) in HAWQ and Summary module in MADlib, respectively. | |
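For readers unfamiliar with the built-in K-Means distance metrics named above, the following Python sketch shows the conventional textbook definitions that those names suggest. The Python function names are hypothetical, and the formulas (in particular for the angle and Tanimoto distances) are assumptions rather than a transcription of the MADlib source.

```python
import math


def _dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def dist_norm1(x, y):
    """L1 (Manhattan) distance."""
    return sum(abs(a - b) for a, b in zip(x, y))


def squared_dist_norm2(x, y):
    """Squared L2 (Euclidean) distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def dist_norm2(x, y):
    """L2 (Euclidean) distance."""
    return math.sqrt(squared_dist_norm2(x, y))


def dist_angle(x, y):
    """Angle between two non-zero vectors, in radians."""
    cosine = _dot(x, y) / math.sqrt(_dot(x, x) * _dot(y, y))
    # Clamp to [-1, 1] so round-off cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, cosine)))


def dist_tanimoto(x, y):
    """One minus the Tanimoto similarity of two non-zero vectors."""
    xy = _dot(x, y)
    return 1.0 - xy / (_dot(x, x) + _dot(y, y) - xy)
```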

-------------------------------------------------------------------------------- | |

MADlib v1.4.1 | |

Release Date: 2013-Dec-13 | |

Bug Fixes: | |

- Fixed problem in Elastic Net for 'binomial' family if an 'integer' column was | |

passed for dependent variable instead of a 'boolean' column. | |

- '*' support in Elastic Net lacked checks for the columns being combined. Now | |

we check if the column for '*' is already an array, in which case we don't wrap | |

it with an 'array' modifier. If there are multiple columns we check that they | |

are of the same numeric type before building an array. | |

- Fixed a software regression in Robust Variance, Clustered Variance and | |

Marginal Effects for multinomial regression introduced in v1.4 when | |

output table name is schema-qualified. | |

- We now also support schema-qualified output table prefixes for SVD and PCA. | |

- Added warning message when deprecated functions are run. Also added a list of | |

deprecated functions in the ReadMe. | |

- Added a Markdown Readme along with the text version for better rendering on | |

GitHub. | |

-------------------------------------------------------------------------------- | |

MADlib v1.4 | |

Release Date: 2013-Nov-25 | |

New Features: | |

* Improved interface for Multinomial logistic regression: | |

- Added a new interface that accepts an 'output_table' parameter and | |

stores the model details in the output table instead of returning as a struct | |

data type. The updated function also builds a summary table that includes | |

all parameters and meta-parameters used during model training. | |

- The output table has been reformatted to present the model coefficients | |

and related metrics for each category in a separate row. This replaces the | |

old output format of model stats for all categories combined in a | |

single array. | |

* Variance Estimators | |

- Added Robust Variance estimator for Cox PH models (Lin and Wei, 1989). | |

It is useful in calculating variances in a dataset with potentially | |

noisy outliers. Namely, the standard errors are asymptotically normal even | |

if the model is wrong due to outliers. | |

- Added Clustered Variance estimator for Cox PH models. It is used | |

when data contains extra clustering information besides covariates, and | |

it produces asymptotically normal estimates. | |

* NULL Handling: | |

- Modified behavior of regression modules to 'omit' rows containing NULL | |

values for any of the dependent and independent variables. The number of | |

rows skipped is provided as part of the output table. | |

This release includes NULL handling for the following modules: | |

- Linear, Logistic, and Multinomial logistic regression, as well as | |

Cox Proportional Hazards | |

- Huber-White sandwich estimators for linear, logistic, and multinomial | |

logistic regression as well as Cox Proportional Hazards | |

- Clustered variance estimators for linear, logistic, and multinomial | |

logistic regression as well as Cox Proportional Hazards | |

- Marginal effects for logistic and multinomial logistic regression | |
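The 'omit' behavior described above, where rows with a NULL in any dependent or independent variable are skipped and the number of skipped rows is reported, can be sketched in Python as follows. The helper `omit_null_rows` and the dictionary-based row format are illustrative assumptions; in MADlib the filtering happens inside the database.

```python
def omit_null_rows(rows, columns):
    """Drop rows that have None in any of `columns`.

    Returns (kept_rows, num_rows_skipped).
    """
    kept = [row for row in rows
            if all(row[col] is not None for col in columns)]
    return kept, len(rows) - len(kept)


# Hypothetical training data: y is the dependent variable, x1 and x2 the
# independent variables.
data = [
    {"y": 1.0, "x1": 0.5, "x2": 2.0},
    {"y": None, "x1": 0.1, "x2": 1.0},   # skipped: NULL dependent variable
    {"y": 0.0, "x1": None, "x2": 3.0},   # skipped: NULL independent variable
    {"y": 1.0, "x1": 0.9, "x2": 4.0},
]
kept_rows, num_rows_skipped = omit_null_rows(data, ["y", "x1", "x2"])
```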

Deprecated functions: | |

- Multinomial logistic regression function has been renamed to | |

'mlogregr_train'. Old function ('mlogregr') has been deprecated, | |

and will be removed in the next major version update. | |

- For all multinomial regression estimator functions (list given below), | |

changes in the argument list were made to collate all optimizer specific | |

arguments in a single string. An example of the new optimizer parameter is | |

'max_iter=20, optimizer=irls, precision=0.0001'. | |

This is in contrast to the original argument list that contained 3 arguments: | |

'max_iter', 'optimizer', and 'precision'. This change allows adding new | |

optimizer-specific parameters without changing the argument list. | |

Affected functions: | |

- robust_variance_mlogregr | |

- clustered_variance_mlogregr | |

- margins_mlogregr | |
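To illustrate why a single optimizer-parameter string lets new parameters be added without changing the argument list, here is a minimal Python sketch of parsing such a string. The function `parse_optimizer_params` is hypothetical and does not reproduce MADlib's own parser; values are kept as strings because the sketch makes no assumption about each parameter's type.

```python
def parse_optimizer_params(param_str):
    """Split 'key=value, key=value' into a dict of stripped strings."""
    params = {}
    for item in param_str.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty input and trailing commas
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError("expected key=value, got %r" % item)
        params[key.strip()] = value.strip()
    return params


params = parse_optimizer_params("max_iter=20, optimizer=irls, precision=0.0001")
```

A parameter introduced later would simply appear as another key in the resulting dictionary, leaving the calling signature unchanged.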

Bug Fixes: | |

- Fixed an overflow problem in LDA by using INT64 instead of INT32. | |

- Fixed integer to boolean cast bug in clustered variance for logistic | |

regression. After this fix, integer columns are accepted for binary | |

dependent variable using the 'integer to bool' cast rules. | |

- Fixed two bugs in SVD: | |

- The 'example' option for online help has been fixed | |

- Column names for sparse input tables in the 'svd_sparse' and | |

'svd_sparse_native' functions are no longer restricted to 'row_id', | |

'col_id' and 'value'. | |

-------------------------------------------------------------------------------- | |

MADlib v1.3 | |

Release Date: 2013-October-03 | |

New Features: | |

* Cox Proportional Hazards: | |

- Added stratification support for Cox PH models. Stratification is used as | |

shorthand for building a Cox model that allows for more than one stratum, | |

and hence, allows for more than one baseline hazard function. | |

Stratification provides two pieces of key, flexible functionality for the | |

end user of Cox models: | |

-- Allows a categorical variable Z to be appropriately accounted for in | |

the model without estimating its predictive impact on the response | |

variable. | |

-- Categorical variable Z is predictive/associated with the response | |

variable, but Z may not satisfy the proportional hazards assumption | |

- Added a new function (cox_zph) that tests the proportional hazards | |

assumption of a Cox model. This allows the user to build Cox models and then | |

verify the relevance of the model. | |

* NULL Handling: | |

- Modified behavior of linear and logistic regression to 'omit' rows | |

containing NULL values for any of the dependent and independent variables. | |

The number of rows skipped is provided as part of the output table. | |

Deprecated functions: | |

- Cox Proportional Hazard function has been renamed to 'coxph_train'. | |

Old function names ('cox_prop_hazards' and 'cox_prop_hazards_regr') | |

have been deprecated, and will be removed in the next major version update. | |

- The aggregate form of linear regression ('linregr') has been deprecated. | |

The stored-procedure form ('linregr_train') should be used instead. | |

Bug Fixes: | |

- Fixed a memory leak in the Apriori algorithm. | |

-------------------------------------------------------------------------------- | |

MADlib v1.2 | |

Release Date: 2013-September-06 | |

New Features: | |

* ARIMA Timeseries modeling | |

- Added auto-regressive integrated moving average (ARIMA) modeling for | |

non-seasonal, univariate timeseries data. | |

- Module includes a training function to compute an ARIMA model and a | |

forecasting function to predict future values in the timeseries | |

- Training function employs the Levenberg-Marquardt algorithm (LMA) to | |

compute a numerical solution for the parameters of the model. The | |

observations and innovations for time before the first timestamp | |

are assumed to be zero leading to minimization of the conditional sum of | |

squares. This produces estimates referred to as conditional maximum likelihood | |

estimates (also referred to as 'CSS' in some statistical packages). | |
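The conditional sum of squares mentioned above can be sketched in Python for an ARMA(p, q) model. This is an illustration under stated assumptions, not MADlib's implementation: the series is taken to be already differenced and zero-mean, the function name is hypothetical, and only the objective is evaluated (an optimizer such as Levenberg-Marquardt would then minimize it over the coefficients).

```python
def conditional_sum_of_squares(series, ar_coef, ma_coef):
    """CSS of an ARMA(p, q) model.

    Observations and innovations before the first timestamp are treated
    as zero, so terms with a negative time index are simply dropped.
    """
    innovations = []
    for t, x_t in enumerate(series):
        prediction = 0.0
        for i, phi in enumerate(ar_coef, start=1):
            if t - i >= 0:
                prediction += phi * series[t - i]
        for j, theta in enumerate(ma_coef, start=1):
            if t - j >= 0:
                prediction += theta * innovations[t - j]
        innovations.append(x_t - prediction)
    return sum(e * e for e in innovations)
```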

* Documentation updates: | |

- Introduced a new format for documentation improving usability. | |

- Upgraded to Doxygen v1.8.4. | |

- Updated documentation improving consistency for multiple modules including | |

Regression methods, SVD, PCA, Summary function, and Linear systems. | |

Bug fixes: | |

- Out-of-bounds access of an 'svec' is now checked even if the size of the svec is zero. | |

- Fixed a minor bug allowing use of GCC 4.7 and higher to build from source. | |

-------------------------------------------------------------------------------- | |

MADlib v1.1 | |

Release Date: 2013-August-09 | |

New Features: | |

* Singular Value Decomposition: | |

- Added Singular Value Decomposition using the Lanczos bidiagonalization | |

iterative method to decompose the original matrix into PBQ^T, where B is | |

a bidiagonalized matrix. We assume that the original matrix is too big to | |

load into memory but B can be loaded into the memory. B is then further | |

decomposed into XSY^T using Eigen's JacobiSVD function. This restricts the | |

number of features in the data matrix to about 5000. | |

- This implementation provides SVD (for dense matrix), SVD_BLOCK (also for | |

dense matrix but faster), SVD_SPARSE (convert a sparse matrix into a | |

dense one, slower) and SVD_SPARSE_NATIVE (directly operate on the sparse | |

matrix, much faster for really sparse matrices). | |

* Principal Component Analysis: | |

- Added a PCA training function that generates the top-K principal | |

components for an input matrix. The original data is mean-centered by the | |

function with the mean matrix returned by the function as a separate table. | |

- The module also includes the projection function that projects a test data | |

set to the principal components returned by the train function. | |

* Linear Systems: | |

- Added a module to solve linear systems of equations (Ax = b). | |

- The module utilizes various direct methods from the Eigen library for | |

dense systems. Given below is a summary of the methods (more details at | |

http://eigen.tuxfamily.org/dox-devel/group__TutorialLinearAlgebra.html): | |

- Householder QR | |

- Partial Pivoting LU | |

- Full Pivoting LU | |

- Column Pivoting Householder QR | |

- Full Pivoting Householder QR | |

- Standard Cholesky decomposition (LLT) | |

- Robust Cholesky decomposition (LDLT) | |

- The module also includes direct and iterative methods for sparse linear | |

systems: | |

Direct: | |

- Standard Cholesky decomposition (LLT) | |

- Robust Cholesky decomposition (LDLT) | |

Iterative: | |

- In-memory Conjugate gradient | |

- In-memory Conjugate gradient with diagonal preconditioners | |

- In-memory Bi-conjugate gradient | |

- In-memory Bi-conjugate gradient with incomplete LU preconditioners | |
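As an illustration of the iterative solvers listed above, here is a minimal in-memory conjugate gradient sketch in pure Python for Ax = b. It assumes A is a symmetric positive-definite matrix stored as a dense list of lists; the function name and default tolerances are hypothetical, and the actual module is not reproduced here.

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A, without preconditioning."""
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    def matvec(M, v):
        return [dot(row, v) for row in M]

    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the initial guess x = 0
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break        # residual norm is small enough
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


solution = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

The diagonally preconditioned variant differs only in scaling the residual by the inverse diagonal of A before forming each search direction.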

Bug fixes and other changes: | |

* Robust input validation: | |

- Validation of input parameters to various functions has been improved to | |

ensure that it does not fail if double quotes are included as part of the | |

table name. | |

* Random Forest | |

- The ID field in rf_train has been expanded from INT to BIGINT (MADLIB-764) | |

* Various documentation updates: | |

- Documentation updated for various modules including elastic net, linear | |

and logistic regression. | |

-------------------------------------------------------------------------------- | |

MADlib v1.0 | |

Release Date: 2013-July-03 | |

New Features: | |

* Cox Proportional Hazards: | |

- Added Right Censoring support for Cox Prop Hazards | |

* Robust Variance Tests - Huber White: | |

- Added a method of calculating robust variance statistic by utilizing the | |

Huber-White sandwich estimator for linear regression, logistic regression, | |

and multinomial logistic regression | |

- Robust variance for linear and logistic regression also includes | |

grouping support | |

* Clustered Sandwich Estimators: | |

- Added clustered robust variance statistic by utilizing a clustered sandwich | |

estimator for linear regression, logistic regression, and multinomial | |

logistic regression | |

- Grouping is currently not implemented for clustered variance and the | |

parameter is only a placeholder at present | |

* Marginal Effects Estimator: | |

- Added a method for computing the marginal effects for logistic regression | |

and multinomial logistic regression | |

- Grouping is currently not implemented for marginal effects and the | |

parameter is only a placeholder at present | |

* Multinomial logistic regression: | |

- Added a parameter in multinomial logistic regression, to enable picking | |

the reference category. Input for number of categories has been removed | |

due to redundancy | |

* Linear regression: | |

- Updated grouping columns to input as a comma delimited string rather | |

than as an array | |

- Resolved an issue with highly collinear data to produce results consistent | |

with other statistical packages. Threshold on condition number to use an | |

approximation for computing the pseudo-inverse was increased. | |

* Logistic regression: | |

- Changed behavior to error-out if the output table already exists | |

Bug fixes: | |

* Summary: | |

- Summary function (when used with quartiles) used high memory when the number | |

of columns is large. This has been fixed by computing quartiles in an | |

iterative manner for a fixed number of columns (Pivotal-170) | |

- Fixed a problem with incorrect number of rows returned for Summary when | |

all values in a column are NULL (Pivotal-171) | |

-------------------------------------------------------------------------------- | |

MADlib v0.7 | |

Release Date: 2013-May-01 | |

New Features: | |

* Correlation function: | |

- Function to compute Pearson's cross-correlation for numeric columns in a | |

relational table | |
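For reference, Pearson's correlation coefficient between two numeric columns can be computed as in the following Python sketch. The function name is hypothetical, and the sketch works on two in-memory lists rather than on a relational table as the MADlib function does.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)
```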

* Upgrade capability: | |

- All new versions since v0.7 are installed in a version-specific folder | |

(/usr/local/madlib/Versions/) | |

- Upgrade from v0.5/v0.6 to v0.7 on the database is now supported without | |

uninstalling previous MADlib database installation. | |

- Dependencies on updated functions, types, and other operators are caught | |

and upgrade is aborted with an appropriate message | |

Bug fixes: | |

* Linear Regression: | |

- Improved matrix inversion method to compute coefficients comparable to R | |

for regression problems with high multicollinearity (MADLIB-790) | |

* Logistic Regression: | |

- Fixed a problem in logistic regression with grouping on 'text' datatype | |

columns (MADLIB-791) | |

Known issues: | |

* Upgrade: | |

- Views dependent on MADlib functions being updated will be dropped during | |

the upgrade and restored after finishing upgrade. If upgrade fails for | |

any reason, these views and the original MADlib schema will *not* be | |

restored. Before initiating upgrade, we recommend backing up the MADlib | |

schema, moving all views dependent on MADlib to a separate schema, and | |

performing the backup with: | |

pg_dump -n 'schema_name' | |

- Upgrade is currently not supported for the PostgreSQL platform and will | |

abort with an error | |

- Upgrade currently does not detect functions defined by the user that | |

depend upon MADlib functions. Semantic/API changes to these MADlib | |

functions could lead to undefined results in such user-defined functions | |

- Some important changes for the upgrade from v0.5 to v0.7 are given below | |

(Upgrade will raise an error and abort if there exist user-defined views | |

that depend on these changes. User-defined functions are not validated | |

with this check. An aborted upgrade does not affect the installed version | |

of MADlib.) | |

-- Logistic regression renamed from 'logregr' to 'logregr_train' | |

-- All internal and external aggregates in logistic regression | |

have been updated | |

-- PLDA module replaced with a refactored LDA module. Due to the | |

renaming all functions using PLDA need to be updated | |

-- Updated MADlib types: | |

logregr_result, plda_topics_t, plda_word_distrn, | |

plda_word_weight | |

-------------------------------------------------------------------------------- | |

MADlib v0.6 | |

Release Date: 2013-Apr-01 | |

New Features / Improvements: | |

* Generic cross-validation: | |

- Support for k-fold cross-validation of any supervised learning | |

algorithm | |
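The idea behind k-fold cross-validation can be sketched in Python as follows. This is a generic illustration with a hypothetical function name; it only produces the row-index splits and assumes nothing about how the MADlib module invokes the training, prediction, and error-metric functions.

```python
def k_fold_splits(n_rows, k):
    """Yield (training_indices, validation_indices) for each of k folds."""
    if not 1 < k <= n_rows:
        raise ValueError("k must satisfy 1 < k <= n_rows")
    indices = list(range(n_rows))
    # Round-robin assignment keeps fold sizes within one row of each other.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx
                    for j, fold in enumerate(folds) if j != i
                    for idx in fold]
        yield training, validation


splits = list(k_fold_splits(6, 3))
```

Each row appears in exactly one validation fold, so every row is used for validation once and for training k - 1 times.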

* Heteroskedasticity of linear regression | |

- Support for calculating heteroskedasticity via Breusch-Pagan test | |

* Grouping support for linear regression | |

- Support for linear regression on each group of data grouped by | |

one or multiple columns | |

* Grouping support for logistic regression | |

- Refactor of logistic regression code | |

- Support for logistic regression on each group of data grouped by | |

one or multiple columns | |

- Grouping support is added to the convex optimization framework | |

* LDA: | |

- Improved performance and scalability (MADLIB-480) | |

* Elastic net regularization for both linear and logistic regressions | |

- Support FISTA and IGD optimizers | |

* Summary function | |

- Support for an overview of data table | |

* Eigen package upgrade | |

- Now Eigen 3.1.2 is used by MADlib v0.6 | |

* Unit testing framework: | |

- A new unit testing framework is added for C++ abstraction layer | |

Bug Fixes: | |

* C++ abstraction layer: | |

- Improved handling of NULL values in the input array (MADLIB-773) | |

* Naive Bayes: | |

- Improved the handling of NULL values. (MADLIB-749) | |

Known Issues: | |

* K-means: | |

- K-means crashes on some datasets, when the dimensionality of the points | |

is not uniform on the data set. (MADLIB-789) | |

* Distribution Functions: | |

- Certain quantile functions will abort their session on invalid input | |

(MADLIB-786) | |

* Multinomial Logistic Regression: | |

- Signs of coefficient outputs are inconsistent with other tools like R and | |

Stata (MADLIB-785) | |

-------------------------------------------------------------------------------- | |

MADlib v0.5 | |

Release Date: 2012-Nov-15 | |

Bug Fixes: | |

* K-means: | |

- Improved handling of invalid arguments (MADLIB-359, 361) | |

* Sketch-based estimators: | |

- Addressed security vulnerability (MADLIB-630) | |

New Features / Improvements: | |

* Association Rules (Apriori): | |

- Improved reporting output format for better usability (MADLIB-411) | |

- Significant improvement in performance (MADLIB-638) | |

* C++ (Database) Abstraction Layer: | |

- Extension to support modular transition states (MADLIB-499) | |

- Extension to support functions returning set of values (MADLIB-638) | |

* Conditional Random fields: | |

- Support for Linear Chain Conditional Random Fields for NLP (MADLIB-628) | |

* Decision Tree: | |

- Improved performance for C4.5 and Random forests (MADLIB-605) | |

- Improved encoding (MADLIB-590) | |

* Infrastructure: | |

- Convex optimization framework | |

* K-means: | |

- Code refactoring and improved performance | |

(MADLIB-454, MADLIB-522, MADLIB-678) | |

- Silhouette function for k-means (MADLIB-681) | |

* Low-rank Matrix Factorization | |

- New module | |

* Logistic Regression: | |

- Support for Multinomial Logistic Regression (MADLIB-575) | |

* Naive Bayes | |

- Significant improvement in performance (MADLIB-611, 619, 626) | |

* Regression Analysis: | |

- Support for Cox Proportional Hazards test (MADLIB-576) | |

* Sampling | |

- Added weighted sampling of a single row (MADLIB-584) | |
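Weighted sampling of a single row can be done in one pass over the data, as in the following Python sketch. The function name and the (row, weight) input format are hypothetical, and the single-pass replacement scheme shown is a standard technique assumed for illustration, not a statement of how the MADlib aggregate is implemented.

```python
import random


def weighted_sample(rows_with_weights, rng=random):
    """Pick one row with probability proportional to its weight.

    Each new row replaces the current choice with probability
    weight / (sum of weights seen so far), so no second pass is needed.
    """
    chosen = None
    total = 0.0
    for row, weight in rows_with_weights:
        if weight <= 0:
            continue  # rows with non-positive weight are never selected
        total += weight
        if rng.random() < weight / total:
            chosen = row
    return chosen
```

The first row with positive weight is always accepted, because `rng.random()` returns a value strictly below 1.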

* SVD Matrix Factorization: | |

- Improved performance (MADLIB-578) | |

Documentation: | |

* Conditional Random Fields: | |

- Example added for CRF module (MADLIB-731) | |

* SVD Matrix Factorization: | |

- Incremental-gradient SVD algorithm (MADLIB-572) | |

Known issues: | |

* Multinomial Logistic Regression: | |

- Number of independent variables cannot exceed 65535 (MADLIB-665) | |

* Naive Bayes: | |

- Current implementation of Naive Bayes is only suitable for | |

categorical attributes (MADLIB-679) | |

- NULL input values not accepted for attributes (MADLIB-614) | |

- NULL probabilities given for test set values not seen in | |

training set (MADLIB-523) | |

-------------------------------------------------------------------------------- | |

MADlib v0.4.1 | |

Release Date: 2012-Aug-9 | |

Bug Fixes: | |

* PGXN: | |

- Fixed installation problem that could occur on some platforms (MADLIB-589) | |

New Features/Improvements: | |

* C++ Abstraction Layer: | |

- Increased ABI compatibility across multiple Greenplum versions | |

(MADLIB-606) | |

* Hypothesis Tests: | |

- Tests that are not implemented as ordered aggregates are now also | |

installed on PostgreSQL 8.4 and Greenplum 4.0. | |

-------------------------------------------------------------------------------- | |

MADlib v0.4 | |

Release Date: 2012-Jun-18 | |

Bug Fixes: | |

* Association Rules: | |

- assoc_rules() now uses schema-qualified function calls (MADLIB-435) | |

* Decision Trees: | |

- Enhanced correctness (MADLIB-409, 502, 503) | |

- Improved handling of invalid arguments (MADLIB-331) | |

* k-Means: | |

- Improved handling of invalid arguments (MADLIB-336, 364, 459) | |

* PLDA: | |

- Improved robustness (MADLIB-474) | |

* Sparse Vectors: | |

- svec_sfv() now uses locale-aware sorting (MADLIB-457) | |

- Operators now install to MADlib schema (MADLIB-470) | |

New Features/Improvements: | |

* C++ Abstraction Layer: | |

- Support for "function pointers" (MADLIB-370) | |

- Support for sparse vectors (MADLIB-371) | |

- Support for more Eigen (linear algebra) types (MADLIB-533) | |

* Decision Trees: | |

- Code refactoring and optimization (MADLIB-410, 476, 504, 509) | |

- Documentation improvements (MADLIB-507) | |

- Output table now contains unencoded information (MADLIB-434) | |

- Enhance the missing value handling for continuous features (MADLIB-493) | |

* Hypothesis Tests: | |

- Pearson chi-square test (MADLIB-390) | |

- One- and two-sample t-Tests (MADLIB-391) | |

- F-test (MADLIB-392) | |

- Mann-Whitney U-test (MADLIB-393) | |

- Kolmogorov-Smirnov test (MADLIB-394) | |

- Wilcoxon-Signed-Rank test (MADLIB-405) | |

- One-way ANOVA (MADLIB-406) | |

* PostgreSQL Extensibility: | |

- Support for CREATE EXTENSION in PostgreSQL >= 9.1 (MADLIB-316) | |

- Availability on PGXN (MADLIB-334) | |

* Probability Functions: | |

- Wrap all distribution functions implemented by Boost (MADLIB-412) | |

- Wrap Kolmogorov distribution function from CERN ROOT project (MADLIB-413) | |

* Random Forests: | |

- New module (MADLIB-419) | |

* Support: | |

- Add elementary matrix/vector functions (e.g., norm/distances etc.) | |

(MADLIB-532) | |

* Viterbi Feature Extraction: | |

- New module (MADLIB-478) | |

Known issues: | |

- svec_sfv() does not support collations, as introduced with PostgreSQL 9.1 | |

(MADLIB-558) | |

- Invalid arguments are not always guaranteed to be handled gracefully and | |

may lead to confusing error messages (MADLIB-28, 359, 361, 363) | |

-------------------------------------------------------------------------------- | |

MADlib v0.3 | |

Release Date: 2012-Feb-9 | |

New features: | |

* Installer: | |

- Single installer package targeting all supported DBMSs per OS (MADLIB-218) | |

* C++ Abstraction Layer: | |

- Switched from using Armadillo to using Eigen for linear-algebra | |

operations, thereby eliminating the dependency on LAPACK/BLAS (MADLIB-275) | |

- Reimplemented as a template library for performance improvements | |

(MADLIB-295) | |

* Decision Trees: | |

- Major update | |

- Now supports multiple split criteria (information gain, gini, gain ratio) | |

- Now supports tree pruning using a validation set to address overfitting | |

- Now supports additional functions for tree output | |

- Now supports continuous features in addition to categorical features | |

- Additional support for handling null values | |

- Improved scalability and performance | |

* k-Means Clustering: | |

- Now handles any input that is convertible to SVEC. (MADLIB-42) | |

- Multiple distance functions (L1-norm, L2-norm, cosine similarity, Tanimoto | |

similarity) (MADLIB-43) | |

- Supports multiple seeding methods (kmeans++, random, user-specified list | |

of centroids) | |

- Replaced goodness of fit with the (simplified) Silhouette coefficient | |

(MADLIB-45) | |

- New run-time parameters (MADLIB-47) | |
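A simplified Silhouette coefficient is commonly defined using distances to centroids rather than to all other points. The Python sketch below follows that common definition with Euclidean distance; the function name is hypothetical, and both the definition and the choice of distance are assumptions that may differ from what this release implements.

```python
import math


def simplified_silhouette(points, centroids):
    """Mean of (b - a) / max(a, b) over all points.

    a is the distance to the closest centroid, b the distance to the
    second-closest centroid.
    """
    if len(centroids) < 2:
        raise ValueError("at least two centroids are required")
    if not points:
        raise ValueError("at least one point is required")

    def dist(x, y):
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    scores = []
    for point in points:
        distances = sorted(dist(point, c) for c in centroids)
        a, b = distances[0], distances[1]
        # a <= b by construction, so max(a, b) is b.
        scores.append(0.0 if b == 0 else (b - a) / b)
    return sum(scores) / len(scores)
```

Values close to 1 indicate points that sit much nearer to their own centroid than to any other; values close to 0 indicate points near a cluster boundary.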

* Linear Regression: | |

- Major speed improvement | |

* Logistic Regression: | |

- Major speed improvement | |

- Now handles any input that is convertible to BOOLEAN (dependent variable) | |

or DOUBLE PRECISION[] (independent variables). (MADLIB-283) | |

- An under-/overflow safe version to evaluate the (usual) logistic function, | |

for scoring logistic regression (MADLIB-271) | |

- A third optimizer: Incremental-gradient-descent (MADLIB-303) | |
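An under-/overflow-safe evaluation of the logistic function can be written by never exponentiating a positive argument. The Python sketch below shows one standard way to do this; it is an illustration under that assumption, not the code added in this release.

```python
import math


def logistic(x):
    """Evaluate 1 / (1 + exp(-x)) without overflowing for large |x|."""
    if x >= 0:
        # exp(-x) lies in (0, 1]; it may underflow to 0.0 but cannot overflow.
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(x) lies in (0, 1), so this form is safe.
    z = math.exp(x)
    return z / (1.0 + z)
```

A naive `1 / (1 + math.exp(-x))` raises `OverflowError` in Python for a strongly negative input such as `x = -1000`, whereas the branch above returns 0.0.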

* Support: | |

- For Greenplum <= 4.2.0, added a workaround for INSERT INTO in the same way | |

as the existing CREATE TABLE AS workaround. This workaround is not needed | |

in Greenplum >= 4.2.1 any more. (MADLIB-265) | |

- Function version() returns MADlib build information (MADLIB-309) | |

Bug fixes: | |

* Sparse vectors: | |

- Fixed sparse-vector type cast problems (MADLIB-282, MADLIB-305) | |

- Fixed a situation where using svec_sfv() could cause a segmentation fault | |

(MADLIB-350) | |

- Increased compatibility with internal PostgreSQL conventions (MADLIB-257) | |

* Logistic regression: | |

- Handle numerical instability more gracefully (MADLIB-343, MADLIB-345) | |

- Handle unexpected inputs more gracefully (MADLIB-284, MADLIB-344) | |

- Fixed "Random variate x is nan, but must be finite" issue (MADLIB-356) | |

Known issues: | |

- Decision Trees not supported on Greenplum 4.0 (MADLIB-346, MADLIB-347) | |

- K-means: the error '"nan" does not exist' may be raised when input vectors | |

contain NaN. (MADLIB-364) | |

- Association Rules require the madlib schema to be in the search path | |

(MADLIB-353) | |

- Invalid arguments are not always guaranteed to be handled gracefully and | |

may lead to confusing error messages (MADLIB-28, 336, 359, 361, 363, 364) | |

-------------------------------------------------------------------------------- | |

MADlib v0.2.1beta | |

Release Date: 2011-Sep-14 | |

General changes: | |

* numerous improvements to the C++ abstraction layer: | |

- code clean-up | |

- fixed issue where incorrect values were returned when used with | |

debug builds of PostgreSQL/Greenplum (MADLIB-253) | |

- fixed issue where returning arrays to PostgreSQL/Greenplum could lead | |

to a crash (MADLIB-250) | |

- allocated memory is now 16-byte aligned for improved stability and | |

performance (MADLIB-236) | |

* compiling with advanced warnings enabled by default now | |

* all C/C++ code now free of warnings. On gcc <= 4.6, there might still be | |

warnings due to "unclean" macros in DBMS header files (MADLIB-228) | |

* prepared for Solaris support in a later release (MADLIB-204) | |

- added support for Sun Compiler in CMake build script | |

- fixed all compilation errors with Sun compiler | |

* added UDF to mimic "CREATE TABLE AS ...", as a workaround for a Greenplum | |

issue (MADLIB-241). Included this as GP Compatibility module. | |

* madpack utility: | |

- dropped madpack dependency on PygreSQL (MADLIB-217) | |

- improved security in madpack install-check (MADLIB-229) | |

- fixed bashism in madpack (MADLIB-222) | |

- fixed install-check not running on non-default schema (MADLIB-251) | |

Modules/methods: | |

* SVM (kernel_machines): | |

- fixed cumulative error count in svm_cls_update() function | |

- improved memory management in SVM module | |

* Linear regression (regress): | |

- fixed unexpected behavior for some edge cases (MADLIB-214) | |

- fixed crashing with huge number of independent vars (MADLIB-250) | |

* Logistic regression (regress): | |

- added support for arbitrary expressions for dep./indep. variables, not | |

just column names (MADLIB-255) | |

* Quantile: | |

- fixed quantile() function to be exact | |

- added simple version for small data sets | |

* Sparse Vectors: | |

- added check for sorted dictionary to svec_sfv (MADLIB-187) | |

* Decision Tree (decision_tree): | |

- now can be run multiple times in one session (MADLIB-156) | |

Known issues: | |

* non-unified API for several SQL UDFs (MADLIB-208) | |

* performance of the conjugate-gradient optimizer in logistic regression | |

can be very poor (MADLIB-164) | |

-------------------------------------------------------------------------------- | |

MADlib v0.2.0beta | |

Release Date: 2011-Jul-8 | |

General changes: | |

* new build and installation framework based on CMake | |

* new C++ abstraction layer for easy and secure method development | |

* new database installation utility (madpack) | |

Modules/methods: | |

* new: Association Rules (assoc_rules) | |

* new: Array Operators (array_ops) | |

* new: Decision Tree (decision_tree) | |

* new: Conjugate Gradient (conjugate_gradient) | |

* new: Parallel LDA (plda) | |

* improved: all methods from previous release | |

Known issues: | |

* non-unified API for several SQL UDFs (MADLIB-208) | |

* running decision tree more than once in one session fails (MADLIB-156) | |

* performance of the conjugate-gradient optimizer in logistic regression | |

can be very poor (MADLIB-164) | |

* svec_sfv function doesn't check for sorted dictionary (MADLIB-187) | |

-------------------------------------------------------------------------------- | |

MADlib v0.1.0alpha | |

Release Date: 2011-Jan-31 | |

Initial release. | |

Included modules/methods: | |

* Naive-Bayes Classification (bayes) | |

* k-Means Clustering (kmeans) | |

* Support Vector Machines (kernel_machines) | |

* Sketch-based Estimators (sketch) | |

* Sketch-based Profile (data_profile) | |

* Quantile (quantile) | |

* Linear & Logistic Regression (regress) | |

* SVD Matrix Factorisation (svdmf) | |

* Sparse Vectors (svec) | |

-------------------------------------------------------------------------------- | |

MADlib v0.1.0prerelease | |

Release date: 2011-Jan-25 | |

Demo release. |