| MADlib Release Notes |
| -------------------- |
| |
| These release notes contain the significant changes in each MADlib release, |
| with most recent versions listed at the top. |
| |
| A complete list of changes for each release can be obtained by viewing the git |
| commit history located at https://github.com/apache/madlib/commits/master. |
| |
| Current list of bugs and issues can be found at https://issues.apache.org/jira/browse/MADLIB. |
| --------------------------------------------------------------------------- |
| MADlib v1.15.1: |
| |
| Release Date: 2018-Oct-XX |
| |
| New features: |
| - Add Ubuntu support for MADlib (MADLIB-1256). |
| - Elastic Net: Add grouping by non-numeric column support (MADLIB-1262). |
| - KNN: Accept expressions for point_column_name and test_column_name (MADLIB-1060). |
| - Vec2Cols: Allow arrays of different lengths (MADLIB-1270). |
| - Madpack: Add a script for automating changelist creation. |
| |
| Bug fixes: |
| - Allocator: Remove 16-byte alignment in GPDB 6. |
| - Build: Download compatible Boost if version >= 1.65 (MADLIB-1235). |
| - Build: Remove primary key constraint in IC/DC. |
| - CMake: Fix false positive for Postgres 10+ check. |
| - Graph: Add id of nodes with 0 in-degree (MADLIB-1279). |
| - Margins: Copy summary table instead of renaming (MADLIB-1276). |
| - MLP: Simplify momentum and Nesterov updates (MADLIB-1272). |
| - Upgrade: Fix issue with upgrading RPM to 1.15.1 (MADLIB-1278). |
| - Utilities: Use plpy.quote_ident if available. |
| |
| Others: |
| - Simplify maintenance by removing online examples from SQL functions (MADLIB-1260). |
| - Re-enable PCA and PageRank tests (MADLIB-1264). |
| - Build: Disable AppendOnly if available (MADLIB-1273). |
| - Improve documentation of various modules. |
| |
| |
| |
| --------------------------------------------------------------------------- |
| MADlib v1.15: |
| |
| Release Date: 2018-Aug-15 |
| |
| New features: |
| * MLP: Added momentum and Nesterov's accelerated gradient methods to gradient |
| updates (MADLIB-1210). |
| * New modules: |
| - drop_cols: Create new table from an existing table (CTAS) using an |
| expression of column names (MADLIB-1241). |
| - cols2vec: Create an array from multiple columns (similar to ARRAY[...] |
| with columns obtained using an expression) (MADLIB-1239). |
| - vec2cols: Create multiple columns from an existing array (MADLIB-1240). |
| * Statistics: Added grouping support to correlation and covariance |
| functions (MADLIB-1128). |
| * DT/RF: |
| - Added impurity importance values in DT and RF (MADLIB-1205, 1246, 1249). |
| - Added a new function (get_var_importance) to report importance values |
| in a cleaner interface (MADLIB-925). |
| * Madpack: |
| - Refactored and updated the installation scripts to ensure install, |
| reinstall, install-check are all run from a single SQL file as an atomic |
| operation (MADLIB-1242). |
| - Moved most of install-check operations to a new "dev-check", making |
| install-check smaller and faster to run. |
| - Added new option to run unit-tests (MADLIB-1251, 1252). |
| |
| |
| Bug fixes: |
| - Fixed an ABI issue that prevented compiling MADlib on GCC 5+ |
| (MADLIB-1025). |
| - Decision trees: |
| - Fixed a minor bug that prevented converting a sparse vector to float8[] |
| (MADLIB-1234). |
| - Fixed a bug that led to dependent type being obtained from a NULL |
| value (MADLIB-1233). |
| - Summary table has been updated to ensure correct feature names are |
| populated (MADLIB-1236). |
| - Fixed incorrect indexing of trueChild and falseChild in surrogate |
| agreement calculation. |
| - Removed categorical variable elimination to avoid issues with varying |
| categorical variables for different groups (MADLIB-1258, 1254). |
| - Logregr: Fixed issue where an output table could be empty for grouping |
| (MADLIB-1172). |
| - Added special characters support for multiple modules |
| (MADLIB-1237, 1238, 1243). |
| - Build: Removed invalid symlinks left behind after an uninstall |
| (MADLIB-1175). |
| - Updated SVM to correctly report loss per row instead of total loss. |
| - Refactored internal CV function to fix multiple issues with cross |
| validation on SVM (MADLIB-1250). |
| - Worked around a "cache lookup" issue that prevented dropping of |
| install-check user (MADLIB-1014). |
| - Pagerank: Removed duplicate entries from grouping output |
| (MADLIB-1229, 1253). |
| - Madpack: Install-check user is dropped even after an IC failure |
| (MADLIB-1182). |
| |
| Others: |
| - Removed HAWQ support from all modules |
| |
| --------------------------------------------------------------------------- |
| MADlib v1.14: |
| |
| Release Date: 2018-April-28 |
| |
| New features: |
| * New module - Balanced datasets: A sampling module to balance classification |
| datasets by resampling using various techniques including undersampling, |
| oversampling, uniform sampling or user-defined proportion sampling |
| (MADLIB-1168) |
| * Mini-batch: Added a mini-batch optimizer for MLP and a preprocessor function |
| necessary to create batches from the data (MADLIB-1200, MADLIB-1206, |
| MADLIB-1220, MADLIB-1224, MADLIB-1226, MADLIB-1227) |
| * k-NN: Added weighted averaging/voting by distance (MADLIB-1181) |
| * Summary: Added additional stats: number of positive, negative, zero values and |
| 95% confidence intervals for the mean (MADLIB-1167) |
| * Encode categorical: Updated to produce lower-case column names when possible |
| (MADLIB-1202) |
| * MLP: Added support for already one-hot encoded categorical dependent variable |
| in a classification task (MADLIB-1222) |
| * Pagerank: Added an option for personalization vertices: a subset of |
| vertices can be given a higher weight, so that a random surfer is more |
| likely to jump to these vertices than to other vertices (MADLIB-1084) |
| |
| Bug fixes: |
| - Fixed issue with invalid calls of construct_array that led to problems |
| in PostgreSQL 10 (MADLIB-1185) |
| - Added newline between file concatenation during PGXN install (MADLIB-1194) |
| - Fixed upgrade issues in knn (MADLIB-1197) |
| - Added fix to ensure RF variable importance values are always non-negative |
| - Fixed inconsistency in LDA output and improved usability |
| (MADLIB-1160, MADLIB-1201) |
| - Fixed MLP and RF predict for models trained in earlier versions to |
| ensure missing optional parameters are given appropriate default values |
| (MADLIB-1207) |
| - Fixed a database crash in DT that occurred when no features remained |
| because categorical columns with a single level were dropped |
| - Fixed step size initialization in MLP based on learning rate policy |
| (MADLIB-1212) |
| - Fixed PCA issue that leads to failure when grouping column is a TEXT type |
| (MADLIB-1215) |
| - Fixed cat levels output in DT when grouping is enabled (MADLIB-1218) |
| - Fixed and simplified initialization of model coefficients in MLP |
| - Removed source table dependency for predicting regression models in MLP |
| (MADLIB-1223) |
| - Print loss of first iteration in MLP (MADLIB-1228) |
| - Fixed MLP failure on GPDB 4.3 when verbose=True (MADLIB-1209) |
| - Fixed RF issue that showed up when var_importance=True with no continuous |
| features (MADLIB-1219) |
| - Fixed DT/RF issue for null_as_category=True and grouping enabled |
| (MADLIB-1217) |
| |
| Other: |
| - Reduced install-check runtime for PCA, DT, RF, elastic net (MADLIB-1216) |
| - Added CentOS 7 PostgreSQL 9.6/10 docker files |
| |
| --------------------------------------------------------------------------- |
| MADlib v1.13: |
| |
| Release Date: 2017-December-22 |
| |
| New features: |
| * New module: Graph - HITS (MADLIB-1124, MADLIB-1151) |
| * k-NN: |
| - Added additional distance metrics (MADLIB-1059) |
| - Added list of neighbors in output table (MADLIB-1129) |
| * MLP: Added grouping support (MADLIB-1149) |
| * Cross Validation: Improved the stats reporting in output table (MADLIB-1169) |
| * Correlation: Improved quality of results by ignoring only a NULL value and |
| not the whole row containing the NULL (MADLIB-1166) |
| |
| Bug fixes: |
| - Fixed issue with Decision Trees (DT) trained in older versions not |
| being usable in predict of v1.12 (MADLIB-1161) |
| - Fixed invalid assert statement in DT (MADLIB-1164) |
| - Improved feature array handling in DT (MADLIB-1173) |
| - Fixed install-check failures on non-default schema installation (MADLIB-1177, 1184) |
| |
| Other: |
| - Updated PyXB from 1.2.4 to 1.2.6. (MADLIB-1103) |
| This change eliminates the need to remove part of PyXB code base as a |
| GPL-workaround. |
| - Updated the naming for gppkg (MADLIB-1183) |
| |
| --------------------------------------------------------------------------- |
| MADlib v1.12: |
| |
| Release Date: 2017-August-18 |
| |
| New features: |
| * New module: Graph - All Pairs Shortest Path (MADLIB-1072, MADLIB-1099, MADLIB-1106) |
| * New module: Graph - Weakly Connected Components (MADLIB-1071, MADLIB-1083, MADLIB-1101) |
| * New module: Graph - Breadth First Search (MADLIB-1102) |
| * New module: Graph - Measures (MADLIB-1073) |
| * New Module: Sample - Stratified Sampling (MADLIB-986) |
| * New Module: Sample - Train-test split (MADLIB-1119) |
| * New Module: Multilayer Perceptron (MADLIB-413, MADLIB-1134) |
| * DT and RF: |
| - Allow expressions in feature list (MADLIB-1087) |
| - Allow array input for features (MADLIB-965) |
| - Filter NULL dependent values in OOB (MADLIB-1097) |
| - Add option to treat NULL as category |
| * Summary: |
| - Allow user to determine the number of columns per run (MADLIB-1117) |
| - Improve efficiency of computation time by ~35% (MADLIB-1104) |
| * Sketch: |
| - Promote cardinality estimators to top level module from early stage (MADLIB-1120) |
| * Add basic code coverage support (MADLIB-1138) |
| * Updates for Apache Top Level Project readiness (MADLIB-1112, MADLIB-1130, MADLIB-1133, MADLIB-1142) |
| |
| Bug fixes: |
| - DT and RF: |
| - Fix array to string conversion with CV |
| - Include NULL rows in count for termination check |
| - Sketch: |
| - Remove per-tuple checks for better performance |
| - PageRank: |
| - Fix multiple bugs and perf issue in grouping (MADLIB-1100, MADLIB-1107) |
| - Kmeans: |
| - Fix IC drop table statements |
| - Graph: |
| - Fix quoted output table name bug (MADLIB-1137) |
| - Fix empty string arguments bug |
| - Elastic Net: |
| - Fix the data scaling bug with normalization (MADLIB-1094) |
| - Reduce the tolerance for a faster IC test (MADLIB-1118) |
| - Control: |
| - Update 'optimizer' GUC only if editable (MADLIB-1109) |
| |
| Other: |
| - Build: Add CDATA block to avoid invalid xml |
| - Multiple user documentation improvements |
| |
| --------------------------------------------------------------------------- |
| MADlib v1.11: |
| |
| Release Date: 2017-May-05 |
| |
| New features: |
| * New module: Graph - PageRank |
| - Implements the original PageRank algorithm that assumes a random surfer model |
| (https://en.wikipedia.org/wiki/PageRank#Damping_factor) (MADLIB-1069) |
| - Grouping support is included for PageRank (MADLIB-1082) |
| * Graph - Single Source Shortest Path (SSSP): Add grouping support (MADLIB-1081) |
| * Pivot: Add support for array and svec output types (MADLIB-1066) |
| * DT and RF: |
| - Change default values for 2 parameters (max_depth and num_splits) |
| - Reduce memory footprint: Assign memory only for reachable nodes (MADLIB-1057) |
| - Include rows with NULL features in training (MADLIB-1095) |
| - Update error message for invalid parameter specification (num_splits) |
| * Array Operations: Add function to unnest 2-D arrays by one level into rows |
| of 1-D arrays (MADLIB-1086) |
| * Build process on Apache infrastructure (MADLIB-920, MADLIB-1080) |
| * Updates for Apache Top Level Project readiness (MADLIB-1022, MADLIB-1076, |
| MADLIB-1077, MADLIB-1090) |
| * Support for GPDB 5.0 |
| |
| Bug fixes: |
| - DT and RF: |
| - Fix accuracy issues related to integer categorical variables and tree depth |
| - Improve visualization of tree(s) |
| - Elastic Net: |
| - Fix install check on GPDB 5.0 and HAWQ 2.2 (MADLIB-1088) |
| - Fix inconsistent results with grouping (MADLIB-1092) |
| - PCA: Fix install check |
| |
| Other: |
| - PMML: Skip install check when run without the '-t' option (MADLIB-1078) |
| - Multiple user documentation improvements |
| |
| --------------------------------------------------------------------------- |
| MADlib v1.10.0 |
| |
| Release Date: 2017-February-17 |
| |
| New features: |
| * New module: Graph - Single Source Shortest Path (SSSP) (MADLIB-992) |
| - Calculate the shortest path from a given vertex to every vertex in the graph. |
| * New module: Encode categorical variables (MADLIB-1038) |
| - Completely new version for dummy/one-hot encoding of categorical variables |
| with new name and different arguments. |
| - Previous version has been deprecated. |
| * New module (early stage): K-Nearest Neighbors (KNN) (MADLIB-927) |
| - Find the k nearest neighbors based on the squared_dist_norm2 metric. |
| * Elastic Net: Add grouping support (MADLIB-950) |
| - Elastic net train for both Gaussian and Binomial models, with FISTA |
| and IGD optimizations support grouping. |
| - Use active sets for FISTA, but active sets are used only after the |
| log-likelihood of all the groups becomes 0. |
| * Elastic Net: Add cross validation (MADLIB-996) |
| * PCA: Add grouping support (MADLIB-947) |
| * PCA: Removed column id restriction. |
| * Kmeans: Cluster variance for PivotalR support. |
| * Kmeans: Support for array input. (MADLIB-1018) |
| * DT and RF: Verbose option for the dot output format. (MADLIB-1051) |
| * Association Rules: Add rule counts and limit itemset size feature |
| (MADLIB-1044, MADLIB-1031) |
| * Boost library has been upgraded from 1.47 to 1.61 |
| * Multiple improvements to the build system (madpack, cmake etc.) to support |
| Semantic versioning and various versions of GPDB and HAWQ. |
| |
| Bug fixes: |
| - Pivot: Adjust the warning level to remove redundant messages. |
| - RF: Fix the online help and examples. |
| - Utilities: Fix incorrect flag for distribution. |
| - Install check: Update date format and remove hardcoded schema names. |
| - Multiple user documentation improvements. |
| |
| --------------------------------------------------------------------------- |
| MADlib v1.9.1 |
| |
| Release Date: 2016-August-25 |
| |
| New features: |
| * New function: One class SVM (MADLIB-990) |
| - Added a one-class SVM that classifies new data as similar or different to |
| the training set. |
| - This method is an unsupervised method that builds a decision boundary |
| between the data and origin in kernel space and can be used as a novelty |
| detector. |
| * SVM: Added functionality to assign weights to each class, simplifying |
| classification of unbalanced data. (MADLIB-998) |
| * New function: Prediction metrics (MADLIB-907) |
| Added a collection of summary statistics to gauge model accuracy based on |
| predicted values vs. ground-truth values. |
| * New function: Sessionization (MADLIB-909, MADLIB-1001) |
| Added a sessionize function to perform session reconstruction on a data |
| set so it can be prepared for input into other algorithms such as |
| path functions or predictive analytics algorithms. |
| * New function: Pivot (MADLIB-908, MADLIB-1004) |
| Added a function that can do basic OLAP-type operations on data stored |
| in one table and output the summarized data to a second table. |
| * Path: Major performance improvement (MADLIB-984) |
| * Path: Add support for overlapping patterns (MADLIB-995) |
| * Build: Add support for PG 9.5 and 9.6 (MADLIB-944) |
| * PGXN: Update PostgreSQL Extension Network to latest release (MADLIB-959) |
| |
| Bug fixes: |
| - Random Forest: Fix filtered feature related bug (MADLIB-928) |
| - Elastic Net: Skip arrays with NULL values in train (MADLIB-978) |
| - Matrix: Fix starting index in extract functions (MADLIB-1006) |
| - Path: Allow multiple expressions in partition expression (MADLIB-1003) |
| - DT: Fix bin computation for boolean features (MADLIB-1011) |
| - Multiple user documentation improvements (MADLIB-1001) |
| |
| --------------------------------------------------------------------------- |
| MADlib v1.9 |
| |
| Release Date: 2016-April-04 |
| |
| New features: |
| * New module: Path |
| - Perform pattern matching over a sequence of rows and extract useful |
| information about the pattern matches. |
| - Useful in a wide variety of use cases: on-line shopping, predictive |
| maintenance, cyber security, IoT, customer churn, etc. |
| - Define arbitrarily complex symbols to identify rows of interest. |
| - Perform regular pattern matching of symbols over a sequence of ordered partitions. |
| - Extract useful information about the pattern matches (counts, |
| aggregations, window functions). |
| * New module: Support Vector Machines (SVM) |
| - Complete rewrite of SVM algorithm to improve accuracy and performance. |
| - Support for classification and regression. |
| - Support for non-linear kernels (Gaussian and Polynomial). |
| - Cross validation support on parameters: lambda, epsilon, initial step size, |
| maximum iterations, and decay factor. |
| * New module: Stemmer function |
| - Compute the root of any English text input using Porter2 stemming algorithm. |
| * New matrix operations (Phase 2) |
| - Added the following operations/functions for dense and sparse matrices: |
| - Representation: get matrix dimensions |
| - Extraction/visitor methods: extract diagonal elements |
| - Reduction operations: compute matrix norm |
| - Creation methods: initialize with ones, initialize with zeros, |
| square identity matrix, diagonal matrix, sample from distribution |
| (Normal, Uniform, Bernoulli) |
| - Decomposition operations: inverse, generic inverse, eigen extraction, |
| Cholesky decomposition, QR decomposition, LU decomposition, nuclear norm, rank |
| * Pearson's correlation module: added option to return the covariance matrix |
| * PCA: added option to use proportion of variance to determine number of |
| principal components to return (MADLIB-948) |
| * PivotalR support for Latent Dirichlet Allocation (LDA) |
| * Quotation and international character support (Phase 2) |
| - All modules now support table and column names that are quoted and |
| contain international characters. This release adds support for: |
| - Cross Validation |
| - Dense Linear Systems |
| - Sparse Linear Systems |
| - Low-rank Matrix Factorization |
| - Conditional Random Field |
| - Hypothesis Tests |
| - Support Modules/Data Preparation |
| - Support Modules/PMML Export |
| - ARIMA |
| * New platform: |
| - Added support for HAWQ 2.0 |
| * Miscellaneous: |
| - Updated documentation and more examples |
| - Term frequency: added support for custom column names |
| - Updated licensing files and headers to comply with ASF regulations |
| |
| Bug fixes: |
| - Elastic Net: Skips arrays with NULL values in predict (MADLIB-919) |
| - Hello World example: Fixed 'this' pointer errors (MADLIB-967) |
| - Hypothesis tests: Fixed docs and examples (MADLIB-895) |
| - Matrix: Fixed inconsistent type in drop statements |
| - Decision Tree: Fixed format specifier in online help (MADLIB-968) |
| - Minor: Updated volatile install-check |
| - LDA: Fixed the padding for LDA model |
| - Decision tree: Fixed to cast count(*) output to long (MADLIB-917) |
| - Validation: Fixed varchar array error in install-check |
| - Matrix: Fixed multiple input/output issues (MADLIB-932) |
| - Matrix: Fixed minor issue with sparse LU output |
| - Summary: Fixed the case for unquoted table names by moving the compare to |
| SQL (MADLIB-954) |
| - Correlation: Fixed to return columns sorted in ordinal position. (MADLIB-941) |
| - Elastic Net: Removed the enforcement of same numeric type while keeping the |
| error for non-numeric types. (MADLIB-952) |
| - K-means: Fixed the error caused by a null value in the matrix or vector. |
| (MADLIB-946) |
| |
| -------------------------------------------------------------------------------- |
| MADlib v1.8 |
| |
| Release Date: 2015-July-17 |
| |
| New features: |
| * Improved Latent Dirichlet Allocation (LDA) Performance |
| - Function lda_train() is about twice as fast. |
| - Improved the scalability of the function |
| (vocabulary size x number of topics can be up to 250 million). |
| * New module: Matrix operations |
| Added the following operations/functions for dense and sparse matrices: |
| - Mathematical operations: addition, subtraction, multiplication, |
| element-wise multiplication, scalar and vector multiplication. |
| - Aggregation operations: apply various operations including |
| max, min, sum, mean along a specified dimension. |
| - Visitor methods: extract row/column from matrix. |
| - Representation: convert a matrix to either dense or sparse representation. |
| * Quotation and International Character Support |
| - Most modules now support table and column names that are quoted and |
| contain international characters, including: |
| - Regression models (GLMs, linear regression, elastic net, etc.) |
| - Decision trees and random forests |
| - Unsupervised learning models (association rules, k-means, LDA, etc.) |
| - Summary, Pearson's correlation, and PCA |
| * Array Norms and Distances |
| - Generic p-norm distance |
| - Jaccard distance |
| - Cosine similarity |
| * Text Analysis: |
| - Text utility for term frequency and vocabulary construction (prepares |
| documents for input to LDA). |
| * Miscellaneous |
| - Improved organization of User and Developer guide at doc.madlib.net/latest. |
| - Low-rank matrix factorization: added 32-bit integer support (MADLIB-903). |
| - Cross-validation: added classification support (MADLIB-908). |
| - Added a new clean-up function for removing MADlib temporary tables. |
| |
| Note: |
| - LDA models that are trained using MADlib v1.7.1 or earlier need to be |
| re-trained to be used in MADlib v1.8. |
| |
| Known issues: |
| - Performance for decision tree with cross-validation is poor on a HAWQ |
| multi-node system. |
| |
| -------------------------------------------------------------------------------- |
| MADlib v1.7.1 |
| |
| Release Date: 2015-March-18 |
| |
| New features: |
| * Random Forest Performance Improvement |
| - Function forest_train() is 1.5X ~ 4X faster without variable importance, |
| and up to 100X faster with variable importance |
| - Function forest_predict() is up to 10X faster when type='response' |
| - Allow user-specified sample ratio to train with a small subsample |
| * Gaussian Naive Bayes: allow continuous variables |
| * K-Means: Allow user-specified sample ratio for K-means++ seeding |
| * Miscellaneous |
| - Array functions: array_square() for element-wise square, madlib.sum() |
| for array element-wise aggregation |
| - Madpack does not require password when not necessary (MADLIB-357) |
| - Platform support of PostgreSQL 9.4 and HAWQ 1.3 |
| - Allow views and materialized views for training functions |
| - Support quantile computation in summary functions for HAWQ and PG 9.4 |
| |
| Bug fixes: |
| - Fixed the support of multiple parameter values and NULL in general |
| cross-validation (MADLIB-898, MADLIB-896) |
| - Fixed infinite loop when detecting recursive view-to-view dependencies for |
| upgrading (MADLIB-901) |
| - Allow user-specified column names in PCA and multinom_predict() |
| |
| Known issues: |
| - Performance for decision tree with cross-validation is poor on a HAWQ |
| multi-node system. |
| |
| -------------------------------------------------------------------------------- |
| MADlib v1.7 |
| |
| Release Date: 2014-December-31 |
| |
| New features: |
| * Generalized Linear Model: |
| - Added a new generic module for GLM functions that allow for response |
| variables that have arbitrary distributions (rather than simply |
| Gaussian distributions), and for an arbitrary function of the response |
| variable (the link function) to vary linearly with the predicted values |
| (rather than assuming that the response itself must vary linearly). |
| - Available distribution families: gaussian (link functions: identity, |
| inverse and log), binomial (link functions: probit and logit), |
| poisson (link functions: log, identity and square-root), gamma (link |
| functions: inverse, identity and log) and inverse gaussian (link functions: |
| square-inverse, inverse, identity and log). |
| - Deprecated 'mlogregr_train' in favor of 'multinom' available as part of |
| the new GLM functionality. |
| - Added a new 'ordinal' function for ordered logit and probit regression. |
| * Decision Tree: Reimplemented the decision tree module, which includes the |
| following changes: |
| - Improved usability due to a new interface. |
| - Performance enhancements: up to 40 times faster than the old interface. |
| - Additional features like pruning methods, surrogate variables for |
| NULL handling, cross validation, and various new tree tuning parameters. |
| - Addition of a new display function to visualize the trained tree and new |
| prediction function for scoring of new datasets. |
| * Random Forest: Reimplemented the random forest module, which includes the |
| following changes: |
| - New random forest module based on the new decision tree module. |
| - Better variable importance metrics and ability to explore each tree |
| in the forest independently. |
| - Ability to get class probabilities of all classes and not just the max |
| class during prediction. |
| - Improved visualization with export capabilities using Graphviz dot format. |
| * PMML: |
| - Upgraded compatible PMML version to 4.1. |
| - Moved PMML export out of early stage development with new functionality |
| available to export GLM, decision tree, and random forest models. |
| * Updated Eigen from 3.1.2 to 3.2.2. |
| * Updated PyXB from 1.2.3 to 1.2.4. |
| * Added finer granularity control for running specific install-check tests. |
| |
| Bug fixes: |
| - Fixed bug in K-means allowing use of user-defined metric functions |
| (MADLIB-874, MADLIB-875). |
| - Fixed issues related to header files included in the build system |
| (MADLIB-855, MADLIB-879, MADLIB-884). |
| |
| Known issues: |
| - Performance for decision tree with cross-validation is poor on a HAWQ |
| multi-node system. |
| |
| -------------------------------------------------------------------------------- |
| MADlib v1.6 |
| |
| Release Date: 2014-June-30 |
| |
| New features: |
| - Added a new unified 'margins' function that computes marginal effects for |
| linear, logistic, multilogistic, and cox proportional hazards regression. The |
| new function also introduces support for interaction terms in the independent |
| array. |
| - Updated convergence for 'elastic_net_train' by checking the change in the |
| loglikelihood instead of the l2-norm of the change in coefficients. This allows |
| for faster convergence in problems with multiple optimal solutions. |
| The default threshold for convergence has been reduced from 1e-4 to 1e-6. |
| - Added a new helper function to convert categorical variables to indicator |
| variables which can be used directly in regression methods. The function |
| currently only supports dummy encoding. |
| - Improved performance for cox proportional hazards: average improvement of |
| 20 fold on GPDB and 2.5 fold on HAWQ. |
| - Improved performance on ARIMA by 30%. |
| - Added new functionality to export linear and logistic regression models as a |
| PMML object. The new module relies on PyXB to create PMML elements. |
| - Added a function ('array_scalar_add') to 'add' a scalar to an array. |
| - Added 'numeric' type support for all functions that take 'anyarray' as |
| argument. |
| - Made usability and aesthetic enhancements to documentation. |
| |
| Bug Fixes: |
| - Prepended python module name to sys.path before executing madlib function |
| to avoid conflicts with user-defined modules. |
| - Added a check in K-Means to ensure that the dimensionality of all data |
| points is the same and equal to the dimensionality of any provided initial |
| centroids (MADLIB-713, MADLIB-789). |
| - Added a check in multinomial regression to quit early and cleanly if model |
| size is greater than the maximum permissible memory (MADLIB-667). |
| - Fixed a minor bug with incorrect column names in the decision trees module |
| (MADLIB-763). |
| - Fixed a bug in Kmeans that resulted in incorrect number of centroids for |
| particular datasets (MADLIB-857). |
| - Fixed bug when grouping columns have same name as one of the output table |
| column names (MADLIB-833). |
| |
| Deprecated Functions: |
| - Modules profile and quantile have been deprecated in favor of the 'summary' |
| function. |
| - Module 'svd_mf' has been deprecated in favor of the improved 'svd' function. |
| - Functions 'margins_logregr' and 'margins_mlogregr' have been deprecated in |
| favor of the 'margins' function. |
| |
| -------------------------------------------------------------------------------- |
| MADlib v1.5 |
| |
| Release Date: 2014-Mar-05 |
| |
| New features: |
| - Added a new port 'HAWQ'. MADlib can now be used with the Pivotal |
| Distribution of Hadoop (PHD) through HAWQ |
| (see http://www.gopivotal.com/big-data/pivotal-hd for more details). |
| - Implemented performance improvements for linear and logistic predict functions. |
| - Moved Conditional Random Fields (CRFs) out of early stage development, and |
| updated the design and APIs to enable ease of use and better functionality. |
| API changes include lincrf replaced by lincrf_train, crf_train_fgen and |
| crf_test_fgen with updated arguments, and format of segment tables. |
| - Improved linear support vector machines (SVMs) by enabling iterations, and |
| removed lsvm_predict and svm_predict, which are not useful in GPDB and HAWQ. |
| - Added new functions, with improved performance compared to svec_sfv, for |
| document vectorization into sparse vectors. |
| - Removed the bool-to-text cast and updated all functions depending on it to |
| explicitly convert variable to text. |
| - Added function properties for all SQL functions to allow the database optimizer |
| to make better plans. |
| |
| Bug Fixes: |
| - Set client_min_messages to 'notice' during database installation to ensure |
| that log messages don't get logged to STDERR. |
| - Fixed elastic net prediction to predict using all features instead of just |
| the selected features to avoid an error when no feature is selected as relevant |
| in the trained model. |
| - Fixed quantiles of bernoulli and binomial distributions for the corner |
| probability values p=0 and p=1: the quantile values are now 0 and |
| num_of_trials (=1 in the case of bernoulli) respectively, independent of the |
| probability of success. |
| - Changed install script to explicitly use /bin/bash instead of /bin/sh to avoid |
| problems in Ubuntu where /bin/sh is linked to 'dash'. |
| - Fixed issue in Elastic Net to take any array expression as input instead of |
| specifically expecting the expression 'ARRAY[...]'. |
| - Fixed wrong output in percentile of count-min (CM) sketches. |
| |
| Known issues: |
| - Elastic net prediction wrapper function elastic_net_prediction is not |
| available in HAWQ. Instead, prediction functionality is available for both |
| families via elastic_net_gaussian_predict and elastic_net_binomial_predict. |
| - Distance metrics functions in K-Means for the HAWQ port are restricted to the |
| in-built functions, specifically squaredDistNorm2, distNorm2, distNorm1, |
| distAngle, and distTanimoto. |
| - Functions in Quantile and Profile modules of Early Stage Development are not |
| available in HAWQ. Replacement of these functions is available as built-in |
| functions (percentile_cont) in HAWQ and Summary module in MADlib, respectively. |
| |
| -------------------------------------------------------------------------------- |
| MADlib v1.4.1 |
| |
| Release Date: 2013-Dec-13 |
| |
| Bug Fixes: |
| - Fixed problem in Elastic Net for 'binomial' family if an 'integer' column was |
| passed for dependent variable instead of a 'boolean' column. |
| - '*' support in Elastic Net lacked checks for the columns being combined. Now |
| we check if the column for '*' is already an array, in which case we don't wrap |
| it with an 'array' modifier. If there are multiple columns we check that they |
| are of the same numeric type before building an array. |
| - Fixed a software regression in Robust Variance, Clustered Variance and |
| Marginal Effects for multinomial regression introduced in v1.4 when |
| output table name is schema-qualified. |
| - We now also support schema-qualified output table prefixes for SVD and PCA. |
| - Added warning message when deprecated functions are run. Also added a list of |
| deprecated functions in the ReadMe. |
| - Added a Markdown Readme along with the text version for better rendering on |
| Github. |
| |
| -------------------------------------------------------------------------------- |
| MADlib v1.4 |
| |
| Release Date: 2013-Nov-25 |
| |
| New Features: |
| * Improved interface for Multinomial logistic regression: |
| - Added a new interface that accepts an 'output_table' parameter and |
| stores the model details in the output table instead of returning as a struct |
| data type. The updated function also builds a summary table that includes |
| all parameters and meta-parameters used during model training. |
| - The output table has been reformatted to present the model coefficients |
| and related metrics for each category in a separate row. This replaces the |
| old output format of model stats for all categories combined in a |
| single array. |
| * Variance Estimators |
| - Added Robust Variance estimator for Cox PH models (Lin and Wei, 1989). |
| It is useful in calculating variances in a dataset with potentially |
| noisy outliers. Namely, the standard errors are asymptotically normal even |
| if the model is wrong due to outliers. |
| - Added Clustered Variance estimator for Cox PH models. It is used |
| when data contains extra clustering information besides covariates, and |
| it yields asymptotically normal estimates. |
| * NULL Handling: |
| - Modified behavior of regression modules to 'omit' rows containing NULL |
| values for any of the dependent and independent variables. The number of |
| rows skipped is provided as part of the output table. |
| This release includes NULL handling for the following modules: |
| - Linear, Logistic, and Multinomial logistic regression, as well as |
| Cox Proportional Hazards |
| - Huber-White sandwich estimators for linear, logistic, and multinomial |
| logistic regression as well as Cox Proportional Hazards |
| - Clustered variance estimators for linear, logistic, and multinomial |
| logistic regression as well as Cox Proportional Hazards |
| - Marginal effects for logistic and multinomial logistic regression |
| |
| Deprecated functions: |
| - Multinomial logistic regression function has been renamed to |
| 'mlogregr_train'. Old function ('mlogregr') has been deprecated, |
| and will be removed in the next major version update. |
| |
| - For all multinomial regression estimator functions (list given below), |
| changes in the argument list were made to collate all optimizer specific |
| arguments in a single string. An example of the new optimizer parameter is |
| 'max_iter=20, optimizer=irls, precision=0.0001'. |
| This is in contrast to the original argument list that contained 3 arguments: |
| 'max_iter', 'optimizer', and 'precision'. This change allows adding new |
| optimizer-specific parameters without changing the argument list. |
| Affected functions: |
| - robust_variance_mlogregr |
| - clustered_variance_mlogregr |
| - margins_mlogregr |
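To illustrate the single-string optimizer argument described above, here is a minimal Python sketch that splits a string such as 'max_iter=20, optimizer=irls, precision=0.0001' into named values. The helper name `parse_optimizer_params` and its type-conversion table are assumptions made for this example; this is not how MADlib parses the string internally.

```python
def parse_optimizer_params(param_str):
    """Parse 'name=value, name=value, ...' into a dict.

    Hypothetical helper for illustration only. The three names below come
    from the example string in the release note; any other name is kept as
    a plain string, which is what lets new parameters be added without
    changing the function's argument list.
    """
    converters = {"max_iter": int, "optimizer": str, "precision": float}
    params = {}
    for item in param_str.split(","):
        item = item.strip()
        if not item:
            continue                      # tolerate empty segments
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected 'name=value', got {item!r}")
        key = key.strip().lower()
        params[key] = converters.get(key, str)(value.strip())
    return params
```

Calling `parse_optimizer_params('max_iter=20, optimizer=irls, precision=0.0001')` yields a dict with an integer `max_iter`, a string `optimizer`, and a float `precision`.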
| |
| Bug Fixes: |
| - Fixed an overflow problem in LDA by using INT64 instead of INT32. |
| - Fixed integer to boolean cast bug in clustered variance for logistic |
| regression. After this fix, integer columns are accepted for binary |
| dependent variable using the 'integer to bool' cast rules. |
| - Fixed two bugs in SVD: |
| - The 'example' option for online help has been fixed |
| - Column names for sparse input tables in the 'svd_sparse' and |
| 'svd_sparse_native' functions are no longer restricted to 'row_id', |
| 'col_id' and 'value'. |
| |
| -------------------------------------------------------------------------------- |
| MADlib v1.3 |
| |
| Release Date: 2013-Oct-03 |
| |
| New Features: |
| * Cox Proportional Hazards: |
| - Added stratification support for Cox PH models. Stratification is used as |
| shorthand for building a Cox model that allows for more than one stratum, |
| and hence, allows for more than one baseline hazard function. |
| Stratification provides two pieces of key, flexible functionality for the |
| end user of Cox models: |
| -- Allows a categorical variable Z to be appropriately accounted for in |
| the model without estimating its predictive impact on the response |
| variable. |
| -- Categorical variable Z is predictive/associated with the response |
| variable, but Z may not satisfy the proportional hazards assumption |
| - Added a new function (cox_zph) that tests the proportional hazards |
| assumption of a Cox model. This allows the user to build Cox models and then |
| verify the relevance of the model. |
| * NULL Handling: |
| - Modified behavior of linear and logistic regression to 'omit' rows |
| containing NULL values for any of the dependent and independent variables. |
| The number of rows skipped is provided as part of the output table. |
| |
| Deprecated functions: |
| - Cox Proportional Hazard function has been renamed to 'coxph_train'. |
| Old function names ('cox_prop_hazards' and 'cox_prop_hazards_regr') |
| have been deprecated, and will be removed in the next major version update. |
| - The aggregate form of linear regression ('linregr') has been deprecated. |
| The stored-procedure form ('linregr_train') should be used instead. |
| |
| Bug Fixes: |
| - Fixed a memory leak in the Apriori algorithm. |
| |
| |
| -------------------------------------------------------------------------------- |
| MADlib v1.2 |
| |
| Release Date: 2013-Sep-06 |
| |
| New Features: |
| * ARIMA Timeseries modeling |
| - Added auto-regressive integrated moving average (ARIMA) modeling for |
| non-seasonal, univariate timeseries data. |
| - Module includes a training function to compute an ARIMA model and a |
| forecasting function to predict future values in the timeseries |
| - Training function employs the Levenberg-Marquardt algorithm (LMA) to |
| compute a numerical solution for the parameters of the model. The |
| observations and innovations for time before the first timestamp |
| are assumed to be zero leading to minimization of the conditional sum of |
| squares. This produces estimates referred to as conditional maximum likelihood |
| estimates (also referred to as 'CSS' in some statistical packages). |
| * Documentation updates: |
| - Introduced a new format for documentation improving usability. |
| - Upgraded to Doxygen v1.8.4. |
| - Updated documentation improving consistency for multiple modules including |
| Regression methods, SVD, PCA, Summary function, and Linear systems. |
| Bug fixes: |
| - Out-of-bounds access of an 'svec' is now checked even if the size of the svec is zero. |
| - Fixed a minor bug allowing use of GCC 4.7 and higher to build from source. |
| -------------------------------------------------------------------------------- |
| MADlib v1.1 |
| |
| Release Date: 2013-Aug-09 |
| |
| New Features: |
| * Singular Value Decomposition: |
| - Added Singular Value Decomposition using the Lanczos bidiagonalization |
| iterative method to decompose the original matrix into PBQ^T, where B is |
| a bidiagonalized matrix. We assume that the original matrix is too big to |
| load into memory but B can be loaded into memory. B is then further |
| decomposed into XSY^T using Eigen's JacobiSVD function. This restricts the |
| number of features in the data matrix to about 5000. |
| - This implementation provides SVD (for dense matrix), SVD_BLOCK (also for |
| dense matrix but faster), SVD_SPARSE (convert a sparse matrix into a |
| dense one, slower) and SVD_SPARSE_NATIVE (directly operate on the sparse |
| matrix, much faster for really sparse matrices). |
| |
| * Principal Component Analysis: |
| - Added a PCA training function that generates the top-K principal |
| components for an input matrix. The original data is mean-centered by the |
| function with the mean matrix returned by the function as a separate table. |
| - The module also includes the projection function that projects a test data |
| set to the principal components returned by the train function. |
| |
| * Linear Systems: |
| - Added a module to solve linear systems of equations (Ax = b). |
| - The module utilizes various direct methods from the Eigen library for |
| dense systems. Given below is a summary of the methods (more details at |
| http://eigen.tuxfamily.org/dox-devel/group__TutorialLinearAlgebra.html): |
| - Householder QR |
| - Partial Pivoting LU |
| - Full Pivoting LU |
| - Column Pivoting Householder QR |
| - Full Pivoting Householder QR |
| - Standard Cholesky decomposition (LLT) |
| - Robust Cholesky decomposition (LDLT) |
| - The module also includes direct and iterative methods for sparse linear |
| systems: |
| Direct: |
| - Standard Cholesky decomposition (LLT) |
| - Robust Cholesky decomposition (LDLT) |
| Iterative: |
| - In-memory Conjugate gradient |
| - In-memory Conjugate gradient with diagonal preconditioners |
| - In-memory Bi-conjugate gradient |
| - In-memory Bi-conjugate gradient with incomplete LU preconditioners |
| |
| Bug fixes and other changes: |
| * Robust input validation: |
| - Validation of input parameters to various functions has been improved to |
| ensure that it does not fail if double quotes are included as part of the |
| table name. |
| * Random Forest |
| - The ID field in rf_train has been expanded from INT to BIGINT (MADLIB-764) |
| * Various documentation updates: |
| - Documentation updated for various modules including elastic net, linear |
| and logistic regression. |
| -------------------------------------------------------------------------------- |
| MADlib v1.0 |
| |
| Release Date: 2013-Jul-03 |
| |
| New Features: |
| * Cox Proportional Hazards: |
| - Added Right Censoring support for Cox Prop Hazards |
| * Robust Variance Tests - Huber White: |
| - Added a method of calculating robust variance statistic by utilizing the |
| Huber-White sandwich estimator for linear regression, logistic regression, |
| and multinomial logistic regression |
| - Robust variance for linear and logistic regression also includes |
| grouping support |
| * Clustered Sandwich Estimators: |
| - Added clustered robust variance statistic by utilizing a clustered sandwich |
| estimator for linear regression, logistic regression, and multinomial |
| logistic regression |
| - Grouping is currently not implemented for clustered variance and the |
| parameter is only a placeholder at present |
| * Marginal Effects Estimator: |
| - Added a method for computing the marginal effects for logistic regression |
| and multinomial logistic regression |
| - Grouping is currently not implemented for marginal effects and the |
| parameter is only a placeholder at present |
| * Multinomial logistic regression: |
| - Added a parameter in multinomial logistic regression, to enable picking |
| the reference category. Input for number of categories has been removed |
| due to redundancy |
| * Linear regression: |
| - Updated grouping columns to input as a comma delimited string rather |
| than as an array |
| - Resolved an issue with highly collinear data to produce results consistent |
| with other statistical packages. Threshold on condition number to use an |
| approximation for computing the pseudo-inverse was increased. |
| * Logistic regression: |
| - Changed behavior to error out if the output table already exists |
| |
| Bug fixes: |
| * Summary: |
| - Summary function (when used with quartiles) used excessive memory when the |
| number of columns was large. This has been fixed by computing quartiles in an |
| iterative manner for a fixed number of columns (Pivotal-170) |
| - Fixed a problem with incorrect number of rows returned for Summary when |
| all values in a column are NULL (Pivotal-171) |
| -------------------------------------------------------------------------------- |
| MADlib v0.7 |
| |
| Release Date: 2013-May-01 |
| |
| New Features: |
| * Correlation function: |
| - Function to compute Pearson's cross-correlation for numeric columns in a |
| relational table |
| * Upgrade capability: |
| - All versions from v0.7 onward are installed in a version-specific folder |
| (/usr/local/madlib/Versions/) |
| - Upgrade from v0.5/v0.6 to v0.7 on the database is now supported without |
| uninstalling previous MADlib database installation. |
| - Dependencies on updated functions, types, and other operators are caught |
| and upgrade is aborted with an appropriate message |
| |
| Bug fixes: |
| * Linear Regression: |
| - Improved matrix inversion method to compute coefficients comparable to R |
| for regression problems with high multicollinearity (MADLIB-790) |
| * Logistic Regression: |
| - Fixed a problem in logistic regression with grouping on 'text' datatype |
| columns (MADLIB-791) |
| |
| Known issues: |
| * Upgrade: |
| - Views dependent on MADlib functions being updated will be dropped during |
| the upgrade and restored after finishing upgrade. If upgrade fails for |
| any reason, these views and the original MADlib schema will *not* be |
| restored. Before initiating the upgrade, we recommend taking a backup of |
| the MADlib schema, moving all views dependent on MADlib to a separate |
| schema, and performing a backup with: |
| pg_dump -n 'schema_name' |
| |
| - Upgrade is currently not supported for the PostgreSQL platform and will |
| abort with an error |
| |
| - Upgrade currently does not detect functions defined by the user that |
| depend upon MADlib functions. Semantic/API changes to these MADlib |
| functions could lead to undefined results in such user-defined functions |
| |
| - Some important changes for the upgrade from v0.5 to v0.7 are given below |
| (Upgrade will raise an error and abort if there exist user-defined views |
| that depend on these changes. User-defined functions are not validated |
| with this check. An aborted upgrade does not affect the installed version |
| of MADlib.) |
| -- Logistic regression renamed from 'logregr' to 'logregr_train' |
| -- All internal and external aggregates in logistic regression |
| have been updated |
| -- PLDA module replaced with a refactored LDA module. Due to the |
| renaming all functions using PLDA need to be updated |
| -- Updated MADlib types: |
| logregr_result, plda_topics_t, plda_word_distrn, |
| plda_word_weight |
| -------------------------------------------------------------------------------- |
| MADlib v0.6 |
| |
| Release Date: 2013-Apr-01 |
| |
| New Features / Improvements: |
| * Generic cross-validation: |
| - Support for k-fold cross-validation of any supervised learning |
| algorithm |
| * Heteroskedasticity of linear regression |
| - Support for calculating heteroskedasticity via Breusch-Pagan test |
| * Grouping support for linear regression |
| - Support for linear regression on each group of data grouped by |
| one or multiple columns |
| * Grouping support for logistic regression |
| - Refactor of logistic regression code |
| - Support for logistic regression on each group of data grouped by |
| one or multiple columns |
| - Grouping support is added to the convex optimization framework |
| * LDA: |
| - Improved performance and scalability (MADLIB-480) |
| * Elastic net regularization for both linear and logistic regressions |
| - Support FISTA and IGD optimizers |
| * Summary function |
| - Support for an overview of data table |
| * Eigen package upgrade |
| - Now Eigen 3.1.2 is used by MADlib v0.6 |
| * Unit testing framework: |
| - A new unit testing framework is added for C++ abstraction layer |
| |
| Bug Fixes: |
| * C++ abstraction layer: |
| - Improved handling of NULL values in the input array (MADLIB-773) |
| * Naive Bayes: |
| - Improved the handling of NULL values. (MADLIB-749) |
| |
| Known Issues: |
| |
| * K-means: |
| - K-means crashes on some datasets, when the dimensionality of the points |
| is not uniform on the data set. (MADLIB-789) |
| |
| * Distribution Functions: |
| - Certain quantile functions will abort their session on invalid input |
| (MADLIB-786) |
| |
| * Multinomial Logistic Regression: |
| - Signs of coefficient outputs are inconsistent with other tools like R and |
| Stata (MADLIB-785) |
| |
| |
| -------------------------------------------------------------------------------- |
| MADlib v0.5 |
| |
| Release Date: 2012-Nov-15 |
| |
| Bug Fixes: |
| * K-means: |
| - Improved handling of invalid arguments (MADLIB-359, 361) |
| * Sketch-based estimators: |
| - Addressed security vulnerability (MADLIB-630) |
| |
| New Features / Improvements: |
| * Association Rules (Apriori): |
| - Improved reporting output format for better usability (MADLIB-411) |
| - Significant improvement in performance (MADLIB-638) |
| * C++ (Database) Abstraction Layer: |
| - Extension to support modular transition states (MADLIB-499) |
| - Extension to support functions returning set of values (MADLIB-638) |
| * Conditional Random fields: |
| - Support for Linear Chain Conditional Random Fields for NLP (MADLIB-628) |
| * Decision Tree: |
| - Improved performance for C4.5 and Random forests (MADLIB-605) |
| - Improved encoding (MADLIB-590) |
| * Infrastructure: |
| - Convex optimization framework |
| * K-means: |
| - Code refactoring and improved performance |
| (MADLIB-454, MADLIB-522, MADLIB-678) |
| - Silhouette function for k-means (MADLIB-681) |
| * Low-rank Matrix Factorization |
| - New module |
| * Logistic Regression: |
| - Support for Multinomial Logistic Regression (MADLIB-575) |
| * Naive Bayes |
| - Significant improvement in performance (MADLIB-611, 619, 626) |
| * Regression Analysis: |
| - Support for Cox Proportional Hazards test (MADLIB-576) |
| * Sampling |
| - Added weighted sampling of a single row (MADLIB-584) |
| * SVD Matrix Factorization: |
| - Improved performance (MADLIB-578) |
| |
| Documentation: |
| * Conditional Random Fields: |
| - Example added for CRF module (MADLIB-731) |
| * SVD Matrix Factorization: |
| - Incremental-gradient SVD algorithm (MADLIB-572) |
| |
| Known issues: |
| * Multinomial Logistic Regression: |
| - Number of independent variables cannot exceed 65535 (MADLIB-665) |
| * Naive Bayes: |
| - Current implementation of Naive Bayes is only suitable for |
| categorical attributes (MADLIB-679) |
| - NULL input values not accepted for attributes (MADLIB-614) |
| - NULL probabilities given for test set values not seen in |
| training set (MADLIB-523) |
| |
| -------------------------------------------------------------------------------- |
| MADlib v0.4.1 |
| |
| Release Date: 2012-Aug-9 |
| |
| Bug Fixes: |
| * PGXN: |
| - Fixed installation problem that could occur on some platforms (MADLIB-589) |
| |
| New Features/Improvements: |
| * C++ Abstraction Layer: |
| - Increased ABI compatibility across multiple Greenplum versions |
| (MADLIB-606) |
| * Hypothesis Tests: |
| - Tests that are not implemented as ordered aggregates are now also |
| installed on PostgreSQL 8.4 and Greenplum 4.0. |
| |
| -------------------------------------------------------------------------------- |
| MADlib v0.4 |
| |
| Release Date: 2012-Jun-18 |
| |
| Bug Fixes: |
| * Association Rules: |
| - assoc_rules() now uses schema-qualified function calls (MADLIB-435) |
| * Decision Trees: |
| - Enhanced correctness (MADLIB-409, 502, 503) |
| - Improved handling of invalid arguments (MADLIB-331) |
| * k-Means: |
| - Improved handling of invalid arguments (MADLIB-336, 364, 459) |
| * PLDA: |
| - Improved robustness (MADLIB-474) |
| * Sparse Vectors: |
| - svec_sfv() now uses locale-aware sorting (MADLIB-457) |
| - Operators now install to MADlib schema (MADLIB-470) |
| |
| New Features/Improvements: |
| * C++ Abstraction Layer: |
| - Support for "function pointers" (MADLIB-370) |
| - Support for sparse vectors (MADLIB-371) |
| - Support for more Eigen (linear algebra) types (MADLIB-533) |
| * Decision Trees: |
| - Code refactoring and optimization (MADLIB-410, 476, 504, 509) |
| - Documentation improvements (MADLIB-507) |
| - Output table now contains unencoded information (MADLIB-434) |
| - Enhance the missing value handling for continuous features (MADLIB-493) |
| * Hypothesis Tests: |
| - Pearson chi-square test (MADLIB-390) |
| - One- and two-sample t-Tests (MADLIB-391) |
| - F-test (MADLIB-392) |
| - Mann-Whitney U-test (MADLIB-393) |
| - Kolmogorov-Smirnov test (MADLIB-394) |
| - Wilcoxon-Signed-Rank test (MADLIB-405) |
| - One-way ANOVA (MADLIB-406) |
| * PostgreSQL Extensibility: |
| - Support for CREATE EXTENSION in PostgreSQL >= 9.1 (MADLIB-316) |
| - Availability on PGXN (MADLIB-334) |
| * Probability Functions: |
| - Wrap all distribution functions implemented by Boost (MADLIB-412) |
| - Wrap Kolmogorov distribution function from CERN ROOT project (MADLIB-413) |
| * Random Forests: |
| - New module (MADLIB-419) |
| * Support: |
| - Add elementary matrix/vector functions (e.g., norm/distances etc.) |
| (MADLIB-532) |
| * Viterbi Feature Extraction: |
| - New module (MADLIB-478) |
| |
| Known issues: |
| - svec_sfv() does not support collations, as introduced with PostgreSQL 9.1 |
| (MADLIB-558) |
| - Invalid arguments are not always guaranteed to be handled gracefully and |
| may lead to confusing error messages (MADLIB-28, 359, 361, 363) |
| |
| -------------------------------------------------------------------------------- |
| MADlib v0.3 |
| |
| Release Date: 2012-Feb-9 |
| |
| New features: |
| * Installer: |
| - Single installer package targeting all supported DBMSs per OS (MADLIB-218) |
| * C++ Abstraction Layer: |
| - Switched from using Armadillo to using Eigen for linear-algebra |
| operations, thereby eliminating the dependency on LAPACK/BLAS (MADLIB-275) |
| - Reimplemented as a template library for performance improvements |
| (MADLIB-295) |
| * Decision Trees: |
| - Major update |
| - Now supports multiple split criteria (information gain, gini, gain ratio) |
| - Now supports tree pruning using a validation set to address overfitting |
| - Now supports additional functions for tree output |
| - Now supports continuous features in addition to categorical features |
| - Additional support for handling null values |
| - Improved scalability and performance |
| * k-Means Clustering: |
| - Now handles any input that is convertible to SVEC. (MADLIB-42) |
| - Multiple distance functions (L1-norm, L2-norm, cosine similarity, Tanimoto |
| similarity) (MADLIB-43) |
| - Supports multiple seeding methods (kmeans++, random, user-specified list |
| of centroids) |
| - Replaced goodness of fit with the (simplified) Silhouette coefficient |
| (MADLIB-45) |
| - New run-time parameters (MADLIB-47) |
| * Linear Regression: |
| - Major speed improvement |
| * Logistic Regression: |
| - Major speed improvement |
| - Now handles any input that is convertible to BOOLEAN (dependent variable) |
| or DOUBLE PRECISION[] (independent variables). (MADLIB-283) |
| - An under-/overflow safe version to evaluate the (usual) logistic function, |
| for scoring logistic regression (MADLIB-271) |
| - A third optimizer: Incremental-gradient-descent (MADLIB-303) |
| * Support: |
| - For Greenplum <= 4.2.0, added a workaround for INSERT INTO in the same way |
| as the existing CREATE TABLE AS workaround. This workaround is not needed |
| in Greenplum >= 4.2.1 any more. (MADLIB-265) |
| - Function version() returns MADlib build information (MADLIB-309) |
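The under-/overflow-safe evaluation of the logistic function mentioned in the Logistic Regression items above can be sketched in Python as follows. This is a common textbook formulation offered as an illustration, not MADlib's actual code; the function name `logistic` is chosen for this example only.

```python
import math


def logistic(x):
    """Evaluate 1 / (1 + exp(-x)) without overflowing for large |x|.

    Illustrative sketch only. The trick is to call exp() only on a
    non-positive argument, so the result is at most 1 and can never
    overflow; a very negative argument merely underflows to 0.0.
    """
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)           # x < 0, so 0 <= z < 1
    return z / (1.0 + z)
```

A naive `1 / (1 + math.exp(-x))` raises `OverflowError` in Python for strongly negative `x` (for example `x = -1000`), whereas the branch above returns 0.0 in that case.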
| |
| Bug fixes: |
| * Sparse vectors: |
| - Fixed sparse-vector type case problems (MADLIB-282, MADLIB-305) |
| - Fixed a situation where using svec_sfv() could cause a segmentation fault |
| (MADLIB-350) |
| - Increased compatibility with internal PostgreSQL conventions (MADLIB-257) |
| * Logistic regression: |
| - Handle numerical instability more gracefully (MADLIB-343, MADLIB-345) |
| - Handle unexpected inputs more gracefully (MADLIB-284, MADLIB-344) |
| - Fixed "Random variate x is nan, but must be finite" issue (MADLIB-356) |
| |
| Known issues: |
| - Decision Trees not supported on Greenplum 4.0 (MADLIB-346, MADLIB-347) |
| - K-means: the error '"nan" does not exist' may be raised when input vectors |
| contain NaN. (MADLIB-364) |
| - Association Rules require the madlib schema to be in the search path |
| (MADLIB-353) |
| - Invalid arguments are not always guaranteed to be handled gracefully and |
| may lead to confusing error messages (MADLIB-28, 336, 359, 361, 363, 364) |
| |
| -------------------------------------------------------------------------------- |
| MADlib v0.2.1beta |
| |
| Release Date: 2011-Sep-14 |
| |
| General changes: |
| * numerous improvements to the C++ abstraction layer: |
| - code clean-up |
| - fixed issue where incorrect values were returned when used with |
| debug builds of PostgreSQL/Greenplum (MADLIB-253) |
| - fixed issue where returning arrays to PostgreSQL/Greenplum could lead |
| to a crash (MADLIB-250) |
| - allocated memory is now 16-byte aligned for improved stability and |
| performance (MADLIB-236) |
| * compiling with advanced warnings enabled by default now |
| * all C/C++ code now free of warnings. On gcc <= 4.6, there might still be |
| warnings due to "unclean" macros in DBMS header files (MADLIB-228) |
| * prepared for Solaris support in a later release (MADLIB-204) |
| - added support for Sun Compiler in CMake build script |
| - fixed all compilation errors with Sun compiler |
| * added UDF to mimic "CREATE TABLE AS ...", as a workaround for a Greenplum |
| issue (MADLIB-241). Included this as GP Compatibility module. |
| * madpack utility: |
| - dropped madpack dependency on PygreSQL (MADLIB-217) |
| - improved security in madpack install-check (MADLIB-229) |
| - fixed bashism in madpack (MADLIB-222) |
| - fixed install-check not running on non-default schema (MADLIB-251) |
| |
| Modules/methods: |
| * SVM (kernel_machines): |
| - fixed cumulative error count in svm_cls_update() function |
| - improved memory management in SVM module |
| * Linear regression (regress): |
| - fixed unexpected behavior for some edge cases (MADLIB-214) |
| - fixed crashing with huge number of independent vars (MADLIB-250) |
| * Logistic regression (regress): |
| - added support for arbitrary expressions for dep./indep. variables, not |
| just column names (MADLIB-255) |
| * Quantile: |
| - fixed quantile() function to be exact |
| - added simple version for small data sets |
| * Sparse Vectors: |
| - added check for sorted dictionary to svec_sfv (MADLIB-187) |
| * Decision Tree (decision_tree): |
| - now can be run multiple times in one session (MADLIB-156) |
| |
| Known issues: |
| * non-unified API for several SQL UDFs (MADLIB-208) |
| * performance of the conjugate-gradient optimizer in logistic regression |
| can be very poor (MADLIB-164) |
| |
| -------------------------------------------------------------------------------- |
| MADlib v0.2.0beta |
| |
| Release Date: 2011-Jul-8 |
| |
| General changes: |
| * new build and installation framework based on CMake |
| * new C++ abstraction layer for easy and secure method development |
| * new database installation utility (madpack) |
| |
| Modules/methods: |
| * new: Association Rules (assoc_rules) |
| * new: Array Operators (array_ops) |
| * new: Decision Tree (decision_tree) |
| * new: Conjugate Gradient (conjugate_gradient) |
| * new: Parallel LDA (plda) |
| * improved: all methods from previous release |
| |
| Known issues: |
| * non-unified API for several SQL UDFs (MADLIB-208) |
| * running decision tree more than once in one session fails (MADLIB-156) |
| * performance of the conjugate-gradient optimizer in logistic regression |
| can be very poor (MADLIB-164) |
| * svec_sfv function doesn't check for sorted dictionary (MADLIB-187) |
| |
| -------------------------------------------------------------------------------- |
| MADlib v0.1.0alpha |
| |
| Release Date: 2011-Jan-31 |
| |
| Initial release. |
| |
| Included modules/methods: |
| * Naive-Bayes Classification (bayes) |
| * k-Means Clustering (kmeans) |
| * Support Vector Machines (kernel_machines) |
| * Sketch-based Estimators (sketch) |
| * Sketch-based Profile (data_profile) |
| * Quantile (quantile) |
| * Linear & Logistic Regression (regress) |
| * SVD Matrix Factorisation (svdmf) |
| * Sparse Vectors (svec) |
| |
| -------------------------------------------------------------------------------- |
| MADlib v0.1.0prerelease |
| |
| Release Date: 2011-Jan-25 |
| |
| Demo release. |