Apache Hivemall Overview

Apache Hivemall is a scalable machine learning library that runs on Apache Hive, Pig, and Spark. It is designed to scale with both the number of training instances and the number of training features.

Supported Algorithms

Apache Hivemall provides machine learning and feature engineering functions as Hive UDFs, UDAFs, and UDTFs.

Binary Classification

  • Perceptron

  • Passive Aggressive (PA, PA1, PA2)

  • Confidence Weighted (CW)

  • Adaptive Regularization of Weight Vectors (AROW)

  • Soft Confidence Weighted (SCW1, SCW2)

  • AdaGradRDA (w/ hinge loss)

  • Factorization Machine (w/ logistic loss)

My recommendation is AROW, SCW1, AdaGradRDA, or Factorization Machine, though the best choice depends on the data.

Multi-class Classification

  • Perceptron

  • Passive Aggressive (PA, PA1, PA2)

  • Confidence Weighted (CW)

  • Adaptive Regularization of Weight Vectors (AROW)

  • Soft Confidence Weighted (SCW1, SCW2)

  • Random Forest Classifier

  • Gradient Tree Boosting (Experimental)

My recommendation is AROW or SCW, though the best choice depends on the data.

Regression

My recommendation is AROW regression, AdaDelta, or Factorization Machine, though the best choice depends on the data.

Recommendation

k-Nearest Neighbor

  • MinHash (LSH with Jaccard index)

  • b-Bit MinHash

  • Brute-force search using cosine similarity
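
To make the similarity measures listed above concrete, here is a minimal sketch in plain Python. It is not Hivemall code and does not use Hivemall's functions; every name in it (minhash_signature, estimate_jaccard, exact_jaccard, cosine_similarity) is hypothetical and chosen for illustration only. It assumes items are strings and that sparse vectors are stored as feature-to-weight dictionaries. The sketch shows how a MinHash signature approximates the Jaccard index between two sets, and how cosine similarity is computed for a brute-force comparison.

```python
import hashlib
import math


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed (illustrative choice)."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


def minhash_signature(items, num_hashes=256):
    """Return one minimum hash value per seeded hash function.

    `items` must be a non-empty collection of strings.
    """
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; approximates the Jaccard index."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard index |A ∩ B| / |A ∪ B| for comparison with the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as {feature: weight}."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention used here for zero vectors
    return dot / (norm_u * norm_v)
```

Comparing estimate_jaccard(minhash_signature(a), minhash_signature(b)) with exact_jaccard(a, b) on a few small sets shows the trade-off: the signature has a fixed length regardless of set size, and the estimate gets closer to the exact value as num_hashes grows.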

Anomaly Detection

Natural Language Processing

  • English/Japanese Text Tokenizer

Feature engineering

System requirements

  • Hive 0.13 or later

  • Java 7 or later

  • Spark 2.1 or later for Apache Hivemall on Spark

  • Pig 0.15 or later for Apache Hivemall on Pig

See the documentation for more details.