// Licensed to the Apache Software Foundation (ASF) under one or more
// contributor license agreements. See the NOTICE file distributed with
// this work for additional information regarding copyright ownership.
// The ASF licenses this file to You under the Apache License, Version 2.0
// (the "License"); you may not use this file except in compliance with
// the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
= Decision Trees
Decision trees and their ensembles are popular methods for the machine learning tasks of classification and regression. Decision trees are widely used since they are easy to interpret, handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions. Tree ensemble algorithms such as random forests and boosting are among the top performers for classification and regression tasks.
== Overview
Decision trees are a simple yet powerful model in supervised machine learning. The main idea is to split a feature space into regions such that the value within each region varies as little as possible. The measure of how much the values vary within a region is called the impurity of the region.
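For example, in classification the Gini impurity of a region is `1 - Σ p_k^2`, where `p_k` is the proportion of class `k` among the samples in the region, and in regression a common impurity measure is the variance of the target values. The following library-independent sketch (it is not part of the Ignite API) illustrates both measures:
[source, java]
----
// Gini impurity of a region: 1 minus the sum of squared class proportions.
// Zero means the region is pure (a single class); larger values mean more mixing.
static double gini(int[] labels, int numClasses) {
    int[] counts = new int[numClasses];
    for (int lb : labels)
        counts[lb]++;

    double impurity = 1.0;
    for (int cnt : counts) {
        double p = (double)cnt / labels.length;
        impurity -= p * p;
    }
    return impurity;
}

// Variance impurity of a region: mean squared deviation from the region mean.
static double variance(double[] values) {
    double mean = 0.0;
    for (double v : values)
        mean += v / values.length;

    double impurity = 0.0;
    for (double v : values)
        impurity += (v - mean) * (v - mean) / values.length;
    return impurity;
}
----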
Apache Ignite provides an implementation of the algorithm optimized for data stored in rows (see link:machine-learning/partition-based-dataset[Partition Based Dataset]).
Splits are done recursively and every region created from a split can be split further. Therefore, the whole process can be described by a binary tree, where each node is a particular region and its children are the regions derived from it by another split.
Let each sample from the training set belong to some space `S` and let `p_i` be the projection onto the feature with index `i`. Then a split by a continuous feature with index `i` has the form:
image::images/555.gif[]
and a split by categorical feature with values from some set `X` has the form:
image::images/666.gif[]
Here `X_0` is a subset of `X`.
The model works as follows: the split process stops when either the algorithm has reached the configured maximum depth or splitting any region no longer results in a significant impurity decrease. To predict a value for a point `s` from `S`, the tree is traversed down to the leaf that corresponds to the region containing `s`, and the value associated with that leaf is returned.
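For intuition, the following self-contained sketch shows such a traversal. The `Node` class here is hypothetical and is not Ignite's internal representation; it only illustrates how a prediction walks down the binary tree using splits on continuous features:
[source, java]
----
// Hypothetical tree node used only for illustration (not Ignite's internal class).
class Node {
    boolean leaf;       // True if the node is a leaf (a final region).
    int featureIdx;     // Index of the feature used by the split.
    double threshold;   // Split threshold for a continuous feature.
    double value;       // Value associated with the region when the node is a leaf.
    Node left, right;   // Sub-regions produced by the split.
}

// Traverse the tree down to the leaf whose region contains the given point
// and return the value associated with that leaf.
static double predict(Node node, double[] features) {
    while (!node.leaf)
        node = features[node.featureIdx] <= node.threshold ? node.left : node.right;
    return node.value;
}
----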
== Model
The model in decision tree classification is represented by the class `DecisionTreeNode`. We can make a prediction for a given vector of features in the following way:
[source, java]
----
// A model produced by the trainer or restored from storage.
DecisionTreeNode mdl = ...;

// Compute the prediction for a single observation (a vector of features).
double prediction = mdl.apply(observation);
----
The model is a fully independent object, and after training it can be saved, serialized, and restored.
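As a minimal sketch of one way to do that with standard Java serialization (it assumes `DecisionTreeNode` is `java.io.Serializable`; check the export facilities of your Ignite version for the recommended approach):
[source, java]
----
// Save the trained model to a file (assumes DecisionTreeNode is Serializable).
try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("mdl.bin"))) {
    out.writeObject(mdl);
}

// Restore the model later and use it for predictions as before.
try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("mdl.bin"))) {
    DecisionTreeNode restored = (DecisionTreeNode)in.readObject();
    double prediction = restored.apply(observation);
}
----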
== Trainer
The decision tree algorithm can be used for classification or regression, depending on the impurity measure and the node instantiation approach.
=== Classification
The classification decision tree uses the https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity[Gini] impurity measure, and you can use it in the following way:
[source, java]
----
// Create the decision tree classification trainer.
DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(
    4, // Max depth of the tree.
    0  // Min impurity decrease required to keep splitting.
);

// Train the model.
DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);
----
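In the snippet above, `dataCache` is the cache with the training data and `vectorizer` tells the trainer how to extract the features and the label from every cache entry. The exact vectorizer classes depend on the Ignite version and on how the data is stored; as a hedged sketch, assuming each cache value is a `Vector` whose first coordinate is the class label, the setup could look like this:
[source, java]
----
// Cache with the training data; filling it is omitted here. Each value is assumed
// to be a Vector whose first coordinate is the class label (an assumption of this sketch).
IgniteCache<Integer, Vector> dataCache = ignite.getOrCreateCache("TRAINING_DATA");

// Treat the first coordinate of every vector as the label and the rest as features.
Vectorizer<Integer, Vector, Integer, Double> vectorizer =
    new DummyVectorizer<Integer>().labeled(Vectorizer.LabelCoordinate.FIRST);
----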
== Examples
To see how the Decision Tree can be used in practice, try this https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/ml/tree/DecisionTreeClassificationTrainerExample.java[classification example] that is available on GitHub and delivered with every Apache Ignite distribution.