\begin{comment}
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
\end{comment}
\subsection{Decision Trees}
\label{sec:decision_trees}
\noindent{\bf Description}
\smallskip
A decision tree (for classification) is a classifier that is considered
more interpretable than other statistical classifiers. This implementation
is well suited to handling large-scale data and builds a (binary) decision
tree in parallel.\\
\smallskip
\noindent{\bf Usage}
\smallskip
{\hangindent=\parindent\noindent\it%
{\tt{}-f }path/\/{\tt{}decision-tree.dml}
{\tt{} -nvargs}
{\tt{} X=}path/file
{\tt{} Y=}path/file
{\tt{} R=}path/file
{\tt{} bins=}integer
{\tt{} depth=}integer
{\tt{} num\_leaf=}integer
{\tt{} num\_samples=}integer
{\tt{} impurity=}Gini$\mid$entropy
{\tt{} M=}path/file
{\tt{} O=}path/file
{\tt{} S\_map=}path/file
{\tt{} C\_map=}path/file
{\tt{} fmt=}format
}
\smallskip
\noindent{\bf Usage: Prediction}
\smallskip
{\hangindent=\parindent\noindent\it%
{\tt{}-f }path/\/{\tt{}decision-tree-predict.dml}
{\tt{} -nvargs}
{\tt{} X=}path/file
{\tt{} Y=}path/file
{\tt{} R=}path/file
{\tt{} M=}path/file
{\tt{} P=}path/file
{\tt{} A=}path/file
{\tt{} CM=}path/file
{\tt{} fmt=}format
}\smallskip
\noindent{\bf Arguments}
\begin{Description}
\item[{\tt X}:]
Location (on HDFS) to read the matrix of feature vectors;
each row constitutes one feature vector. Note that categorical features in $X$ need to be both recoded and dummy coded.
\item[{\tt Y}:]
Location (on HDFS) to read the matrix of (categorical)
labels that correspond to feature vectors in $X$. Note that class labels are assumed to be both recoded and dummy coded.
This argument is optional for prediction.
\item[{\tt R}:] (default:\mbox{ }{\tt " "})
Location (on HDFS) to read matrix $R$ which, for each feature in $X$, contains column-ids (first column), start indices (second column), and end indices (third column).
If $R$ is not provided, all features are assumed to be continuous-valued by default.
\item[{\tt bins}:] (default:\mbox{ }{\tt 20})
Number of thresholds to choose for each continuous-valued feature (determined by equi-height binning).
\item[{\tt depth}:] (default:\mbox{ }{\tt 25})
Maximum depth of the learned tree.
\item[{\tt num\_leaf}:] (default:\mbox{ }{\tt 10})
Parameter that controls pruning. The tree
is not expanded if a node receives fewer than {\tt num\_leaf} training examples.
\item[{\tt num\_samples}:] (default:\mbox{ }{\tt 3000})
Parameter that decides when to switch to in-memory building of subtrees. If a node $v$ receives fewer than {\tt num\_samples}
training examples, then this implementation switches to an in-memory procedure
to build the subtree under $v$ in its entirety.
\item[{\tt impurity}:] (default:\mbox{ }{\tt "Gini"})
Impurity measure used at internal nodes of the tree for selecting which features to split on. Possible values are {\tt entropy} or {\tt Gini}.
\item[{\tt M}:]
Location (on HDFS) to write matrix $M$ containing the learned decision tree (see below for the schema).
\item[{\tt O}:] (default:\mbox{ }{\tt " "})
Location (on HDFS) to store the training accuracy (\%). Note that this argument is optional.
\item[{\tt A}:] (default:\mbox{ }{\tt " "})
Location (on HDFS) to store the testing accuracy (\%) from a
held-out test set during prediction. Note that this argument is optional.
\item[{\tt P}:]
Location (on HDFS) to store predictions for a held-out test set.
\item[{\tt CM}:] (default:\mbox{ }{\tt " "})
Location (on HDFS) to store the confusion matrix computed using a held-out test set. Note that this argument is optional.
\item[{\tt S\_map}:] (default:\mbox{ }{\tt " "})
Location (on HDFS) to write the mappings from the continuous-valued feature-ids to the global feature-ids in $X$ (see below for details). Note that this argument is optional.
\item[{\tt C\_map}:] (default:\mbox{ }{\tt " "})
Location (on HDFS) to write the mappings from the categorical feature-ids to the global feature-ids in $X$ (see below for details). Note that this argument is optional.
\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
see read/write functions in SystemML Language Reference for details.
\end{Description}
\noindent{\bf Details}
\smallskip
Decision trees~\cite{BreimanFOS84:dtree} are simple models of
classification that, due to their structure, are easy to
interpret. Given an example feature vector, each node in the learned
tree runs a simple test on it. Based on the result of the test, the
example is either diverted to the left subtree or to the right
subtree. Once the example reaches a leaf, the label stored at the
leaf is returned as the prediction for the example.
Building a decision tree from a fully labeled training set entails
choosing an appropriate splitting test for each internal node in the tree, and this is usually performed in a top-down manner.
The splitting test (denoted by $s$) requires
first choosing a feature $j$ and, depending on the type of $j$, either
a threshold $\sigma$ if $j$ is continuous-valued, or a subset of
values $S \subseteq \text{Dom}(j)$, where $\text{Dom}(j)$ denotes the
domain of $j$, if it is categorical. For continuous-valued
features the test is thus of the form $x_j < \sigma$, and for categorical
features it is of the form $x_j \in S$, where $x_j$ denotes the $j$th
feature value of feature vector $x$. One way to determine which test
to use is to compare the impurities of the tree nodes induced by the test.
The {\it node impurity} measures the homogeneity of the labels at the node. This implementation supports two commonly used impurity measures (denoted by $\mathcal{I}$): {\it Entropy} $\mathcal{E}=\sum_{i=1}^{C}-f_i \log f_i$, as well as {\it Gini impurity} $\mathcal{G}=\sum_{i=1}^{C}f_i (1-f_i)$, where $C$ denotes the number of unique labels and $f_i$ is the frequency of label $i$.
Once the impurity at the tree nodes has been obtained, the {\it best split} is chosen from a set of possible splits as the one that maximizes the {\it information gain} at the node, i.e., $\arg\max_{s}\mathcal{IG}(X,s)$, where $\mathcal{IG}(X,s)$ denotes the information gain when the splitting test $s$ partitions the feature matrix $X$.
Assuming that $s$ partitions $X$, which contains $N$ feature vectors, into $X_\text{left}$ and $X_\text{right}$ containing $N_\text{left}$ and $N_\text{right}$ feature vectors, respectively, $\mathcal{IG}(X,s)$ is given by
\begin{equation*}
\mathcal{IG}(X,s)=\mathcal{I}(X)-\frac{N_\text{left}}{N}\mathcal{I}(X_\text{left})-\frac{N_\text{right}}{N}\mathcal{I}(X_\text{right}),
\end{equation*}
where $\mathcal{I}\in\{\mathcal{E},\mathcal{G}\}$.
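To make these quantities concrete, the following short sketch (written in Python with NumPy, not part of the provided DML scripts) computes the impurity of a label vector and the information gain of a candidate split; all function and variable names are illustrative.
\begin{verbatim}
import numpy as np

def impurity(y, measure="Gini"):
    """Entropy E = sum_i -f_i*log(f_i) or Gini G = sum_i f_i*(1 - f_i)."""
    _, counts = np.unique(y, return_counts=True)
    f = counts / counts.sum()               # label frequencies f_i
    if measure == "entropy":
        return float(-(f * np.log(f)).sum())
    return float((f * (1.0 - f)).sum())

def information_gain(y, left_mask, measure="Gini"):
    """IG(X, s) for a split s sending rows with left_mask==True to the left."""
    n, n_left = len(y), int(left_mask.sum())
    if n_left in (0, n):                    # degenerate split: nothing gained
        return 0.0
    return (impurity(y, measure)
            - n_left / n * impurity(y[left_mask], measure)
            - (n - n_left) / n * impurity(y[~left_mask], measure))

# Example: gain of the test x_3 < 0.45 on a toy label vector
y  = np.array([1, 1, 1, 2, 2])
x3 = np.array([0.1, 0.2, 0.4, 0.6, 0.9])
print(information_gain(y, x3 < 0.45))       # perfect split: gain = I(X) = 0.48
\end{verbatim}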
In the following we discuss the implementation details specific to {\tt decision-tree.dml}.

\textbf{Input format.}
In general, implementations of the decision tree algorithm do not require categorical features to be dummy coded. However, for improved efficiency and reduced training time, our implementation assumes dummy coded categorical features and dummy coded class labels.
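As a concrete illustration of this preprocessing (which is performed outside of {\tt decision-tree.dml}, with any tool of choice), the sketch below (Python with NumPy) recodes one categorical column and dummy codes it into one 0/1 column per category value; the data and names are purely illustrative.
\begin{verbatim}
import numpy as np

# One raw categorical feature: recode each category to an integer id,
# then dummy code it into |Dom(j)| 0/1 columns (one per category value).
raw = np.array(["low", "high", "medium", "high"])
dom, recoded = np.unique(raw, return_inverse=True)   # ids 0..|Dom(j)|-1
X_dummy = np.eye(len(dom), dtype=int)[recoded]       # one-hot rows
print(dom)        # column order of the dummy-coded block
print(X_dummy)    # every row contains exactly one 1
# The matrix R then records, for this feature, the indices of its first
# and last dummy-coded column inside the full feature matrix X.
\end{verbatim}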

\textbf{Tree construction.}
Learning a decision tree on large-scale data has received some
attention in the literature. The current implementation includes logic
for choosing tests for multiple nodes that belong to the same level in
the decision tree in parallel (breadth-first expansion) and for
building entire subtrees under multiple nodes in parallel (depth-first
subtree building). Empirically it has been demonstrated that it is
advantageous to perform breadth-first expansion for the nodes
belonging to the top levels of the tree and to perform depth-first
subtree building for nodes belonging to the lower levels of the tree~\cite{PandaHBB09:dtree}. The parameter {\tt num\_samples} controls when we
switch to depth-first subtree building. For any node in the decision tree
that receives $\leq$ {\tt num\_samples} training examples, the subtree
under it is built in its entirety in one shot.
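The sketch below (plain Python with NumPy on a single continuous-valued feature; it is not the distributed DML implementation) illustrates the control flow of this hybrid strategy: large nodes are kept on a breadth-first frontier, while any node holding at most {\tt num\_samples} examples is handed to an in-memory, depth-first subtree builder. All function and variable names are illustrative.
\begin{verbatim}
from collections import deque
import numpy as np

def gini(y):
    f = np.bincount(y) / len(y)
    return float((f * (1.0 - f)).sum())

def best_split(x, y):
    """Best threshold on a single feature by Gini gain; None if no gain."""
    def gain(t):
        m = x < t
        return gini(y) - m.mean() * gini(y[m]) - (~m).mean() * gini(y[~m])
    scored = [(gain(t), t) for t in np.unique(x)[1:]]
    g, t = max(scored, default=(0.0, None))
    return t if g > 0 else None

def build_depth_first(x, y, d, max_depth, num_leaf):
    """In-memory subtree building, used once a node is small enough."""
    t = None if (d >= max_depth or len(y) <= num_leaf) else best_split(x, y)
    if t is None:
        return {"label": int(np.bincount(y).argmax())}
    m = x < t
    return {"thr": float(t),
            "left":  build_depth_first(x[m],  y[m],  d + 1, max_depth, num_leaf),
            "right": build_depth_first(x[~m], y[~m], d + 1, max_depth, num_leaf)}

def build_hybrid(x, y, max_depth=25, num_leaf=10, num_samples=3000):
    """Breadth-first expansion of large nodes, depth-first below num_samples."""
    root = {}
    frontier = deque([(root, np.arange(len(y)), 0)])
    while frontier:
        node, idx, d = frontier.popleft()  # the DML script handles a whole level in parallel
        if len(idx) <= num_samples:        # switch to in-memory subtree building
            node.update(build_depth_first(x[idx], y[idx], d, max_depth, num_leaf))
            continue
        t = None if (d >= max_depth or len(idx) <= num_leaf) else best_split(x[idx], y[idx])
        if t is None:
            node["label"] = int(np.bincount(y[idx]).argmax())
            continue
        m = x[idx] < t
        node["thr"], node["left"], node["right"] = float(t), {}, {}
        frontier.append((node["left"],  idx[m],  d + 1))
        frontier.append((node["right"], idx[~m], d + 1))
    return root

rng = np.random.default_rng(0)
x = rng.uniform(size=2_000)
y = (x > 0.45).astype(int) + 1             # labels 1 and 2
print(build_hybrid(x, y, num_samples=500))
\end{verbatim}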

\textbf{Stopping rule and pruning.}
The splitting of data at the internal nodes stops when at least one of the following criteria is satisfied:
\begin{itemize}
\item the depth of the internal node reaches the input parameter {\tt depth} controlling the maximum depth of the learned tree, or
\item no candidate split achieves information gain.
\end{itemize}
This implementation also allows for some automated pruning via the argument {\tt num\_leaf}. If
a node receives $\leq$ {\tt num\_leaf} training examples, then a leaf
is built in its place.
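Restated as simple predicates (illustrative Python, not the DML code), the stopping and pruning rules read as follows.
\begin{verbatim}
def stop_splitting(depth, max_depth, best_gain):
    """Stop expanding a node at the maximum depth or when no candidate
    split yields a positive information gain."""
    return depth >= max_depth or best_gain <= 0.0

def prune_to_leaf(n_examples, num_leaf):
    """A node with at most num_leaf training examples becomes a leaf."""
    return n_examples <= num_leaf
\end{verbatim}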

\textbf{Continuous-valued features.}
For a continuous-valued feature $j$, the number of candidate thresholds $\sigma$ to choose from is of
the order of the number of examples in the training set. Since
for large-scale data this can result in a very large number of candidate
thresholds, the user can limit this number via the argument {\tt bins}, which controls the number of candidate thresholds considered
for each continuous-valued feature. For each such feature, the implementation computes an equi-height histogram and
generates one candidate threshold per equi-height bin.
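A minimal sketch of this idea (Python with NumPy, not the DML implementation): equi-height binning amounts to taking interior quantiles of the feature, so that each bin holds roughly the same number of training examples; names are illustrative.
\begin{verbatim}
import numpy as np

def candidate_thresholds(x, bins=20):
    """One candidate threshold per equi-height bin boundary."""
    qs = np.linspace(0.0, 1.0, bins + 1)[1:-1]   # interior quantiles
    return np.unique(np.quantile(x, qs))         # drop duplicate boundaries

x = np.random.default_rng(0).exponential(size=10_000)
print(candidate_thresholds(x, bins=20))          # at most bins-1 thresholds
\end{verbatim}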

\textbf{Categorical features.}
In order to determine the best value subset to split on in the case of categorical features, this implementation greedily includes values from the feature's domain until the information gain stops improving.
In particular, for a categorical feature $j$ the $|\text{Dom}(j)|$ feature values are sorted by impurity and the resulting $|\text{Dom}(j)|-1$ split candidates are examined; the subset of feature values which results in the maximum information gain is then selected.
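The following sketch (Python with NumPy, single categorical feature, Gini impurity; not the DML implementation) illustrates this selection: the feature values are ordered by the impurity of the examples carrying them, and the $|\text{Dom}(j)|-1$ prefix subsets of that ordering are evaluated; names and data are illustrative.
\begin{verbatim}
import numpy as np

def gini(y):
    f = np.bincount(y) / len(y)
    return float((f * (1.0 - f)).sum())

def best_value_subset(xj, y):
    """Evaluate the |Dom(j)|-1 prefix subsets of the impurity-sorted values."""
    order = sorted(np.unique(xj), key=lambda v: gini(y[xj == v]))
    best_subset, best_gain = None, 0.0
    for k in range(1, len(order)):
        m = np.isin(xj, order[:k])
        gain = gini(y) - m.mean() * gini(y[m]) - (~m).mean() * gini(y[~m])
        if gain > best_gain:
            best_subset, best_gain = {int(v) for v in order[:k]}, gain
    return best_subset, best_gain

xj = np.array([1, 1, 2, 2, 3, 3, 3])   # recoded values of one categorical feature
y  = np.array([1, 1, 1, 2, 2, 2, 2])   # class labels
print(best_value_subset(xj, y))        # here: ({1}, ...), i.e. split on x_j in {1}
\end{verbatim}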

\textbf{Description of the model.}
The learned decision tree is represented by a matrix $M$ that
contains at least 6 rows. Each column in the matrix contains the parameters relevant to a single node in the tree.
Note that for building the tree model, our implementation splits the feature matrix $X$ into $X_\text{cont}$ containing continuous-valued features and $X_\text{cat}$ containing categorical features. In the following, the continuous-valued (resp. categorical) feature-ids correspond to the indices of the features in $X_\text{cont}$ (resp. $X_\text{cat}$).
Moreover, we refer to an internal node as a continuous-valued (categorical) node if the feature that this node looks at is continuous-valued (categorical).
Below is a description of what each row in the matrix contains.
\begin{itemize}
\item Row 1: stores the node-ids. These ids correspond to the node-ids in a complete binary tree.
\item Row 2: for internal nodes, stores the offset (in number of columns) in $M$ to the left child, and 0 otherwise.
\item Row 3: stores the index of the feature that this node looks at (the id of a continuous-valued feature in $X_\text{cont}$ if the feature is continuous-valued, or the id of a categorical feature in $X_\text{cat}$ if it is categorical) if the node is an internal node, and 0 otherwise.
\item Row 4: stores the type of the feature that this node looks at if the node is an internal node (1 for continuous-valued and 2 for categorical features); otherwise, it stores the label this leaf node is supposed to predict.
\item Row 5: for the internal nodes, contains 1 if the feature chosen for the node is continuous-valued, or the size of the subset of values used for splitting at the node (stored in rows 6,7,$\ldots$) if the feature chosen for the node is categorical. For the leaf nodes, Row 5 contains the number of misclassified training examples reaching this node.
\item Rows 6,7,$\ldots$: for the internal nodes, row 6 stores the threshold to which the example's feature value is compared if the feature chosen for this node is continuous-valued; otherwise, if the feature chosen for this node is categorical, rows 6,7,$\ldots$ store the value subset chosen for the node.
For the leaf nodes, row 6 contains 1 if the node is impure and the number of training examples at the node is greater than {\tt num\_leaf}, and 0 otherwise.
\end{itemize}
As an example, Figure~\ref{dtree} shows a decision tree with $5$ nodes and its matrix
representation.
\begin{figure}
\begin{minipage}{0.3\linewidth}
\begin{center}
\begin{tikzpicture}
\node (labelleft) [draw,shape=circle,minimum size=16pt] at (2,0) {$2$};
\node (labelright) [draw,shape=circle,minimum size=16pt] at (3.25,0) {$1$};
\node (rootleft) [draw,shape=rectangle,minimum size=16pt] at (2.5,1) {$x_5 \in \{2,3\}$};
\node (rootlabel) [draw,shape=circle,minimum size=16pt] at (0.9,1) {$1$};
\node (root) [draw,shape=rectangle,minimum size=16pt] at (1.75,2) {$x_3 < 0.45$};
\draw[-latex] (root) -- (rootleft);
\draw[-latex] (root) -- (rootlabel);
\draw[-latex] (rootleft) -- (labelleft);
\draw[-latex] (rootleft) -- (labelright);
\end{tikzpicture}
\end{center}
\begin{center}
(a)
\end{center}
\end{minipage}
\hfill
\begin{minipage}{0.65\linewidth}
\begin{center}
\begin{tabular}{c|c|c|c|c|c|}
& Col 1 & Col 2 & Col 3 & Col 4 & Col 5\\
\hline
Row 1 & 1 & 2 & 3 & 6 & 7 \\
\hline
Row 2 & 1 & 0 & 1 & 0 & 0 \\
\hline
Row 3 & 3 & 0 & 5 & 0 & 0 \\
\hline
Row 4 & 1 & 1 & 2 & 2 & 1 \\
\hline
Row 5 & 1 & 0 & 2 & 0 & 0 \\
\hline
Row 6 & 0.45 & 0 & 2 & 0 & 0 \\
\hline
Row 7 & & & 3 & & \\
\hline
\end{tabular}
\end{center}
\begin{center}
(b)
\end{center}
\end{minipage}
\caption{(a) An example tree and its (b) matrix representation. $x$ denotes an example and $x_j$ is the value of the $j$th continuous-valued (resp. categorical) feature in $X_\text{cont}$ (resp. $X_\text{cat}$). In this example all leaf nodes are pure and no training example is misclassified.}
\label{dtree}
\end{figure}
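To illustrate how this representation is used at prediction time, the following sketch (Python with NumPy, not the {\tt decision-tree-predict.dml} script) routes single examples through the matrix of Figure~\ref{dtree}(b). It assumes the usual convention that an example satisfying the test is sent to the left child, and that the right child is stored in the column immediately after the left child, as in the example.
\begin{verbatim}
import numpy as np

# The matrix M from Figure (b); unused entries are set to 0.
M = np.array([
    [1,    2, 3, 6, 7],    # Row 1: node-ids
    [1,    0, 1, 0, 0],    # Row 2: column offset to the left child (0 = leaf)
    [3,    0, 5, 0, 0],    # Row 3: feature-id in X_cont resp. X_cat
    [1,    1, 2, 2, 1],    # Row 4: feature type for internal nodes, label for leaves
    [1,    0, 2, 0, 0],    # Row 5: 1 (continuous) or value-subset size (categorical)
    [0.45, 0, 2, 0, 0],    # Row 6: threshold resp. first subset value
    [0,    0, 3, 0, 0],    # Row 7: further subset values
])

def predict(M, x_cont, x_cat):
    c = 0                                  # the root is stored in the first column
    while M[1, c] != 0:                    # nonzero offset in Row 2: internal node
        j = int(M[2, c])                   # feature-id (1-based)
        if M[3, c] == 1:                   # continuous-valued test x_j < sigma
            go_left = x_cont[j - 1] < M[5, c]
        else:                              # categorical test x_j in S
            k = int(M[4, c])               # size of the value subset (Row 5)
            go_left = x_cat[j - 1] in M[5:5 + k, c]
        left = c + int(M[1, c])
        c = left if go_left else left + 1  # right child assumed next to the left one
    return int(M[3, c])                    # Row 4 holds the leaf's label

print(predict(M, x_cont=[0, 0, 0.3, 0, 0], x_cat=[0, 0, 0, 0, 2]))  # -> 1
print(predict(M, x_cont=[0, 0, 0.6, 0, 0], x_cat=[0, 0, 0, 0, 2]))  # -> 2
\end{verbatim}
The first example satisfies the root test $x_3<0.45$ and reaches the leaf with label 1; the second fails it, then satisfies $x_5\in\{2,3\}$ and reaches the leaf with label 2.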
\smallskip
\noindent{\bf Returns}
\smallskip
The matrix corresponding to the learned model, as well as the training accuracy (if requested), is written to a file in the specified format. See the {\bf Details} section above for a description of the structure of the model matrix.
Recall that in our implementation $X$ is split into $X_\text{cont}$ and $X_\text{cat}$. If requested, the mappings of the continuous-valued feature-ids in $X_\text{cont}$ (stored at {\tt S\_map}) and the categorical feature-ids in $X_\text{cat}$ (stored at {\tt C\_map}) to the global feature-ids in $X$ will be provided.
Depending on what arguments are provided during
invocation, the {\tt decision-tree-predict.dml} script may compute one or more of the predictions, the accuracy, and the confusion matrix in the requested output format.
\smallskip
\noindent{\bf Examples}
\smallskip
{\hangindent=\parindent\noindent\tt
\hml -f decision-tree.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx
R=/user/biadmin/R.csv M=/user/biadmin/model.csv
bins=20 depth=25 num\_leaf=10 num\_samples=3000 impurity=Gini fmt=csv
}\smallskip
\noindent To compute predictions:
{\hangindent=\parindent\noindent\tt
\hml -f decision-tree-predict.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx R=/user/biadmin/R.csv
M=/user/biadmin/model.csv P=/user/biadmin/predictions.csv
A=/user/biadmin/accuracy.csv CM=/user/biadmin/confusion.csv fmt=csv
}\smallskip
%\noindent{\bf References}
%
%\begin{itemize}
%\item B. Panda, J. Herbach, S. Basu, and R. Bayardo. \newblock{PLANET: massively parallel learning of tree ensembles with MapReduce}. In Proceedings of the VLDB Endowment, 2009.
%\item L. Breiman, J. Friedman, R. Olshen, and C. Stone. \newblock{Classification and Regression Trees}. Wadsworth and Brooks, 1984.
%\end{itemize}