% When using TeXShop on the Mac, let it know the root document. The following must be one of the first 20 lines.
% !TEX root = ../design.tex
\chapter[Decision trees]{Decision Trees: Classification and Regression}
\begin{moduleinfo}
\item[Authors] {Rahul Iyer and Liquan Pei}
\item[History]
\begin{modulehistory}
\item[v0.2] Parallelism
\item[v0.1] Initial version: introduction, theory, and interface
\end{modulehistory}
\end{moduleinfo}
% Abstract. What is the problem we want to solve?
\section{Introduction} % (fold)
\label{sec:introduction}
\emph{Notes, examples and figures in this section are borrowed from
\cite{hastie2008statisticallearning} and \cite{james2013statisticallearning}}.
Tree-based methods for regression and classification involve stratifying or
segmenting the predictor space into a number of simple regions. In order to make
a prediction for a given observation, we typically use the mean or the mode of
the training observations in the region to which it belongs. Since the set of
splitting rules used to segment the predictor space can be summarized in a tree,
these types of approaches are known as decision tree methods. Tree-based methods
are simple and useful for interpretation. However, they typically are not
competitive with the best supervised learning approaches, in terms of
prediction accuracy. The results from multiple decision trees are usually combined
to yield a single consensus prediction, often resulting in dramatic improvements
in prediction accuracy at the expense of some loss in interpretability.
\subsection{Basics of Decision Trees} % (fold)
\label{sub:basics_of_decision_trees}
In order to motivate tree methods, we begin with a simple example.
\begin{figure}[h!tb]
\begin{center}
\includegraphics[width=0.8\textwidth]{figures/basics_decision_tree}
\end{center}
\caption{(Figure from \cite{hastie2008statisticallearning}.)
Top right panel shows a partition of a two-dimensional
feature space by recursive binary splitting, applied to
some arbitrary data. Top left panel shows a general partition that cannot be
obtained from recursive binary splitting. Bottom left panel shows the tree
corresponding to the partition in the top right panel, and a perspective
plot of the prediction surface appears in the bottom right panel.}
\label{fig:basics_decision_tree}
\end{figure}
Let's consider a regression problem with continuous response $Y$ and inputs
$X_1$ and $X_2$, each taking values in the unit interval. The top left panel of
Figure~\ref{fig:basics_decision_tree} shows a partition of the feature space by
lines that are parallel to the coordinate axes. In each partition element we can
model $Y$ with a different constant. However, there is a problem: although each
partitioning line has a simple description like $X_1 = c$, some of the resulting
regions are complicated to describe.
To simplify matters, we restrict attention to recursive binary partitions like
that in the top right panel of Figure~\ref{fig:basics_decision_tree}. We first
split the space into two regions, and model the response by the mean of $Y$ in
each region. We choose the variable and split-point to achieve the best fit.
Then one or both of these regions are split into two more regions, and this
process is continued, until some stopping rule is applied. For example, in the
top right panel of Figure~\ref{fig:basics_decision_tree}, we first split at
$X_1 = t_1$. Then the region $X_1 \le t_1$ is split at $X_2 = t_2$ and the region
$X_1 > t_1$ is split at $X_1 = t_3$. Finally, the region $X_1 > t_3$ is split at
$X_2 = t_4$. The result of this process is a partition into the five
regions $R_1, R_2, \ldots, R_5$ shown in the figure. The corresponding regression
model predicts $Y$ with a constant $c_m$ in region $R_m$, that is,
$$
f(X) = \sum_{m=1}^{5} c_m\, I\{(X_1, X_2) \in R_m\}.
$$
This same model can be represented by the binary tree in the bottom left panel
of Figure~\ref{fig:basics_decision_tree}. The full dataset sits at the top of
the tree. Observations satisfying the condition at each junction are assigned to
the left branch, and the others to the right branch. The terminal nodes or
leaves of the tree correspond to the regions $R_1, R_2, \ldots, R_5$. The bottom
right panel is a perspective plot of the regression surface from this model.
\paragraph{Prediction via Stratification of the Feature Space:}
We now discuss the process of building a regression tree. Roughly speaking,
there are two steps.
\begin{enumerate}
\item We divide the predictor space, that is, the set of possible values for
$X_1, X_2, \ldots, X_p$, into $J$ distinct and non-overlapping regions, $R_1, R_2,
\ldots, R_J$.
\item For every observation that falls into the region $R_j$, we make the same
prediction, which is simply the mean of the response values for the training
observations in $R_j$.
\end{enumerate}
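As a small numerical illustration of step 2 (the values here are made up), suppose
$J = 2$, the training responses falling in $R_1$ are $\{4, 6, 8\}$ and those in
$R_2$ are $\{1, 3\}$. The fitted constants are the region means,
$$
c_1 = \frac{4 + 6 + 8}{3} = 6, \qquad c_2 = \frac{1 + 3}{2} = 2,
$$
so every new observation falling in $R_1$ is predicted as $6$ and every one
falling in $R_2$ as $2$.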
\paragraph{Advantages of using tree-based models:}
\begin{enumerate}
\item Trees are very easy to explain to people.
\item Some believe that decision trees more closely mirror human decision-making.
\item Trees can be displayed graphically, and are easily interpreted even by a non-expert.
\item Trees can easily handle qualitative predictors without the need to create dummy variables.
\end{enumerate}
% subsection basics_of_decision_trees (end)
\subsection{Trees Versus Linear Models} % (fold)
\label{sub:trees_versus_linear_models}
\begin{figure}[h!tb]
\begin{center}
\includegraphics[width=0.7\textwidth]{figures/decision_tree_linear_model}
\end{center}
\caption{(Figure from \cite{james2013statisticallearning}.)
Top Row: A two-dimensional classification example in which the true decision
boundary is linear, and is indicated by the shaded regions. A classical approach
that assumes a linear boundary (left) will outperform a decision tree that
performs splits parallel to the axes (right). Bottom Row: Here the true decision
boundary is non-linear. A linear model is unable to capture the true
decision boundary (left), whereas a decision tree is successful (right).}
\label{fig:decision_trees_linear_models}
\end{figure}
Regression and classification trees have a very different flavor from the more
classical approaches for regression presented in Chapter~\ref{chap:regression}.
In particular, linear regression assumes a model of the form
$$
f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j,
$$
whereas regression trees assume a model of the form
$$
f(X) = \sum_{m=1}^{M} c_m \cdot 1_{X \in R_m}
$$
where $R_1, \ldots, R_M$ represent a partition of feature space.
The efficacy of the two models depends on the problem at hand, as illustrated by
Figure~\ref{fig:decision_trees_linear_models}.
% subsection trees_versus_linear_models (end)
% section introduction (end)
\section{Interface} % (fold)
\label{sec:interface}
The interface of the decision tree functionality includes two functions,
\texttt{tree\_train} and \texttt{tree\_predict}, described below.
\subsection{Training} % (fold)
\label{sub:training}
\begin{sql}
SELECT tree_train(
training_table_name,
output_table_name,
id_col_name,
dependent_variable,
list_of_features,
list_of_features_to_exclude,
split_criterion,
grouping_cols,
weights,
max_depth,
min_split,
min_bucket,
surrogate_params, -- not implemented
pruning_params,
verbose
)
\end{sql}
\paragraph{Arguments:}
\begin{itemize}
% \item \emph{training\_algorithm}: Can be either `c4.5' or `cart'
\item \emph{training\_table\_name}: Name of the table containing data.
\item \emph{output\_table\_name}: Name of the table to output the model.
\item \emph{id\_col\_name}: Name of column containing the id information in training data.
\item \emph{dependent\_variable}: Name of the column that contains the
output for training. Boolean, integer and text are considered classification
outputs, while float values are considered regression outputs.
\item \emph{list\_of\_features}: List of column names (comma-separated string)
to use as predictors. Can also be a `*' implying all columns are to be used
as predictors (except the ones included in the next argument). Boolean, integer, and
text columns are considered categorical columns.
\item \emph{list\_of\_features\_to\_exclude}: OPTIONAL. List of column names
(comma-separated string) to exclude from the predictors list.
\item \emph{split\_criterion}: OPTIONAL (Default = `gini'). Criterion used to
select the feature to split a node. Available options are `gini',
`cross-entropy', and `misclassification' for classification.
For regression trees, this argument is ignored and the
split\_criterion is always set to `mse' (mean-squared error).
\item \emph{grouping\_cols}: OPTIONAL. List of column names (comma-separated string) to group
the data by. This will lead to creating multiple decision trees, one for each group.
\item \emph{weights}: OPTIONAL. Column name containing weights for each observation.
\item \emph{max\_depth}: OPTIONAL (Default = 10). Set the maximum depth of
any node of the final tree, with the root node counted as depth 0.
\item \emph{min\_split}: OPTIONAL (Default = 20). Minimum number of observations that
must exist in a node for a split to be attempted.
\item \emph{min\_bucket}: OPTIONAL (Default = min\_split/3). Minimum number of observations
in any terminal node. If only one of min\_bucket or min\_split is specified,
min\_split is set to min\_bucket*3 or min\_bucket to min\_split/3, as appropriate.
% \item \emph{max\_compete}: OPTIONAL (Default = 0) Number of competitor splits
% retained in the output. It is useful to know not just which split was chosen,
% but which variable came in second, third, etc.
\item \emph{n\_bins}: OPTIONAL (Default = 100). Number of bins to use during binning.
Continuous-valued features are binned into discrete bins (per the quantile values)
to compute split boundaries. This global parameter is used to compute the resolution of the bins. A higher number of bins leads to a higher processing time.
\item \emph{surrogate\_params}: (Not implemented) OPTIONAL (String with multiple key-value parameters, default = NULL)
\begin{enumerate}
\item \emph{max\_surrogate}: OPTIONAL (Default = 0)
The number of surrogate splits retained in the output.
If this is set to zero the compute time will be reduced by half.
% \item \emph{use\_surrogate}: OPTIONAL (Default = 1) Use of surrogates in splitting process.
% \begin{enumerate}
% \setcounter{enumi}{0}
% \item Display only, an observation with a missing value for the primary split rule
% is not sent further down the tree.
% \item Use surrogates to split subjects missing the primary variable; if all surrogates are missing the observation is not split.
% \item If all surrogates are missing, then send the observation in the majority direction.
% \end{enumerate}
% \item \emph{surrogate\_style}: OPTIONAL (Default = 0)
% \begin{enumerate}
% \setcounter{enumi}{0}
% \item Use total number of correct classification for a potential surrogate variable
% \item Use the percent correct, calculated over the non-missing values of the surrogate
% \end{enumerate}
\end{enumerate}
\item \emph{pruning\_params}: OPTIONAL (String with multiple key-value parameters, default = NULL)
\begin{enumerate}
\item \emph{cp}: OPTIONAL (Default = 0)
A complexity parameter: a split is attempted only if it
decreases the overall lack of fit by a factor of `cp'.
\item \emph{n\_folds}: OPTIONAL (Default = 0; no cross-validation) Number of cross-validation folds
\end{enumerate}
\item \emph{verbose}: OPTIONAL (Default = False) Prints status information
on the splits performed and any other information useful for debugging.
\end{itemize}
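To make the argument list concrete, the following is a hypothetical invocation
for a classification problem. The table name \texttt{patients}, its columns
(\texttt{id}, \texttt{has\_disease}) and the chosen parameter values are
illustrative assumptions only; the remaining optional arguments are passed as
NULL to accept their defaults.
\begin{sql}
SELECT tree_train(
    'patients',           -- training_table_name (hypothetical table)
    'patients_tree',      -- output_table_name
    'id',                 -- id_col_name
    'has_disease',        -- dependent_variable (boolean => classification)
    '*',                  -- list_of_features (all columns)
    'id',                 -- list_of_features_to_exclude
    'gini',               -- split_criterion
    NULL,                 -- grouping_cols
    NULL,                 -- weights
    10,                   -- max_depth
    20,                   -- min_split
    6,                    -- min_bucket
    NULL,                 -- surrogate_params (not implemented)
    NULL,                 -- pruning_params (use defaults)
    FALSE                 -- verbose
)
\end{sql}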
% subsection training (end)
\subsection{Prediction} % (fold)
\label{sub:prediction}
\begin{sql}
SELECT tree_predict(
tree_model,
new_data_table,
output_table,
type
)
\end{sql}
\paragraph{Arguments:}
\begin{itemize}
\item \emph{tree\_model}: Name of the table containing the decision tree model.
\item \emph{new\_data\_table}: Name of table containing prediction data.
\item \emph{output\_table}: Name of table to output prediction results.
\item \emph{type}: OPTIONAL (Default = `response'). For regression trees,
`response' implies the output is the predicted value. For classification trees,
this can be `response', giving the predicted class as output, or
`prob', giving the class probabilities as output.
For two classes, only a single probability value is output; it corresponds
to the first class when the two classes are sorted by name.
For more than two classes, an array of class probabilities
(one probability per class) is output.
\end{itemize}
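Continuing the hypothetical example above, class probabilities for a table of
new observations could be requested as follows; all table names are again
illustrative.
\begin{sql}
SELECT tree_predict(
    'patients_tree',         -- tree_model (output table of tree_train)
    'patients_new',          -- new_data_table
    'patients_predictions',  -- output_table
    'prob'                   -- type: output class probabilities
)
\end{sql}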
% subsection prediction (end)
% section interface (end)
\section{CART} % (fold)
\label{sec:cart}
CART stands for Classification and Regression Trees (\cite{breiman1984cart}). It
is characterized by the fact that it constructs binary trees, that is, each
internal node has exactly two outgoing edges. Splits are selected using
one of the impurity metrics described below, and the resulting tree is pruned by
cost-complexity pruning.
\subsection{Impurity metrics} % (fold)
\label{sub:impurity_metrics}
\begin{figure}[h!tb]
\begin{center}
\includegraphics[width=0.65\textwidth]{figures/impurity_measures}
\end{center}
\caption{(Figure obtained from \cite{hastie2008statisticallearning})
Node impurity measures for two-class classification, as a function of the
proportion $p$ in class 2. Cross-entropy has been scaled to pass through (0.5, 0.5).}
\label{fig:impurity_measures}
\end{figure}
The measures developed for selecting the best split are often based on the
degree of impurity of the child nodes. The smaller the degree of impurity,
the more skewed the class distribution. For example, a node with class
distribution $(0, 1)$ has zero impurity, whereas a node with uniform class
distribution $(0.5, 0.5)$ has the highest impurity. Examples of impurity
measures include
\begin{itemize}
\item Cross-entropy: $- \displaystyle \sum_{i=1}^{C}p(i|A)\log_2 p(i|A)$
\item Gini: $1 - \displaystyle \sum_{i=1}^{C} \left(p(i|A)\right)^2$
\item Classification error: $1 - \displaystyle \max_{i \in 1\ldots C} p(i|A)$,
\end{itemize}
where $C$ is the number of classes and $p(i|A)$ denotes the proportion of records
belonging to class $i$ at a given node $A$.
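As a quick numerical illustration (the proportions are made up), consider a node
$A$ with two classes in proportions $p(1|A) = 0.3$ and $p(2|A) = 0.7$:
\begin{align*}
\text{Cross-entropy} &= -(0.3 \log_2 0.3 + 0.7 \log_2 0.7) \approx 0.881, \\
\text{Gini} &= 1 - (0.3^2 + 0.7^2) = 0.42, \\
\text{Classification error} &= 1 - 0.7 = 0.3.
\end{align*}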
Figure~\ref{fig:impurity_measures} compares the values of the impurity measures
for binary classification problems. Here $p$ refers to the fraction of records that
belong to one of the two classes. Observe that all three measures attain their
maximum value when the class distribution is uniform (i.e., when $p = 0.5$). The
minimum values for the measures are attained when all the records belong to the
same class (i.e., when $p$ equals 0 or 1).
A split for a node is evaluated by how much `impurity' is reduced by that split.
For example, if we use the Gini index for selecting the split, we obtain the best
binary split ($D_1$ and $D_2$) as the one that maximizes the Gini gain,
$$
G_\text{gini}(D; D_1, D_2)= g(D) - \left( \frac{|D_1|}{|D|} g(D_1) +
\frac{|D_2|}{|D|} g(D_2) \right),
$$ where $g(D)$ is the Gini impurity metric.
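As a worked example with made-up counts, suppose $D$ holds 10 records (4 of
class 1, 6 of class 2) and a candidate split produces $D_1$ with class counts
$(4, 1)$ and $D_2$ with class counts $(0, 5)$. Then
$$
g(D) = 1 - (0.4^2 + 0.6^2) = 0.48, \quad
g(D_1) = 1 - (0.8^2 + 0.2^2) = 0.32, \quad
g(D_2) = 0,
$$
giving a Gini gain of
$G_\text{gini} = 0.48 - \left(\tfrac{5}{10}\cdot 0.32 + \tfrac{5}{10}\cdot 0\right) = 0.32$.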
For ordered, continuous attributes, we create splits of the form $X_i < v$,
where $X_i$ is an attribute and $v$ is a value in the domain of $X_i$, allowing
possibly infinitely many such splits. To reduce the computational load, we compute
quantile boundaries of the attribute and test for splits only on these boundary
values.
For categorical attributes, we create splits of the form
$X_i \in \{v_1, v_2, v_3, \ldots\}$, where $\{v_1, v_2, v_3, \ldots\}$ is
a subset of all possible values of the categorical attribute. For ordered,
categorical attributes (integer input), the split points are obtained by sorting
the categorical values and each increasing subsequence is then used as a
possible subset for the split. For unordered, categorical attributes, the values
are sorted by the entropy of the target variable. In the case of binary
classification, this implies that values are sorted by the proportion of labels
of the primary class. For multinomial classification, we sort by the entropy of
each label, i.e., we compute
$-\displaystyle \sum_{i=1}^{C}p(i|X_i=v)\log_2 p(i|X_i=v)$
for each value ($v$) of the attribute ($X_i$) and sort in increasing order.
The splits are then obtained by evaluating each increasing subsequence of
this sorted list.
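For instance (with hypothetical values), if an unordered categorical attribute
takes values $\{a, b, c\}$ and sorting by the entropy measure above yields the
order $(b, a, c)$, then the candidate splits evaluated are $X_i \in \{b\}$ versus
$\{a, c\}$ and $X_i \in \{b, a\}$ versus $\{c\}$.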
\subsection{Stopping Criteria} % (fold)
\label{sub:stopping_criteria}
The growing phase continues until a stopping criterion is triggered.
The following conditions are some of the stopping rules used:
\begin{itemize}
\item All instances in the training set belong to a single value of the target variable.
\item The maximum tree depth has been reached (\emph{max\_depth}).
\item The number of cases in the terminal node is less than the minimum
number of cases for parent nodes (\emph{min\_split}).
\item If the node were split, the number of cases in one or more child nodes
would be less than the minimum number of cases for child nodes (\emph{min\_bucket}).
% \item The best splitting criteria is not greater than a certain threshold (\emph{cp}).
\end{itemize}
% subsection stopping_criteria (end)
\subsection{Missing data} % (fold)
\label{sub:missing_predictor_data}
\textbf{(Note: Not implemented in the initial version.)}
Suppose our data has some missing predictor values in some or all of the
variables. We might discard any observation with some missing values, but this
could lead to serious depletion of the training set. Alternatively we might try
to fill in (impute) the missing values, with, say, the mean of that predictor over
the non-missing observations. For tree-based models, there are two better
approaches. The first is applicable to categorical predictors: we simply make a
new category for ``missing''. From this we might discover that observations with
missing values for some measurement behave differently than those with
non-missing values. The second, more general, approach is the construction of
surrogate variables. When considering a predictor for a split, we use only the
observations for which that predictor is not missing. Having chosen the best
(primary) predictor and split point, we form a list of surrogate predictors and
split points. The first surrogate is the predictor and corresponding split point
that best mimics the split of the training data achieved by the primary split.
The second surrogate is the predictor and corresponding split point that does
second best, and so on. When sending observations down the tree either in the
training phase or during prediction, we use the surrogate splits in order, if
the primary splitting predictor is missing. Surrogate splits exploit
correlations between predictors to try and alleviate the effect of missing data.
The higher the correlation between the missing predictor and the other
predictors, the smaller the loss of information due to the missing value.
The resemblance between two binary splits over a sample $S$ is formally defined as
\begin{align*}
& \mathit{sim}(a_i, dom_1(a_i), dom_2(a_i), a_j, dom_1(a_j), dom_2(a_j), S) = \\
& \quad \left|\sigma_{a_i \in dom_1(a_i) \,\wedge\, a_j \in dom_1(a_j)}\, S\right| +
\left|\sigma_{a_i \in dom_2(a_i) \,\wedge\, a_j \in dom_2(a_j)}\, S\right|,
\end{align*}
where $dom_1(\cdot)$ and $dom_2(\cdot)$ denote the sets of values sent to the two
branches of the corresponding split and $\sigma_{\phi}\, S$ selects the records of
$S$ satisfying predicate $\phi$.
\subsubsection{Implementation considerations} % (fold)
\label{ssub:surrogates_impl}
For implementing surrogates the following should be considered:
\begin{enumerate}
\item Training surrogates: There are two options to consider for training
of surrogates for each split:
\begin{enumerate}
\item Compute surrogates after the whole tree has been trained. Advantage:
the increase in training time is equivalent to increasing the number of
iterations by one. Disadvantage: we won't have surrogates
for a level while training the sub-levels. Hence, all rows with NULL values will
be ignored during the training.
\item Compute surrogate splits for a node while that node is being trained.
Advantage: during training, we have surrogate splits for all nodes above
the currently trained level, implying that features with NULL rows are
ignored only while training nodes that use that feature; all nodes below
will still get to use that row. Disadvantage: the training time doubles, since
every iteration has to first train the nodes in a level, and then
compute the surrogates for those nodes.
\end{enumerate}
\item Storage for surrogates: We store an array of surrogate variables and
their corresponding split thresholds for each node. In the abstract, all the
surrogates would be stored as two matrices: one for the surrogate feature
indices and another for the corresponding thresholds.
\item Update to search using surrogates: While searching the tree for an
incoming tuple, we traverse the tree as usual. If, for a node, the feature
to check is NULL in the tuple, we go through the surrogate splits. If
at least one of the surrogate split features has a non-NULL value, we
use it to search further. If an observation is missing all the surrogates,
we use the majority branch to search further.
\item Displaying surrogate splits: we can display the surrogate splits for
each split in a separate table; this can be implemented as a function call
for the user. The table would have a row for each node, with a column for
each surrogate split variable. Along with the surrogate split we also display
the agreement measure that indicates how similar the two splits are.
\end{enumerate}
It's important to note that, similar to the `rpart' package, no surrogate that
does worse than ``go with the majority'' is printed or used. The ``go with the majority''
rule, also referred to as the ``blind'' rule, computes the agreement measure when
we always go with the branch that has the majority of the tuples. Computing this
blind-rule measure is easy since we already have the statistics of tuples
going left and right from our previous training of the node.
% subsubsection surrogates_impl (end)
\subsection{Pruning} % (fold)
\label{sub:pruning}
See~\cite{hastie2008statisticallearning} for details on various
theoretical explanations in this section.
We use the same method used by R's rpart package to prune the tree.
The method is a little different from what is normally described in
textbooks.
It is often observed that a decision tree that is perfect on the training set
has worse generalization ability than a tree which is
not-so-good on the training set; this is called \textbf{overfitting},
which may be caused by the fact that some peculiarities of the
training data, such as those caused by noise in collecting training
examples, are misleadingly recognized by the learner as the underlying
truth. To reduce the risk of overfitting, a general strategy is to
employ pruning to cut off some tree branches caused by noise or
peculiarities of the training set. \textbf{Pre-pruning} tries to
prune branches while the tree is being grown, while
\textbf{post-pruning} re-examines fully grown trees to decide which
branches should be removed. When a validation set is available, the
tree can be pruned according to the validation error: for pre-pruning,
a branch is not grown if the validation error would increase by
growing the branch; for post-pruning, a branch is removed if the
removal decreases the validation error.
To perform pruning, we define a misclassification (resubstitution) error of the
tree, which is the number of misclassified entries for a classification tree and
the mean-squared error for a regression tree. We can estimate this
error or risk ($R^*$)
for a classification tree as
\begin{equation}
\label{eq:def_risk}
R^*(T) = \sum_{t \in \tilde{T}} r(t) p(t),
\end{equation}
where $p(t)$ is the proportion of points in terminal node $t$ (member of
terminal nodes set $\tilde{T}$) and $r(t)$ is the probability of making wrong
classification for points in node $t$. For a point in a given leaf node $t$, the
estimated probability of misclassification is 1 minus the probability of the
majority class in node $t$ based on the training data.
For a regression tree this would be estimated as
$$
R^*(T) = \sum_{t \in \tilde{T}} Var(t),
$$
where $Var(t)$ is the variance of the dataset assigned to node $t$.
It is easy to prove that the resubstitution error rate $R(T)$ is biased downward,
i.e., minimizing it tends to produce bigger trees. Hence, we need to add a complexity penalty
to this resubstitution error rate. The penalty term favors smaller trees, and
hence balances with $R(T)$.
For any subtree $T < T_{\text{max}}$ , we will define its complexity as
$|\tilde{T}|$, the number of terminal or leaf nodes in $T$. Let $\alpha \ge 0$
be a real number called the `complexity parameter' and define the cost-complexity
measure $R_{\alpha}(T)$ as
\begin{equation}
\label{eq:cp}
R_{\alpha}(T) = R(T) + \alpha |\tilde{T}|\cdot R(T_1),
\end{equation}
where $R(T_1)$ is the risk computed for the tree with only one root
node. Note that textbooks usually do not have $R(T_1)$ in the
definition of the cost-complexity. According to the R package `rpart',
the above scaled version is unitless and thus much more user-friendly
than the original CART formula, where there is no $R(T_1)$.
Because we are using this scaled version, we can simplify the
computation of risk in the case of classification, Eq.
(\ref{eq:def_risk}). In Eq. (\ref{eq:def_risk}), we have $r =
n_{mis}/n$ and $p = n/N$, where $n_{mis}$ is the number of
mis-classified records, $n$ is the number of records classified into the
current node and $N$ is the total number of records. Clearly, in Eq.
(\ref{eq:cp}), we can directly use $n_{mis}$ instead of $r \times p$.
When there is a weight for each record, we should use the weights in the
calculation. For example, $n_{mis}$ should be replaced by the sum of
weights of the mis-classified records.
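As a small numeric check with made-up counts: if a node receives $n = 50$ of
$N = 200$ records and mis-classifies $n_{mis} = 5$ of them, then $r = 5/50 = 0.1$,
$p = 50/200 = 0.25$ and $r \times p = 0.025 = n_{mis}/N$, so summing $n_{mis}$
over the terminal nodes differs from Eq. (\ref{eq:def_risk}) only by the constant
factor $1/N$.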
There is some logic in rpart's code that is not described anywhere. We
try to give a more detailed explanation here.
(1) For a node $i$, we compute the risk $r_i$ of this node, and the
total risk $s_i$ of all nodes in the subtree rooted at node $i$,
excluding node $i$ itself. The number of splits in
subtree-$i$ is $m_i = m_{i,left} + m_{i,right} + 1$. Here the
subscripts ``left'' and ``right'' denote the left and right child of node
$i$. Another important quantity is the cost-complexity itself, $c_i$.
(2) Instead of directly applying Eq. (\ref{eq:cp}), we use the
following logic: For a node, we compute the average improvement per
split in the subtree of node $i$.
\begin{equation}
\bar{t} = \frac{r_i - s_{i,left} - s_{i,right}}{m_i}.
\end{equation}
If $\bar{t} > c_{i,right} > c_{i,left}$, the improvement at this split
is more important than the splits in the sub-tree; since we already
keep the nodes in the sub-tree, we must keep the current split. Thus,
we re-compute the average improvement as
\begin{equation}
\bar{t} = \frac{r_i - r_{i,left} - s_{i,right}}{m_{i,right} + 1} .
\end{equation}
Then, if $\bar{t} > c_{i,right}$, we do the same thing and
update
\begin{equation}
\bar{t} = r_i - r_{i,left} - r_{i,right} .
\end{equation}
Finally, when $\bar{t} > \alpha R(T_1)$, we keep the split.
We use a similar method if instead $\bar{t} > c_{i,left} >
c_{i,right}$: we first deal with the right child, and then deal
with the left child.
(3) We call the pruning function recursively. The question is when
to stop. We stop when the current node under examination is a leaf
node, or when the risk is smaller than the threshold $\alpha\times R(T_1)$
(no change can make the improvement larger than this threshold).
However, rpart uses a fairly complicated method to estimate the risk
that is compared with the threshold. See the code for more
details.
In the end, the cost-complexity measure comes as a penalized version
of the resubstitution error rate. This is the function to be minimized
when pruning the tree. In general, given a pre-selected $\alpha$, we
can find the subtree $T(\alpha)$ that minimizes $R_\alpha(T)$. Since
there are at most a finite number of subtrees of $T_{\text{max}}$,
$R_\alpha(T(\alpha))$ yields different values for only finitely many
$\alpha$'s. $T(\alpha)$ continues to be the minimizing tree when
$\alpha$ increases, until a jump point is reached.
\subsection{Cross-validation with the cost-complexity}
The starting point for the pruning is not $T_{\text{max}}$, but rather
$T_1 = T(0)$, which is the smallest subtree of $T_{\text{max}}$ satisfying
$R(T_1) = R(T_{\text{max}})$.
First, look at the biggest tree, $T_{\text{max}}$, and for any two terminal
nodes descended from the same parent, for instance $t_L$ and $t_R$, if they
yield the same resubstitution error rate as the parent node $t$, prune off these
two terminal nodes, that is, if $R(t) = R(t_L) + R(t_R)$, prune off $t_L$ and
$t_R$. This process is applied recursively. After we have pruned one pair of
terminal nodes, the tree shrinks a little bit. Then based on the smaller tree,
we do the same thing until we cannot find any pair of terminal nodes satisfying
this equality. The resulting tree at this point is $T_1$.
We now find the next $\alpha$ using the following method. The new
$\alpha$ will result in a different optimal subtree.
For any node $t \in T_1$, we can set $R_\alpha({t}) = R(t) + \alpha$, and for
the subtree starting at $t$, we can define $R_\alpha(T_t) = R(T_t) + \alpha |\tilde{T_t}|$.
When $\alpha = 0$, $R_0(T_t) < R_0({t})$, but when $\alpha$ increases there is a point
at which $R_\alpha(T_t) = R_\alpha({t})$. We compute this exact $\alpha$ value by minimizing the
function
\begin{equation*}
g_1(t) = \begin{cases}
\frac{R(t) - R(T_t)}{|\tilde{T_t}| - 1}, & \mbox{if } t \notin \tilde{T_1} \\
\infty, & \mbox{if } t \in \tilde{T_1}.
\end{cases}
\end{equation*}
We find the weakest link $\bar{t_1} \in T_1$ that achieves the minimum of
$g_1(t)$ and set the new $\alpha_2 = g_1(\bar{t_1})$. To get the optimal subtree $T_2$,
we simply remove the branch growing out of $\bar{t_1}$. If there are several nodes
that simultaneously achieve the minimum $g_1(t)$, we remove the branch growing out of
each of these nodes. We repeat this process till we end up with only the root
node, i.e., a tree with no splits.
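As a hedged numerical illustration (the risks are invented), suppose an internal
node $t$ has $R(t) = 0.30$ while the branch $T_t$ below it has three leaves and
$R(T_t) = 0.12$. Then
$$
g_1(t) = \frac{0.30 - 0.12}{3 - 1} = 0.09,
$$
i.e., the branch at $t$ buys an average risk reduction of $0.09$ per additional
leaf. The node with the smallest such value over the tree is the weakest link and
determines the next $\alpha$.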
To find which of these optimal subtrees is the best one, we can run
cross-validation for the optimal subtree of each $\alpha$ to obtain an estimate
of the test error for each optimal subtree and pick the one with the
lowest error.
% subsection pruning (end)
% subsection missing_predictor_data (end)
% subsection impurity_metrics (end)
% section cart (end)
% section training_algorithm_description (end)
\section{Parallelism} % (fold)
\label{sec:parallelism}
Let $X = \{X_1, X_2, \dots, X_N\}$ be a set of attributes with domains
$D_{X_1}, D_{X_2}, \dots, D_{X_N}$. Let $Y$ be an output with domain $D_Y$.
Given a dataset $D^* = \{(x_i, y_i) \,|\, x_i \in D_{X_1} \times D_{X_2} \times \dots \times D_{X_N}, y_i \in D_Y\}$,
the goal is to learn a tree model that best approximates the true
distribution of $D^*$. However, finding the optimal tree is an NP-hard problem,
thus most decision tree construction algorithms use a greedy top-down approach
as described in Algorithm~\ref{alg:buildsubtree}.
\begin{algorithm}[BuildSubtree$(n, D)$] \label{alg:buildsubtree}
\alginput{Node $n$, Data $D \in D^*$}
\algoutput{A Decision Tree Model}
\begin{algorithmic}[1]
\State $\left(n \rightarrow split, D_L, D_R \right) = \text{FindBestSplit}(D)$
\If{StoppingCriteria$(D_L)$}
\State $n \rightarrow \text{left\_prediction} = \text{FindPrediction}(D_L)$
\Else
\State $\text{BuildSubtree}(n \rightarrow \text{left}, D_L)$
\EndIf
\If{StoppingCriteria$(D_R)$}
\State $n \rightarrow \text{right\_prediction} = \text{FindPrediction}(D_R)$
\Else
\State $\text{BuildSubtree}(n \rightarrow \text{right}, D_R)$
\EndIf
\end{algorithmic}
\end{algorithm}
%From the algorithm \ref{alg:build-subtree}, there are three major steps
%\begin{itemize}
%\item[1.] Initialization, which includes finding split candidates.
%\item[2.] Find best split for each node
%\item[3.] Find prediction for leaf nodes in the decision tree.
%\end{itemize}
Note that each call of Algorithm~\ref{alg:buildsubtree} requires at least
one pass over the dataset $D^*$, which means that the number of passes needed
for training a tree of depth $l$ is $O(2^l)$. This node-wise training incurs an IO
cost that is exponential in the tree depth, which does not scale to large datasets.
To scale to large datasets, our algorithm performs level-wise training
as proposed in \cite{panda2009planet}. With each
pass over the training dataset, we find the best splits for all nodes at the same
level. Moreover, to reduce the search space in finding the best splits, we construct
equal-depth binning for each attribute and only consider bin boundaries as split
candidates. In what follows, we describe the details of each step in the level-wise
decision tree training algorithm. The following assumptions hold in the level-wise
training algorithm.
\begin{itemize}
\item[1.] The tree model can fit in memory.
\item[2.] The dataset is distributed over multiple machines.
\end{itemize}
\subsection{Initialization}
In the initialization step, we find split candidates by equal-depth binning on
each attribute. For continuous attributes, on large distributed datasets, we
perform a quantile calculation over a sampled fraction of the data to get an
approximate set of split candidates. The ordered splits create bins and the
maximum number of such bins can be specified by the user. Note that the number of
bins cannot be greater than the number of training examples. The tree algorithm
automatically reduces the number of bins if this condition is not satisfied.
For a categorical attribute with $M$ values and binary classification, the number of split
candidates can be reduced to $M-1$ by ordering the categorical attribute values
by the proportion of labels falling in one of the two classes. For example, for
a binary classification problem with one categorical feature with three
categories $A$, $B$ and $C$ whose corresponding proportions of label $1$ are $0.2$,
$0.6$ and $0.4$, the categorical feature values are ordered as $A$ followed by $C$
followed by $B$. The two split candidates are $A|C,B$ and $A,C|B$,
where $|$ denotes the split. For categorical variables in multiclass classification,
each bin is a category. The bins are then ordered by the impurity of
their corresponding label distributions.
Algorithm~\ref{alg:findSplitBins} describes binning for CART.
\begin{algorithm}[findSplitBins$(D^*, MB)$] \label{alg:findSplitBins}
\alginput{Training dataset $D^*$, max bins for each feature $MB$}
\algoutput{An equal-depth binning for each attribute in $D^*$}
\begin{algorithmic}[1]
\If {$MB < \left|D^*\right|$}
\State $numBins = MB$
\Else
\State $numBins = \left|D^*\right|$
\EndIf
\If {$numBins * numBins < \left|D^*\right|$}
\State Sample fraction $f = numBins * numBins/{\left|D^*\right|}$
\Else
\State Sample fraction $f = 1.0$
\EndIf
\State Randomly select $f\left| D^* \right|$ records $R$ from $D^*$ and load $R$ into memory
\ForAll {$attr$ in $D^*$}
\If {$attr$ is continuous}
\State sort $R$ by $attr$, perform quantile computation
\Else
\If {binary classification}
\State compute centroid of labels for each value in $attr$
\State sort $attr$ by centroid
\Else
\State compute impurity for each label for each value in $attr$
\State sort $attr$ by impurity
\EndIf
\EndIf
\EndFor
\end{algorithmic}
\end{algorithm}
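To illustrate the sampling logic in Algorithm~\ref{alg:findSplitBins} with
made-up sizes: if $MB = 100$ and $\left|D^*\right| = 10^6$, then $numBins = 100$
and, since $100 \times 100 < 10^6$, the sample fraction is $f = 10^4/10^6 = 0.01$,
i.e., roughly $10{,}000$ records are loaded into memory to compute the quantile
boundaries.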
\subsection{Find best splits}
For a particular split node, we want to find the attribute and split point that maximize purity.
In classification, we use Gini impurity or entropy as the impurity measure, and
in regression, we use variance as the impurity measure.
For classification, we need to compute the frequency $f_i$ of label $i$ at each
node to compute the Gini impurity and entropy.
For regression, we need to compute
\begin{align*}
\text{Var}(D) = \frac{1}{n} \sum_{i} y_i^2 - \left(\frac{1}{n} \sum_{i}y_{i} \right)^2,
\end{align*}
where $y_{i}$ is the value of the output $Y$ for the $i$-th example and $n$ is the number of examples in $D$.
One important observation is that both label frequency and variance can be
computed from sufficient statistics.
For label frequency, the sufficient statistics we aggregate when passing over
the training examples are
\begin{itemize}
\item The number of examples with label $i$: $N_i = \sum_j I(y_j = i)$
\end{itemize}
For variance, the sufficient statistics we aggregate when passing over the
training examples are
\begin{itemize}
\item The number of training examples: $N = \sum_i 1$
\item The sum of values: $S = \sum_i y_i$
\item The sum of squared values: $Q = \sum_i y_i^2$
\end{itemize}
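These three aggregates suffice to recover the impurity: using the definition of
$\text{Var}(D)$ above,
$$
\text{Var}(D) = \frac{Q}{N} - \left(\frac{S}{N}\right)^2 .
$$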
Since the computations of the above sufficient statistics are associative,
we have the following high-level algorithm for parallel decision tree
construction. Assume that we are training the nodes at level $l$. In each segment,
we store the aggregated values of the sufficient statistics that we need to compute
purity. As we pass over each record $r$ in each segment, we find the
corresponding bin for each attribute in $r$ and add to the aggregated
statistics. When all segments finish processing the data, the aggregated
sufficient statistics are merged. Once we have the overall aggregated
sufficient statistics, we find the split that maximizes the impurity reduction
for each node at level $l$.
\begin{algorithm}[findBestSplits$(D^*, M, n_l, B)$] \label{alg:findBestSplits}
\alginput{Training dataset $D^*$,\\
Tree model $M$, \\
Set of nodes at level $l$, $n_l$, Binning info $B$}
\algoutput{Split that maximizes the impurity reduction for each node in $n_l$}
\begin{algorithmic}[1]
\State On each segment, perform the following \Comment{Transition function}
\For{$d \in D^*$}
\For{$attr\_value \in d$}
\State Find the bin index in $B$ that $attr\_value$ belongs to
\State Add sufficient statistics $s$ to aggregate
\EndFor
\EndFor
\State Merge aggregated statistics on different segments \Comment{Merge function}
\State Find splits that maximize the impurity reduction for each node in $n_l$ \Comment{Final function}
\end{algorithmic}
\end{algorithm}
For Algorithm~\ref{alg:findBestSplits} to handle weights for each training example, when we add sufficient statistics $s$ to the aggregate,
we instead add $w \times s$ to the aggregate, where $w$ is the weight for the training example.
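Concretely, as a sketch of what the weighted aggregation amounts to (not a
statement of the implementation), the regression sufficient statistics become
$N = \sum_i w_i$, $S = \sum_i w_i y_i$ and $Q = \sum_i w_i y_i^2$, and the label
counts for classification become $N_i = \sum_j w_j I(y_j = i)$.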
% section parallelism (end)