| % When using TeXShop on the Mac, let it know the root document. The following must be one of the first 20 lines. |
| % !TEX root = ../design.tex |
| |
| \chapter[Decision trees]{Decision Trees: Classification and Regression} |
| \begin{moduleinfo} |
| \item[Authors] {Rahul Iyer and Liquan Pei} |
| \item[History] |
| \begin{modulehistory} |
| \item[v0.2] Parallelism |
| \item[v0.1] Initial version: introduction, theory, and interface |
| \end{modulehistory} |
| \end{moduleinfo} |
| |
| % Abstract. What is the problem we want to solve? |
| \section{Introduction} % (fold) |
| \label{sec:introduction} |
| |
| |
| \emph{Notes, examples and figures in this section are borrowed from |
| \cite{hastie2008statisticallearning} and \cite{james2013statisticallearning}}. |
| |
| |
| Tree-based methods for regression and classification involve stratifying or |
| segmenting the predictor space into a number of simple regions. In order to make |
| a prediction for a given observation, we typically use the mean or the mode of |
| the training observations in the region to which it belongs. Since the set of |
| splitting rules used to segment the predictor space can be summarized in a tree, |
| these types of approaches are known as decision tree methods. Tree-based methods |
are simple and useful for interpretation. However, they are typically not
competitive with the best supervised learning approaches in terms of prediction
accuracy. For this reason, the results from multiple decision trees are usually
combined to yield a single consensus prediction, often resulting in dramatic
improvements in prediction accuracy at the expense of some loss of interpretability.
| |
| \subsection{Basics of Decision Trees} % (fold) |
| \label{sub:basics_of_decision_trees} |
| In order to motivate tree methods, we begin with a simple example. |
| |
| \begin{figure}[h!tb] |
| \begin{center} |
| \includegraphics[width=0.8\textwidth]{figures/basics_decision_tree} |
| \end{center} |
| \caption{(Figure from \cite{hastie2008statisticallearning}.) |
| Top right panel shows a partition of a two-dimensional |
| feature space by recursive binary splitting, applied to |
| some arbitrary data. Top left panel shows a general partition that cannot be |
| obtained from recursive binary splitting. Bottom left panel shows the tree |
| corresponding to the partition in the top right panel, and a perspective |
| plot of the prediction surface appears in the bottom right panel.} |
| \label{fig:basics_decision_tree} |
| \end{figure} |
| Let's consider a regression problem with continuous response $Y$ and inputs |
| $X_1$ and $X_2$, each taking values in the unit interval. The top left panel of |
| Figure~\ref{fig:basics_decision_tree} shows a partition of the feature space by |
| lines that are parallel to the coordinate axes. In each partition element we can |
| model $Y$ with a different constant. However, there is a problem: although each |
| partitioning line has a simple description like $X_1 = c$, some of the resulting |
| regions are complicated to describe. |
| |
| To simplify matters, we restrict attention to recursive binary partitions like |
| that in the top right panel of Figure~\ref{fig:basics_decision_tree}. We first |
| split the space into two regions, and model the response by the mean of $Y$ in |
| each region. We choose the variable and split-point to achieve the best fit. |
| Then one or both of these regions are split into two more regions, and this |
process is continued until some stopping rule is applied. For example, in the
top right panel of Figure~\ref{fig:basics_decision_tree}, we first split at
$X_1 = t_1$. Then the region $X_1 \le t_1$ is split at $X_2 = t_2$ and the region
$X_1 > t_1$ is split at $X_1 = t_3$. Finally, the region $X_1 > t_3$ is split at
$X_2 = t_4$. The result of this process is a partition into the five
| regions $R_1, R_2, \ldots, R_5$ shown in the figure. The corresponding regression |
| model predicts $Y$ with a constant $c_m$ in region $R_m$, that is, |
| $$ |
f(X) = \sum_{m=1}^{5} c_m \, I\{(X_1, X_2) \in R_m\}.
| $$ |
| |
| This same model can be represented by the binary tree in the bottom left panel |
| of Figure~\ref{fig:basics_decision_tree}. The full dataset sits at the top of |
| the tree. Observations satisfying the condition at each junction are assigned to |
| the left branch, and the others to the right branch. The terminal nodes or |
| leaves of the tree correspond to the regions $R_1, R_2, \ldots, R_5$. The bottom |
| right panel is a perspective plot of the regression surface from this model. |
| |
| \paragraph{Prediction via Stratification of the Feature Space:} |
| |
| We now discuss the process of building a regression tree. Roughly speaking, |
| there are two steps. |
| \begin{enumerate} |
\item We divide the predictor space (that is, the set of possible values for
$X_1, X_2, \ldots, X_p$) into $J$ distinct and non-overlapping regions, $R_1, R_2,
\ldots, R_J$.
| \item For every observation that falls into the region $R_j$, we make the same |
| prediction, which is simply the mean of the response values for the training |
| observations in $R_j$. |
| \end{enumerate} |
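For instance, if a terminal region $R_j$ contains three training observations with
(hypothetical) response values $\{4, 6, 8\}$, then every new observation falling
into $R_j$ receives the same prediction, namely the regional mean $(4 + 6 + 8)/3 = 6$.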
| |
| \paragraph{Advantages of using tree-based models:} |
| |
| \begin{enumerate} |
| \item Trees are very easy to explain to people. |
\item Some believe that decision trees more closely mirror human decision-making
than do classical regression and classification approaches.
| \item Trees can be displayed graphically, and are easily interpreted even by a non-expert. |
| \item Trees can easily handle qualitative predictors without the need to create dummy variables. |
| \end{enumerate} |
| % subsection basics_of_decision_trees (end) |
| |
| \subsection{Trees Versus Linear Models} % (fold) |
| \label{sub:trees_versus_linear_models} |
| \begin{figure}[h!tb] |
| \begin{center} |
| \includegraphics[width=0.7\textwidth]{figures/decision_tree_linear_model} |
| \end{center} |
| \caption{(Figure from \cite{james2013statisticallearning}.) |
| Top Row: A two-dimensional classification example in which the true decision |
| boundary is linear, and is indicated by the shaded regions. A classical approach |
| that assumes a linear boundary (left) will outperform a decision tree that |
| performs splits parallel to the axes (right). Bottom Row: Here the true decision |
| boundary is non-linear. Here a linear model is unable to capture the true |
| decision boundary (left), whereas a decision tree is successful (right).} |
| \label{fig:decision_trees_linear_models} |
| \end{figure} |
| |
| Regression and classification trees have a very different flavor from the more |
classical approaches for regression presented in Chapter~\ref{chap:regression}.
| In particular, linear regression assumes a model of the form |
| $$ |
| f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j, |
| $$ |
| whereas regression trees assume a model of the form |
| $$ |
| f(X) = \sum_{m=1}^{M} c_m \cdot 1_{X \in R_m} |
| $$ |
| where $R_1, \ldots, R_M$ represent a partition of feature space. |
The efficacy of the two models depends on the problem at hand, as illustrated by
Figure~\ref{fig:decision_trees_linear_models}.
| % subsection trees_versus_linear_models (end) |
| % section introduction (end) |
| |
| \section{Interface} % (fold) |
| \label{sec:interface} |
| The interface of the decision tree functionality includes two functions, |
| \texttt{tree\_train} and \texttt{tree\_predict}, described below. |
| |
| \subsection{Training} % (fold) |
| \label{sub:training} |
| \begin{sql} |
| SELECT tree_train( |
| training_table_name, |
| output_table_name, |
| id_col_name, |
| dependent_variable, |
| list_of_features, |
| list_of_features_to_exclude, |
| split_criterion, |
| grouping_cols, |
| weights, |
| max_depth, |
| min_split, |
min_bucket,
n_bins,
| surrogate_params, -- not implemented |
| pruning_params, |
| verbose |
| ) |
| \end{sql} |
| |
| \paragraph{Arguments:} |
| |
| \begin{itemize} |
| % \item \emph{training\_algorithm}: Can be either `c4.5' or `cart' |
| \item \emph{training\_table\_name}: Name of the table containing data. |
| \item \emph{output\_table\_name}: Name of the table to output the model. |
| \item \emph{id\_col\_name}: Name of column containing the id information in training data. |
| \item \emph{dependent\_variable}: Name of the column that contains the |
| output for training. Boolean, integer and text are considered classification |
| outputs, while float values are considered regression outputs. |
| \item \emph{list\_of\_features}: List of column names (comma-separated string) |
| to use as predictors. Can also be a `*' implying all columns are to be used |
| as predictors (except the ones included in the next argument). Boolean, integer, and |
| text columns are considered categorical columns. |
| \item \emph{list\_of\_features\_to\_exclude}: OPTIONAL. List of column names |
(comma-separated string) to exclude from the predictors list.
\item \emph{split\_criterion}: OPTIONAL (Default = `gini'). The criterion used
to select the best split for each node. Available options are `gini',
`cross-entropy', and `misclassification' for classification.
For regression trees, this argument is ignored and the
split criterion is always set to `mse' (mean-squared error).
| \item \emph{grouping\_cols}: OPTIONAL. List of column names (comma-separated string) to group |
| the data by. This will lead to creating multiple decision trees, one for each group. |
| \item \emph{weights}: OPTIONAL. Column name containing weights for each observation. |
| \item \emph{max\_depth}: OPTIONAL (Default = 10). Set the maximum depth of |
| any node of the final tree, with the root node counted as depth 0. |
| \item \emph{min\_split}: OPTIONAL (Default = 20). Minimum number of observations that |
| must exist in a node for a split to be attempted. |
\item \emph{min\_bucket}: OPTIONAL (Default = min\_split/3). Minimum number of observations
in any terminal node. If only one of \emph{min\_bucket} or \emph{min\_split} is specified,
\emph{min\_split} is set to min\_bucket*3 or \emph{min\_bucket} to min\_split/3, as appropriate.
| % \item \emph{max\_compete}: OPTIONAL (Default = 0) Number of competitor splits |
| % retained in the output. It is useful to know not just which split was chosen, |
| % but which variable came in second, third, etc. |
\item \emph{n\_bins}: OPTIONAL (Default = 100) Number of bins to use during binning.
Continuous-valued features are binned into discrete bins (per the quantile values)
to compute split boundaries. This global parameter determines the resolution of the
bins; a higher number of bins leads to longer processing time.
| \item \emph{surrogate\_params}: (Not implemented) OPTIONAL (String with multiple key-value parameters, default = NULL) |
| \begin{enumerate} |
| \item \emph{max\_surrogate}: OPTIONAL (Default = 0) |
| The number of surrogate splits retained in the output. |
| If this is set to zero the compute time will be reduced by half. |
| % \item \emph{use\_surrogate}: OPTIONAL (Default = 1) Use of surrogates in splitting process. |
| % \begin{enumerate} |
| % \setcounter{enumi}{0} |
| % \item Display only, an observation with a missing value for the primary split rule |
| % is not sent further down the tree. |
| % \item Use surrogates to split subjects missing the primary variable; if all surrogates are missing the observation is not split. |
| % \item If all surrogates are missing, then send the observation in the majority direction. |
| % \end{enumerate} |
| % \item \emph{surrogate\_style}: OPTIONAL (Default = 0) |
| % \begin{enumerate} |
| % \setcounter{enumi}{0} |
| % \item Use total number of correct classification for a potential surrogate variable |
| % \item Use the percent correct, calculated over the non-missing values of the surrogate |
| % \end{enumerate} |
| \end{enumerate} |
| \item \emph{pruning\_params}: OPTIONAL (String with multiple key-value parameters, default = NULL) |
| \begin{enumerate} |
| \item \emph{cp}: OPTIONAL (Default = 0) |
A complexity parameter: a split is attempted only if it
decreases the overall lack of fit by a factor of `cp'.
| \item \emph{n\_folds}: OPTIONAL (Default = 0; no cross-validation) Number of cross-validation folds |
| \end{enumerate} |
| \item \emph{verbose}: OPTIONAL (Default = False) Prints status information |
| on the splits performed and any other information useful for debugging. |
| \end{itemize} |
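As an illustration, a hypothetical call for a regression tree on housing data might
look as follows; the table name \texttt{houses}, its columns, and the exact key-value
syntax of \emph{pruning\_params} are invented for this example and are not part of
the interface definition above.
\begin{sql}
SELECT tree_train(
    'houses',                        -- training_table_name
    'houses_tree',                   -- output_table_name
    'id',                            -- id_col_name
    'price',                         -- dependent_variable (float => regression)
    'bedrooms, bathrooms, lot_size', -- list_of_features
    NULL,                            -- list_of_features_to_exclude
    'mse',                           -- split_criterion (ignored for regression)
    NULL,                            -- grouping_cols
    NULL,                            -- weights
    10,                              -- max_depth
    20,                              -- min_split
    6,                               -- min_bucket
    100,                             -- n_bins
    NULL,                            -- surrogate_params (not implemented)
    'cp=0.01, n_folds=5',            -- pruning_params
    FALSE                            -- verbose
);
\end{sql}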
| % subsection training (end) |
| |
| \subsection{Prediction} % (fold) |
| \label{sub:prediction} |
| \begin{sql} |
| SELECT tree_predict( |
| tree_model, |
| new_data_table, |
| output_table, |
| type |
| ) |
| \end{sql} |
| |
| \paragraph{Arguments:} |
| |
| \begin{itemize} |
| \item \emph{tree\_model}: Name of the table containing the decision tree model. |
| \item \emph{new\_data\_table}: Name of table containing prediction data. |
| \item \emph{output\_table}: Name of table to output prediction results. |
\item \emph{type}: OPTIONAL (Default = `response'). For regression trees,
`response' implies the output is the predicted value. For classification trees,
this can be `response', giving the predicted class as output, or
`prob', giving the class probabilities as output.
For two classes, only a single probability value is output, corresponding
to the first class when the two classes are sorted by name;
for more than two classes, an array of class probabilities
(one probability per class) is output.
| \end{itemize} |
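A corresponding hypothetical prediction call, reusing the invented table names from
the training example above, could be:
\begin{sql}
SELECT tree_predict(
    'houses_tree',         -- tree_model produced by tree_train
    'houses_new',          -- new_data_table to score
    'houses_predictions',  -- output_table
    'response'             -- type: predicted value
);
\end{sql}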
| % subsection prediction (end) |
| % section interface (end) |
| \section{CART} % (fold) |
| \label{sec:cart} |
CART stands for Classification and Regression Trees (\cite{breiman1984cart}). It
is characterized by the fact that it constructs binary trees, namely, each
internal node has exactly two outgoing edges. Splits are selected using
one of the impurity metrics described below, and the resulting tree is pruned by
cost-complexity pruning.
| |
| \subsection{Impurity metrics} % (fold) |
| \label{sub:impurity_metrics} |
| \begin{figure}[h!tb] |
| \begin{center} |
| \includegraphics[width=0.65\textwidth]{figures/impurity_measures} |
| \end{center} |
| \caption{(Figure obtained from \cite{hastie2008statisticallearning}) |
| Node impurity measures for two-class classification, as a function of the |
| proportion $p$ in class 2. Cross-entropy has been scaled to pass through (0.5, 0.5).} |
| \label{fig:impurity_measures} |
| \end{figure} |
| |
| The measures developed for selecting the best split are often based on the |
| degree of impurity of the child nodes. The smaller the degree of impurity, |
| the more skewed the class distribution. For example, a node with class |
| distribution $(0, 1)$ has zero impurity, whereas a node with uniform class |
| distribution $(0.5, 0.5)$ has the highest impurity. Examples of impurity |
| measures include |
| |
| \begin{itemize} |
| \item Cross-entropy: $- \displaystyle \sum_{i=1}^{C}p(i|A)\log_2 p(i|A)$ |
| \item Gini: $1 - \displaystyle \sum_{i=1}^{C} \left(p(i|A)\right)^2$ |
| \item Classification error: $1 - \displaystyle \max_{i \in 1\ldots C} p(i|A)$, |
| \end{itemize} |
| where $C$ is the number of classes and $p(i|A)$ denotes the proportion of records |
| belonging to class $i$ at a given node $A$. |
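As a concrete (hypothetical) example, consider a binary node $A$ with class
proportions $p(1|A) = 0.2$ and $p(2|A) = 0.8$:
\begin{align*}
\text{Cross-entropy} &= -0.2\log_2 0.2 - 0.8\log_2 0.8 \approx 0.72, \\
\text{Gini} &= 1 - (0.2^2 + 0.8^2) = 0.32, \\
\text{Classification error} &= 1 - 0.8 = 0.2,
\end{align*}
all of which are smaller than their values at the uniform distribution
$(0.5, 0.5)$, where cross-entropy $= 1$, Gini $= 0.5$, and classification
error $= 0.5$.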
| |
Figure~\ref{fig:impurity_measures} compares the values of the impurity measures
for binary classification problems. Here $p$ refers to the fraction of records that
belong to one of the two classes. Observe that all three measures attain their
maximum value when the class distribution is uniform (i.e., when $p = 0.5$). The
minimum values for the measures are attained when all the records belong to the
same class (i.e., when $p$ equals 0 or 1).
| |
A split for a node is evaluated by how much `impurity' it reduces.
For example, if we use the Gini index for selecting the split, we choose the
binary split (into $D_1$ and $D_2$) that maximizes the Gini gain,
| $$ |
| G_\text{gini}(D; D_1, D_2)= g(D) - \left( \frac{|D_1|}{|D|} g(D_1) + |
| \frac{|D_2|}{|D|} g(D_2) \right), |
| $$ where $g(D)$ is the Gini impurity metric. |
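As a hypothetical numeric example, suppose a node $D$ holds 100 records split
evenly between two classes, so $g(D) = 1 - (0.5^2 + 0.5^2) = 0.5$. A candidate
split sends 50 records to $D_1$ with class proportions $(0.8, 0.2)$ and 50
records to $D_2$ with proportions $(0.2, 0.8)$, so
$g(D_1) = g(D_2) = 1 - (0.8^2 + 0.2^2) = 0.32$, and the Gini gain is
$$
G_\text{gini}(D; D_1, D_2) = 0.5 - \left(\tfrac{50}{100}\cdot 0.32 +
\tfrac{50}{100}\cdot 0.32\right) = 0.18.
$$
The candidate split with the largest such gain is selected.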
| |
For ordered, continuous attributes, we create splits of the form $X_i < v$,
where $X_i$ is an attribute and $v$ is a value in the domain of $X_i$, allowing
a possibly infinite number of such splits. To reduce the computational load, we compute
quantile boundaries of the attribute and test for splits only on these boundary
values.
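For example (with hypothetical settings), if \emph{n\_bins} were set to 4, a
continuous attribute would be examined only at its three sample quartiles, so even
a feature with thousands of distinct values contributes just three candidate
splits of the form $X_i < v$.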
| |
| For categorical attributes, we create splits of the form |
| $X_i \in \{v_1, v_2, v_3, \ldots\}$, where $\{v_1, v_2, v_3, \ldots\}$ is |
| a subset of all possible values of the categorical attribute. For ordered, |
| categorical attributes (integer input), the split points are obtained by sorting |
| the categorical values and each increasing subsequence is then used as a |
possible subset for the split. For unordered, categorical attributes, the values
are sorted by the entropy of the target variable. In the case of binary
classification, this implies that values are sorted by the proportion of labels
of the primary class. For multinomial classification, we sort by the entropy of
the target variable at each value, i.e., we compute
$-\displaystyle \sum_{i=1}^{C}p(i|X_i=v)\log_2 p(i|X_i=v)$
for each value $v$ of the attribute $X_i$ and sort in increasing order.
| The splits are then obtained by evaluating each increasing subsequence of |
| this sorted list. |
| |
| \subsection{Stopping Criteria} % (fold) |
| \label{sub:stopping_criteria} |
| The growing phase continues until a stopping criterion is triggered. |
| The following conditions are some of the stopping rules used: |
| \begin{itemize} |
\item All instances in the training set belong to a single value of the target variable.
| \item The maximum tree depth has been reached (\emph{max\_depth}). |
| \item The number of cases in the terminal node is less than the minimum |
| number of cases for parent nodes (\emph{min\_split}). |
| \item If the node were split, the number of cases in one or more child nodes |
| would be less than the minimum number of cases for child nodes (\emph{min\_bucket}). |
| % \item The best splitting criteria is not greater than a certain threshold (\emph{cp}). |
| \end{itemize} |
| % subsection stopping_criteria (end) |
| |
| \subsection{Missing data} % (fold) |
| \label{sub:missing_predictor_data} |
| \textbf{(Note: Not implemented in the initial version.)} |
| |
| Suppose our data has some missing predictor values in some or all of the |
| variables. We might discard any observation with some missing values, but this |
| could lead to serious depletion of the training set. Alternatively we might try |
| to fill in (impute) the missing values, with say the mean of that predictor over |
| the nonmissing observations. For tree-based models, there are two better |
| approaches. The first is applicable to categorical predictors: we simply make a |
| new category for ``missing''. From this we might discover that observations with |
| missing values for some measurement behave differently than those with |
| non\-missing values. The second more general approach is the construction of |
| surrogate variables. When considering a predictor for a split, we use only the |
| observations for which that predictor is not missing. Having chosen the best |
| (primary) predictor and split point, we form a list of surrogate predictors and |
| split points. The first surrogate is the predictor and corresponding split point |
| that best mimics the split of the training data achieved by the primary split. |
| The second surrogate is the predictor and corresponding split point that does |
| second best, and so on. When sending observations down the tree either in the |
| training phase or during prediction, we use the surrogate splits in order, if |
| the primary splitting predictor is missing. Surrogate splits exploit |
| correlations between predictors to try and alleviate the effect of missing data. |
| The higher the correlation between the missing predictor and the other |
| predictors, the smaller the loss of information due to the missing value. |
| |
The resemblance between two binary splits over a sample $S$ is formally defined as:
\begin{align*}
& sim(a_i, dom_1(a_i), dom_2(a_i), a_j, dom_1(a_j), dom_2(a_j), S) = \\
& \ |\sigma_{a_i \in dom_1(a_i) \wedge a_j \in dom_1(a_j)} S| +
  |\sigma_{a_i \in dom_2(a_i) \wedge a_j \in dom_2(a_j)} S|,
\end{align*}
that is, the number of observations in $S$ that both splits send to the same side.
| |
| \subsubsection{Implementation considerations} % (fold) |
| \label{ssub:surrogates_impl} |
| For implementing surrogates the following should be considered: |
| |
| \begin{enumerate} |
| \item Training surrogates: There are two options to consider for training |
| of surrogates for each split: |
| \begin{enumerate} |
\item Compute surrogates after the whole tree has been trained. Advantage:
the increase in training time is equivalent to increasing the number of
iterations by one. Disadvantage: we won't have surrogates
for a level while training its sub-levels. Hence, all rows with NULL values
are ignored during training.
\item Compute surrogate splits for a node while that node is being trained.
Advantage: during training, we have surrogate splits for all nodes above
the currently trained level, implying that features with NULL rows are
ignored only while training nodes that use that feature; all nodes below
will still get to use that row. Disadvantage: the training time doubles, since
every iteration has to first train the nodes in the level, and then
compute the surrogates for those nodes.
| \end{enumerate} |
\item Storage for surrogates: We store an array of surrogate variables and
their corresponding split thresholds for each node. In the abstract, all the
surrogates are stored as two matrices: one for the surrogate feature
indices and another for the corresponding thresholds.
\item Update to search using surrogates: While searching the tree for an
incoming tuple, we traverse the tree as usual. If, for a node, the feature
to check is NULL in the tuple, we go through the surrogate splits. If
at least one of the surrogate split features has a non-NULL value, we
use it to search further. If an observation is missing all the surrogates,
we use the majority branch to search further.
| |
\item Displaying surrogate splits: we can display the surrogate splits for
each split in a separate table; this can be implemented as a function call
for the user. The table would have a row for each node, with a column for
each surrogate split variable. Along with each surrogate split we also display
the agreement measure that indicates how similar the two splits are.
| \end{enumerate} |
| |
It is important to note that, similar to the `rpart' package, no surrogate that
does worse than ``go with the majority'' is printed or used. The ``go with the majority''
rule, also referred to as the ``blind'' rule, computes the agreement measure obtained when
we always go with the branch that has the majority of the tuples. Computing this
blind-rule measure is easy since we already have the statistics of tuples
going left and right from our previous training of the node.
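For instance (hypothetical numbers), if the primary split sends 60 of 100
observations to the left branch, the blind rule achieves an agreement of
$60/100 = 0.6$ by always choosing the left branch; a surrogate split is
therefore retained only if it sends more than 60 of those observations to
the same side as the primary split.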
| |
| % subsubsection surrogates_impl (end) |
| |
| \subsection{Pruning} % (fold) |
| \label{sub:pruning} |
| |
See~\cite{hastie2008statisticallearning} for details of the
theoretical background summarized in this section.
| |
| We use the same method used by R's rpart package to prune the tree. |
| The method is a little different from what is normally described in |
| textbooks. |
| |
It is often observed that a decision tree that is perfect on the training set
will have worse generalization ability than a tree which is
not-so-good on the training set; this is called \textbf{overfitting}, and it
may be caused by the fact that some peculiarities of the
training data, such as those caused by noise in collecting training
examples, are misleadingly recognized by the learner as the underlying
truth. To reduce the risk of overfitting, a general strategy is to
employ pruning to cut off some tree branches caused by noise or
peculiarities of the training set. \textbf{Pre-pruning} tries to
| prune branches when the tree is being grown, while |
| \textbf{post-pruning} re-examines fully grown trees to decide which |
| branches should be removed. When a validation set is available, the |
| tree can be pruned according to the validation error: for pre-pruning, |
| a branch will not be grown if the validation error will increase by |
| growing the branch; for post-pruning, a branch will be removed if the |
| removal will decrease the validation error. |
| |
| To perform pruning, we define a misclassification (resubstitution) error of the |
| tree, which is the number of misclassified entries for a classification tree and |
the mean-squared error for a regression tree. We can estimate this
error or risk ($R^*$)
for a classification tree as
| \begin{equation} |
| \label{eq:def_risk} |
| R^*(T) = \sum_{t \in \tilde{T}} r(t) p(t), |
| \end{equation} |
| |
where $p(t)$ is the proportion of points in terminal node $t$ (a member of the
set of terminal nodes $\tilde{T}$) and $r(t)$ is the probability of making a wrong
classification for points in node $t$. For a point in a given leaf node $t$, the
estimated probability of misclassification is 1 minus the probability of the
majority class in node $t$ based on the training data.
| |
For a regression tree this risk is estimated as
$$
R^*(T) = \sum_{t \in \tilde{T}} \text{Var}(t),
$$
where $\text{Var}(t)$ is the variance of the dataset assigned to node $t$.
| |
It is easy to prove that the resubstitution error rate $R(T)$ is biased downward,
i.e., minimizing it tends to produce bigger trees. Hence, we need to add a complexity penalty
to this resubstitution error rate. The penalty term favors smaller trees, and
hence balances with $R(T)$.
| |
| For any subtree $T < T_{\text{max}}$ , we will define its complexity as |
| $|\tilde{T}|$, the number of terminal or leaf nodes in $T$. Let $\alpha \ge 0$ |
| be a real number called the `complexity parameter' and define the cost-complexity |
| measure $R_{\alpha}(T)$ as |
| |
| \begin{equation} |
| \label{eq:cp} |
| R_{\alpha}(T) = R(T) + \alpha |\tilde{T}|\cdot R(T_1), |
| \end{equation} |
| |
where $R(T_1)$ is the risk computed for the tree with only one root
node. Note that textbooks usually do not have $R(T_1)$ in the
definition of the cost-complexity. According to the R package `rpart',
the above scaled version is unitless and thus much more user friendly
than the original CART formula, where there is no $R(T_1)$.
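As a hypothetical illustration of this scaled form: with the complexity
parameter (the \emph{cp} argument of \emph{pruning\_params}) set to
$\alpha = 0.01$ and a root-only tree that misclassifies $R(T_1) = 400$
training records, each additional split must reduce the number of
misclassified records by at least $\alpha R(T_1) = 4$ to be worth keeping.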
| |
Because we are using this scaled version, we can simplify the
computation of the risk for classification in Eq.~(\ref{eq:def_risk}).
In Eq.~(\ref{eq:def_risk}), we have $r =
n_{mis}/n$ and $p = n/N$, where $n_{mis}$ is the number of
mis-classified records, $n$ is the number of records classified into the
current node and $N$ is the total number of records. Clearly, in Eq.~(\ref{eq:cp}),
we can directly use $n_{mis}$ instead of $r \times p$.
| |
When there is a weight for each record, we should use the weights in the
calculation. For example, $n_{mis}$ should be replaced by the sum of the
weights of the mis-classified records.
| |
There is some logic in rpart's code that is not described anywhere. We
try to give a more detailed explanation here.

(1) For a node $i$, we compute the risk $r_i$ of this node, and the
total risk $s_i$ of all nodes in the subtree rooted at node $i$,
excluding node $i$ itself. The number of splits in
subtree-$i$ is $m_i = m_{i,left} + m_{i,right} + 1$. Here the
subscripts ``left'' and ``right'' denote the left and right child of node
$i$. Another important quantity is the cost-complexity itself, $c_i$.
| |
(2) Instead of directly applying Eq.~(\ref{eq:cp}), we use the
following logic: for node $i$, we compute the average improvement per
split in the subtree of node $i$,
| \begin{equation} |
| \bar{t} = \frac{r_i - s_{i,left} - s_{i,right}}{m_i}. |
| \end{equation} |
If $\bar{t} > c_{i,right} > c_{i,left}$, the improvement at this split
is more important than the splits in the sub-tree; since we already
keep the nodes in the sub-tree, we must keep the current split. Thus,
we re-compute the average improvement as follows:
| \begin{equation} |
| \bar{t} = \frac{r_i - r_{i,left} - s_{i,right}}{m_{i,right} + 1} . |
| \end{equation} |
| And then if $\bar{t} > c_{i,right} $, we need to do the same thing and |
| update |
| \begin{equation} |
| \bar{t} = r_i - r_{i,left} - r_{i,right} . |
| \end{equation} |
| And when $\bar{t} > \alpha R(T_1) $, we keep the split. |
| |
We use a similar method if instead $\bar{t} > c_{i,left} >
c_{i,right}$: we first deal with the right child, and then deal
with the left child.
| |
(3) We call the pruning function recursively. The question is when
to stop. We stop when the current node under examination is a leaf
node, or when the risk is smaller than the threshold $\alpha\times R(T_1)$
(no change can make the improvement larger than this threshold).
However, rpart uses a fairly complicated method to estimate the risk
that is compared against the threshold; see the code for more
details.
| |
In the end, the cost-complexity measure is a penalized version
of the resubstitution error rate, and this is the function to be minimized
when pruning the tree. In general, given a pre-selected $\alpha$, we
can find the subtree $T(\alpha)$ that minimizes $R_\alpha(T)$. Since
there are at most a finite number of subtrees of $T_{\text{max}}$,
$R_\alpha(T(\alpha))$ yields different values for only finitely many
$\alpha$'s. $T(\alpha)$ continues to be the minimizing tree as
$\alpha$ increases, until a jump point is reached.
| |
| \subsection{Cross-validation with the cost-complexity} |
| |
| The starting point for the pruning is not $T_{\text{max}}$, but rather |
| $T_1 = T(0)$, which is the smallest subtree of $T_{\text{max}}$ satisfying |
| $R(T_1) = R(T_{\text{max}})$. |
| |
| First, look at the biggest tree, $T_{\text{max}}$, and for any two terminal |
| nodes descended from the same parent, for instance $t_L$ and $t_R$, if they |
| yield the same resubstitution error rate as the parent node $t$, prune off these |
| two terminal nodes, that is, if $R(t) = R(t_L) + R(t_R)$, prune off $t_L$ and |
| $t_R$. This process is applied recursively. After we have pruned one pair of |
| terminal nodes, the tree shrinks a little bit. Then based on the smaller tree, |
| we do the same thing until we cannot find any pair of terminal nodes satisfying |
| this equality. The resulting tree at this point is $T_1$. |
| |
| We now find the next $\alpha$ using the following method. The new |
| $\alpha$ will result in a different optimal subtree. |
| |
| For any node $t \in T_1$, we can set $R_\alpha({t}) = R(t) + \alpha$, and for |
| the subtree starting at $t$, we can define $R_\alpha(T_t) = R(T_t) + \alpha |\tilde{T_t}|$. |
When $\alpha = 0$, $R_0(T_t) < R_0({t})$, but as $\alpha$ increases there is a point
at which $R_\alpha(T_t) = R_\alpha({t})$. We compute this exact $\alpha$ value by minimizing the
function
| \begin{equation*} |
| g_1(t) = \begin{cases} |
| \frac{R(t) - R(T_t)}{|\tilde{T_t}| - 1}, & \mbox{if } t \notin \tilde{T_1} \\ |
| \infty, & \mbox{if } t \in \tilde{T_1}. |
| \end{cases} |
| \end{equation*} |
| |
We find the weakest link $\bar{t}_1 \in T_1$ that achieves the minimum of
$g_1(t)$ and set the new $\alpha_2 = g_1(\bar{t}_1)$. To get the optimal subtree $T_2$,
we simply remove the branch growing out of $\bar{t}_1$. If there are several nodes
that simultaneously achieve the minimum $g_1(t)$, we remove the branch growing out of
each of these nodes. We repeat this process until we end up with only the root
node, i.e., a tree with a single node.
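As a hypothetical illustration of the weakest-link computation: for an internal
node $t$ with $R(t) = 10$ whose subtree $T_t$ has $|\tilde{T_t}| = 3$ leaves and
$R(T_t) = 4$, we get $g_1(t) = (10 - 4)/(3 - 1) = 3$; the internal node with the
smallest such value is the weakest link and determines the next $\alpha$.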
| |
To find which of these optimal subtrees is the best one, we can run
cross-validation for the optimal subtree of each $\alpha$ to obtain an average
test error estimate for each optimal subtree, and pick the one with the
lowest error.
| % subsection pruning (end) |
| |
| % subsection missing_predictor_data (end) |
| % subsection impurity_metrics (end) |
| % section cart (end) |
| % section training_algorithm_description (end) |
| |
| \section{Parallelism} % (fold) |
| \label{sec:parallelism} |
| |
Let $X = \{X_1, X_2, \dots, X_N\}$ be a set of attributes with domains
$D_{X_1}, D_{X_2}, \dots, D_{X_N}$. Let $Y$ be an output with domain $D_Y$.
Given a dataset $D^* = \{(x_i, y_i) \mid x_i \in D_{X_1} \times D_{X_2} \times \dots \times D_{X_N}, y_i \in D_Y\}$,
the goal is to learn a tree model that best approximates the true
distribution of $D^*$. However, finding the optimal tree is an NP-hard problem,
so most decision tree construction algorithms use a greedy top-down approach
as described in Algorithm~\ref{alg:buildsubtree}.
| |
| \begin{algorithm}[BuildSubtree$(n, D)$] \label{alg:buildsubtree} |
\alginput{Node $n$, Data $D \subseteq D^*$}
| \algoutput{A Decision Tree Model} |
| \begin{algorithmic}[1] |
| \State $\left(n \rightarrow split, D_L, D_R \right) = \text{FindBestSplit}(D)$ |
| \If{StoppingCriteria$(D_L)$} |
| \State $n \rightarrow \text{left\_prediction} = \text{FindPrediction}(D_L)$ |
| \Else |
| \State $\text{BuildSubtree}(n \rightarrow \text{left}, D_L)$ |
| \EndIf |
| \If{StoppingCriteria$(D_R)$} |
| \State $n \rightarrow \text{right\_prediction} = \text{FindPrediction}(D_R)$ |
| \Else |
| \State $\text{BuildSubtree}(n \rightarrow \text{right}, D_R)$ |
| \EndIf |
| \end{algorithmic} |
| \end{algorithm} |
| |
| %From the algorithm \ref{alg:build-subtree}, there are three major steps |
| %\begin{itemize} |
| %\item[1.] Initialization, which includes finding split candidates. |
| %\item[2.] Find best split for each node |
| %\item[3.] Find prediction for leaf nodes in the decision tree. |
| %\end{itemize} |
| |
Note that each call of Algorithm~\ref{alg:buildsubtree} requires at least
one pass over the dataset $D^*$, which means that the number of passes needed
for training a tree of depth $l$ is $O(2^l)$. This node-wise training incurs an I/O
cost that is exponential in the tree depth, and hence does not scale to large datasets.
| |
To scale to large datasets, our algorithm employs the level-wise training
strategy proposed in \cite{panda2009planet}. With each
pass over the training dataset, we find the best splits for all nodes at the same
level. Moreover, to reduce the search space when finding the best splits, we construct
an equal-depth binning of each attribute and only consider bin boundaries as split
candidates. In what follows, we describe the details of each step in the level-wise
decision tree training algorithm. The following assumptions hold in the level-wise
training algorithm:
| |
| \begin{itemize} |
\item[1.] The tree model can fit in memory.
\item[2.] The dataset is distributed over multiple machines (segments).
| \end{itemize} |
| |
| \subsection{Initialization} |
In the initialization step, we find split candidates by equal-depth binning of
each attribute. For continuous attributes, on large distributed datasets, we
perform a quantile calculation over a sampled fraction of the data to get an
approximate set of split candidates. The ordered splits create bins, and the
maximum number of such bins can be specified by the user. Note that the number of
bins cannot be greater than the number of training examples. The tree algorithm
automatically reduces the number of bins if this condition is not satisfied.
| |
For a categorical attribute with $M$ values in binary classification, the number of split
candidates can be reduced to $M-1$ by ordering the categorical attribute values
by the proportion of labels falling in one of the two classes. For example, for
a binary classification problem with one categorical feature with three
categories $A$, $B$ and $C$ whose corresponding proportions of label $1$ are $0.2$,
$0.6$ and $0.4$, the categorical feature values are ordered as $A$ followed by $C$
followed by $B$. The two split candidates are $A|C,B$ and $A,C|B$,
where $|$ denotes the split. For categorical variables in multiclass classification,
each bin is a category. The bins are then ordered by the
impurity of their corresponding label distributions.
| |
| Algorithm~\ref{alg:findSplitBins} describes binning for CART. |
| |
| \begin{algorithm}[findSplitBins$(D^*, MB)$] \label{alg:findSplitBins} |
| \alginput{Training dataset $D^*$, max bins for each feature $MB$} |
\algoutput{An equal-depth binning for each attribute in $D^*$}
| \begin{algorithmic}[1] |
| \If {$MB < \left|D^*\right|$} |
| \State $numBins = MB$ |
| \Else |
| \State $numBins = \left|D^*\right|$ |
| \EndIf |
| \If {$numBins * numBins < \left|D^*\right|$} |
| \State Sample fraction $f = numBins * numBins/{\left|D^*\right|}$ |
| \Else |
| \State Sample fraction $f = 1.0$ |
| \EndIf |
| \State Randomly select $f\left| D^* \right|$ records $R$ from $D^*$ and load $R$ into memory |
| \ForAll {$attr$ in $D^*$} |
| \If {$attr$ is continuous} |
| \State sort $R$ by $attr$, perform quantile computation |
| \Else |
| \If {binary classification} |
| \State compute centroid of labels for each value in $attr$ |
| \State sort $attr$ by centroid |
| \Else |
| \State compute impurity for each label for each value in $attr$ |
| \State sort $attr$ by impurity |
| \EndIf |
| \EndIf |
| \EndFor |
| \end{algorithmic} |
| \end{algorithm} |
| |
| \subsection{Find best splits} |
For a particular node, we want to find the split that maximizes the reduction in impurity.
In classification, we use Gini impurity or entropy as the impurity measure, and
in regression, we use variance as the impurity measure.

For classification, we need to compute the frequency $f_i$ of label $i$ at each
node in order to compute the Gini impurity and entropy.
| |
For regression, we need to compute
\begin{align*}
\text{Var}(D) = \frac{1}{n} \sum_{i} y_i^2 - \left(\frac{1}{n} \sum_{i}y_{i} \right)^2,
\end{align*}
where $y_{i}$ is the value of the output $Y$ for example $i$ and $n$ is the number of examples in $D$.
| One important observation is that both label frequency and variance can be |
| computed from sufficient statistics. |
| |
| For label frequency, the sufficient statistics we aggregate when passing over |
| the training examples are |
| \begin{itemize} |
\item The number of examples with label $i$: $N_i = \sum_j I(y_j = i)$
| \end{itemize} |
| For variance, the sufficient statistics we aggregate when passing over the |
| training examples are |
| \begin{itemize} |
\item The number of training examples $N = \sum_i 1$
| \item The sum of values $S = \sum_i y_i$ |
| \item The sum of square values $Q = \sum_i y_i^2$ |
| \end{itemize} |
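From these aggregates, the variance of a node's data is recovered as
$\text{Var} = Q/N - (S/N)^2$, and the statistics from two segments (or two
bins) merge by simple addition: $N = N_1 + N_2$, $S = S_1 + S_2$, $Q = Q_1 + Q_2$.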
| |
Because the computations of the above sufficient statistics are associative,
we have the following high-level algorithm for parallel decision tree
construction. Assume that we are training the nodes at level $l$. In each segment,
we store the aggregated values of the sufficient statistics needed to compute
impurity. As we pass over each record $r$ in each segment, we find the
corresponding bin for each attribute in $r$ and add to the aggregated
statistics. When all segments finish processing their data, the aggregated
sufficient statistics are merged. Once we have the overall aggregated
sufficient statistics, we find the split that maximizes the impurity reduction for
each node at level $l$.
| |
| \begin{algorithm}[findBestSplits$(D^*, M, n_l, B)$] \label{alg:findBestSplits} |
| \alginput{Training dataset $D^*$,\\ |
| Tree model $M$, \\ |
| Set of nodes at level $l$, $n_l$, Binning info $B$} |
\algoutput{Split that maximizes the impurity reduction for each node in $n_l$}
| \begin{algorithmic}[1] |
| \State On each segment, perform the following \Comment{Transition function} |
| \For{$d \in D^*$} |
| \For{$attr\_value \in d$} |
| \State Find the bin index in $B$ that $attr\_value$ belongs to |
| \State Add sufficient statistics $s$ to aggregate |
| \EndFor |
| \EndFor |
| \State Merge aggregated statistics on different segments \Comment{Merge function} |
\State Find the splits that maximize the impurity reduction for each node in $n_l$ \Comment{Final function}
| \end{algorithmic} |
| \end{algorithm} |
For Algorithm~\ref{alg:findBestSplits} to handle weights for each training example, when we add the sufficient statistics $s$ to the aggregate,
we instead add $w \times s$ to the aggregate, where $w$ is the weight of the training example.
| |
| % section parallelism (end) |