| % When using TeXShop on the Mac, let it know the root document. The following must be one of the first 20 lines. |
| %!TEX root = ../design.tex |
| |
| \chapter[Regression]{Regression} |
| \label{chap:regression} |
| \begin{moduleinfo} |
| \item[Authors] {Rahul Iyer and Hai Qian} |
| \item[History] |
| \begin{modulehistory} |
| \item[v0.3] Added section on Clustered Sandwich Estimators |
| \item[v0.2] Added section on Marginal Effects |
| \item[v0.1] Initial version, including background of regularization |
| \end{modulehistory} |
| \end{moduleinfo} |
| |
| \newcommand{\bS}[1]{\boldsymbol{#1}} |
| |
| |
| % Abstract. What is the problem we want to solve? |
| |
| Regression analysis is a statistical tool for the investigation of |
| relationships between variables. Usually, the investigator seeks to ascertain |
| the causal effect of one variable upon another - the effect of a price increase |
| upon demand, for example, or the effect of changes in the money supply upon the |
| inflation rate. More specifically, regression analysis helps one understand how |
| the typical value of the dependent variable changes when any one of the |
| independent variables is varied, while the other independent variables are held |
| fixed. |
| |
| Regression models involve the following variables: |
| \begin{enumerate} |
| \item The unknown parameters, denoted as $\beta$, which may represent a scalar or a vector. |
	\item The independent variables, $x$.
	\item The dependent variables, $y$.
| \end{enumerate} |
| |
| %\section{Linear Methods for Regression} % (fold) |
| %\label{sub:linear_methods_for_regression} |
| |
| % subsection linear_methods_for_regression (end) |
| |
| \section{Multinomial Logistic Regression}\label{sec:multi_intro} |
| |
| Multinomial logistic regression is a widely used regression analysis tool that |
| models the outcomes of categorical dependent random variables. |
\textit{Generalized linear models} identify key ideas shared by a broad class of
distributions, thereby extending techniques used in linear regression into the
field of logistic regression.
| |
| This document provides an overview of the theory of multinomial logistic |
| regression models followed by a design specification of how the model can be |
| estimated using maximum likelihood estimators. In the final section, we outline |
| a specific implementation of the algorithm that estimates multinomial logistic |
| regression models on large datasets. |
| |
| \subsection{Problem Description}\label{sec:multi_problem} |
| |
In this section, we set up the notation for a generic multinomial regression
| problem. Let us consider an $N$-dimensional multinomial random variable $\bS{Z}$ |
| that can take values in $J$ different categories, where $J \geq 2$. As input to |
| the problem, we are given an $N\times J$ matrix of observations $\bS{y}$. Here |
| $y_{i,j}$ denotes the observed value for the $j^{th}$ category of the random |
| variable $Z_i$. Analogous to the observed values, we define a set of parameters |
| $\bS{\pi}$ as an $N\times J$ matrix with each entry $\pi_{i,j}$ denoting the |
| probability of observing the random variable $Z_i$ to fall in the $j^{th}$ |
| category. In logistic regression, we assume that the random variables $\bS{Z}$ |
| are explained by a \textit{design matrix} of independent random variables $\bS{X}$ |
| which contains $N$ rows and $(K+1)$ columns. We define a regression coefficient |
| $\bS{\beta}$ as a matrix with $K+1$ rows and $J$ columns such that $\beta_{k,j}$ |
corresponds to the weight of the $k^{th}$ explanatory variable in predicting the
$j^{th}$ category.
| |
| For the multinomial regression model, we assume the observations $\bS{y}$ are |
| realizations of random variables $\bS{Z}$ which are explained using random |
| variables $\bS{X}$ and parameters $\bS{\beta}$. More specifically, if we |
| consider the $J^{th}$ category to be the `pivot' or `baseline' category, then |
| the log of the odds of an observation compared to the $J^{th}$ observation can |
| be predicted using a linear function of variables $\bS{X}$ and parameters |
| $\bS{\beta}$. |
| |
| \begin{equation} |
	\log \Big( \frac{\pi_{i,j}}{\pi_{i,J}}\Big)
    	= \log \Big( \frac{\pi_{i,j}}{1 - \sum_{j'=1}^{J-1}\pi_{i,j'}}\Big)
    	= \sum_{k=0}^{K} x_{i,k}\beta_{k,j}
| % \ \ \ \ \ \aI, \aJ |
| \end{equation} |
| |
| Solving for $\pi_{i,j}$, we have |
| \begin{subequations} |
| \label{eq:pi_def} |
| \begin{align} |
| \pi_{i,j} &= \frac{\exp \big( {\sum_{k=0}^{K} x_{i,k}\beta_{k,j}}\big)}{1 + \sum_{j=1}^{J-1}\exp{\big( \sum_{k=0}^{K} x_{i,k}\beta_{k,j}\big)}}\ \ \ \ \forall j < J\\ |
| \pi_{i,J} &= \frac{1}{1 + \sum_{j=1}^{J-1}\exp{\big( \sum_{k=0}^{K} x_{i,k}\beta_{k,j}\big)}} |
| \end{align} |
| \end{subequations} |
| |
| In a sentence, the goal of multinomial regression is to use observations |
| $\bS{y}$ to estimate parameters $\bS{\beta}$ that can predict random variables |
| $\bS{Z}$ using explanatory variables $\bS{X}$. |
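
To make the mapping from $\bS{X}$ and $\bS{\beta}$ to $\bS{\pi}$ in \eqref{eq:pi_def}
concrete, the following Python/NumPy sketch computes the probability matrix with the
$J^{th}$ category as the baseline. It is illustrative only, not the MADlib
implementation, and all function and variable names are our own.
\begin{verbatim}
import numpy as np

def multinomial_probabilities(X, beta):
    """X: (N, K+1) design matrix (column 0 is the intercept);
    beta: (K+1, J-1) coefficients for the non-baseline categories."""
    eta = X @ beta                                 # (N, J-1) linear predictors
    denom = 1.0 + np.exp(eta).sum(axis=1, keepdims=True)
    return np.hstack([np.exp(eta) / denom,         # pi_{i,j} for j < J
                      1.0 / denom])                # pi_{i,J} (baseline)

# Example: N = 4 observations, K = 2 features plus intercept, J = 3 categories
rng = np.random.default_rng(0)
X = np.hstack([np.ones((4, 1)), rng.normal(size=(4, 2))])
beta = rng.normal(size=(3, 2))
print(multinomial_probabilities(X, beta).sum(axis=1))  # each row sums to 1
\end{verbatim}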
| |
| \subsection{Parameter Estimation} |
| |
| We evaluate the parameters $\bS{\beta}$ using a maximum-likelihood estimator |
| (MLE) which maximizes the likelihood that a certain set of parameters predict |
| the given set of observations. For this, we define the following likelihood |
| function: |
| |
| \begin{align} |
| L(\bS{\beta}|\bS{y}) &\simeq \prod_{i=1}^{N}\prod_{j=1}^{J}{\pi_{i,j}}^{y_{i,j}} \\ |
| \intertext{Substituting \eqref{eq:pi_def} in the above expression, we have} |
            &= \prod_{i=1}^{N} \Big( 1 + \sum_{j'=1}^{J-1} e^{\sum_{k=0}^{K} x_{i,k}\beta_{k,j'}} \Big)^{-1} \prod_{j=1}^{J-1}  e^{y_{i,j} \sum_{k=0}^{K} x_{i,k}\beta_{k,j}}  \\
| \intertext{Taking the natural logarithm of $L(\bS{\beta}|\bS{y})$ defined above, we derive the following expression for the log-likelihood function, $l(\bS{\beta})$ as:} |
	l(\bS{\beta}) &= \sum_{i=1}^{N} \bigg[ \sum_{j=1}^{J-1} \Big( y_{i,j} \sum_{k=0}^{K} x_{i,k}\beta_{k,j} \Big) -  \log \Big( 1 +  \sum_{j=1}^{J-1} e^{\sum_{k=0}^{K} x_{i,k}\beta_{k,j}}\Big) \bigg] \label{eq:log_lik}
| \end{align} |
| |
The maximum likelihood estimator tries to maximize the log-likelihood function as
defined in Equation \eqref{eq:log_lik}. Unlike linear regression, the MLE has to
be obtained numerically. Since we plan to implement derivative-based algorithms
to solve $\max_{\bS{\beta}} l(\bS{\beta})$, we first derive expressions for the
| first and second derivatives of the log-likelihood function. |
| |
| We differentiate \eqref{eq:log_lik} with respect to each parameter $\beta_{k,j}$ |
| \begin{equation}\label{eq:first_derivative} |
	\frac{\partial l(\bS{\beta})}{\partial \beta_{k,j}} = \sum_{i=1}^{N} \big( y_{i,j} - \pi_{i,j} \big) x_{i,k}  \ \ \ \ \forall k \ \forall j
| \end{equation} |
| |
| We evaluate the extreme point of the function $l(\bS{\beta})$ by setting each |
| equation of \eqref{eq:first_derivative} to zero. We proceed on similar lines to |
derive the second order derivative of $l(\bS{\beta})$ with respect to two
| parameters $\beta_{k_1,j_1}$ and $\beta_{k_2,j_2}$ |
| |
| \begin{subequations} |
| \begin{align}\label{eq:second_derivative} |
| \frac{\partial^2 l(\bS{\beta})}{\partial \beta_{k_2,j_2} \partial \beta_{k_1,j_1}} |
| &= \sum_{i=1}^{N} -\pi_{i,j_2}x_{i,k_2}(1-\pi_{i,j_1})x_{i,k_1} &&j_1 = j_2 \\ |
| &= \sum_{i=1}^{N} \pi_{i,j_2}x_{i,k_2}\pi_{i,j_1}x_{i,k_1} &&j_1 \neq j_2 |
| \end{align} |
| \end{subequations} |
| |
| \subsection{Algorithms} |
| |
| Newton's method is a numerical recipe for root finding of non-linear functions. |
| We apply this method to solve all nonlinear equations produced by setting |
| \eqref{eq:first_derivative} to zero. Newton's method begins with an initial |
guess for the solution, after which it uses the Taylor series approximation of the
function at the current iterate to produce another estimate that might be closer
to the true solution. This iterative procedure has one of the fastest
theoretical rates of convergence in terms of the number of iterations, but it requires
second-derivative information to be computed at each iteration, which is considerably
more work than other lightweight derivative-based methods.
| |
Newton's method can be compactly described by its update step. Let us take
$\bS{\beta^0}$ as the initial guess for the MLE (denoted by $MLE$). If
$\bS{\beta^I}$ is the guess for $MLE$ at the $I^{th}$ iteration, then we can
evaluate $\bS{\beta^{I+1}}$ using
| |
| \begin{align}\label{eq:update} |
| \bS{\beta^{I+1}} &= \bS{\beta^{I}} - [l''(\bS{\beta^{I}})]^{-1} \ l'(\bS{\beta^{I}}) \\ |
| &= \bS{\beta^{I}} - [l''(\bS{\beta^{I}})]^{-1} \bS{X}^{T} (\bS{y} - \bS{\pi}) |
| \end{align} |
| |
where $\bS{X}^{T} (\bS{y} - \bS{\pi})$ is matrix notation for the first-order
derivative. Newton's method has a proven advantage in terms of the number of
iterations on small problem sizes, but it might not scale well to larger sizes
because it requires an expensive matrix inversion of the second-order
derivative at each iteration.
| |
As an aside, we observe that Equations \eqref{eq:first_derivative} and
\eqref{eq:second_derivative} refer to the first derivative in matrix form and
the second derivative in tensor form, respectively. In the implementation phase,
we work with `vectorized' versions of $\bS{\beta},\bS{X},\bS{y}, \bS{\pi}$
denoted by $\vec{\beta},\vec{X},\vec{y}, \vec{\pi}$ respectively, where the
matrices are stacked together in row-major format.
| |
| Using this notation, we can rewrite the first derivative in \eqref{eq:first_derivative} as: |
| \begin{equation}\label{eq:first_derivative_vectorized} |
| \frac{\partial l(\bS{\beta})}{\partial \vec{\beta}} = \vec{X}^{T} (\vec{y} - \vec{\pi}) |
| \end{equation} |
| Similarly, we can rewrite the second derivative in \eqref{eq:second_derivative} as: |
| \begin{equation}\label{eq:second_derivative_vectorized} |
| \frac{\partial^2 l(\bS{\beta})}{\partial^2 \vec{\beta}} = \vec{X}^{T} W \vec{X} |
| \end{equation} |
where $W$ is a diagonal weight matrix of dimension $(K+1) \times J$ whose
diagonal elements are set to $\pi_{i,j_1}\pi_{i,j_2}$ if $j_1 \neq j_2$ and to
$\pi_{i,j_1}(1-\pi_{i,j_2})$ otherwise. Note that
| \eqref{eq:second_derivative_vectorized} is merely a compact way to write |
| \eqref{eq:second_derivative}. |
| |
| The Newton method procedure is illustrated in Algorithm \ref{alg:Newton}. |
| \begin{algorithm} |
| % [Newton's Method] \label{alg:newton} |
| % \caption{Newton's Method} |
| \alginput{$\vec{X},\vec{Y}$ and an initial guess for parameters $\vec{\beta^{0}}$} |
| \algoutput{The maximum likelihood estimator $\beta_{MLE}$} |
| \begin{algorithmic}[1] |
| \State $I \leftarrow 0$ |
| \Repeat |
| \State Diagonal Weight matrix $W$: $w_{j_1,j_2} \leftarrow \pi_{i,j_1}\pi_{i,j_2}$ if $j_1 \neq j_2$ or $\pi_{i,j_1}(1-\pi_{i,j_2})$ otherwise |
| \State Compute $\vec{\beta}^{I+1}$ using: |
| \State $\vec{\beta}^{I+1} =\vec{\beta}^{I} - (\vec{X}^{T} W \vec{X})^{-1} \vec{X}^{T} (\vec{y} - \vec{\pi}) $ |
| \Until{$\vec{\beta}^{I+1}$ converges} |
| \end{algorithmic} |
| \label{alg:Newton} |
| \end{algorithm} |
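
To make the procedure concrete, the following Python/NumPy sketch implements
Algorithm~\ref{alg:Newton} for small, in-memory data; it is an illustrative
prototype rather than the MADlib implementation, and all function and variable
names are our own. The gradient follows \eqref{eq:first_derivative}, the Hessian
blocks follow \eqref{eq:second_derivative}, and the update follows \eqref{eq:update}.
\begin{verbatim}
import numpy as np

def newton_multinomial(X, Y, max_iter=25, tol=1e-8):
    """X: (N, K+1) design matrix; Y: (N, J) matrix of 0/1 observations."""
    N, Kp1 = X.shape
    J = Y.shape[1]
    beta = np.zeros((Kp1, J - 1))
    for _ in range(max_iter):
        eta = X @ beta
        denom = 1.0 + np.exp(eta).sum(axis=1, keepdims=True)
        P = np.exp(eta) / denom                      # pi_{i,j} for j < J
        grad = (X.T @ (Y[:, :J - 1] - P)).ravel(order='F')
        # Hessian: (J-1) x (J-1) blocks, each of size (K+1) x (K+1)
        H = np.zeros((Kp1 * (J - 1), Kp1 * (J - 1)))
        for j1 in range(J - 1):
            for j2 in range(J - 1):
                w = P[:, j1] * ((j1 == j2) - P[:, j2])
                H[j1*Kp1:(j1+1)*Kp1, j2*Kp1:(j2+1)*Kp1] = -(X.T * w) @ X
        step = np.linalg.solve(H, grad)              # [l'']^{-1} l'
        beta_new = beta - step.reshape((Kp1, J - 1), order='F')
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
\end{verbatim}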
| |
| \subsection{Common Statistics}\label{sec:stats} |
Irrespective of the solver that we choose to implement, we need to calculate the standard errors and p-values.
| |
Asymptotic theory tells us that the MLE is an asymptotically efficient estimator, meaning that it reaches the following Cram\'er-Rao lower bound:
| \begin{equation} |
| \sqrt{n}(\beta_{MLE} - \beta) \rightarrow \mathcal{N}(0,I^{-1}) |
| \end{equation} |
where $I$ is the Fisher information matrix. Hence, we evaluate the standard error of the $i^{th}$ parameter (in vector form) of the MLE using its asymptotic variance as:
\begin{equation}
	se(\vec{\beta}_i) = \sqrt{\Big[ (\vec{X}^{T} W \vec{X})^{-1} \Big]_{ii}}
\end{equation}
| |
| The Wald statistic is used to assess the significance of coefficients $\beta$. |
| In the Wald test, the MLE is compared with the null hypothesis while assuming |
| that the difference between them is approximately normally distributed. The Wald |
| p-value for a coefficient provides us with the probability of seeing a value as |
extreme as the one observed, when the null hypothesis is true. We evaluate the Wald
| p-value as: |
| |
| \begin{equation} |
	p_i = \mbox{Pr}\Big(|Z| \geq \big|\frac{\vec{\beta}_i}{se(\vec{\beta}_i)} \big| \Big) = 2 \Big[1 - F\Big(\big|\frac{\vec{\beta}_i}{se(\vec{\beta}_i)}\big|\Big) \Big]
| \end{equation} |
| where $Z$ is a normally distributed random variable and $F$ represents the cdf of $Z$. |
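
The following sketch shows how these statistics can be computed once the matrix
$\vec{X}^T W \vec{X}$ is available; it is illustrative only and uses SciPy's normal
CDF for $F$.
\begin{verbatim}
import numpy as np
from scipy.stats import norm

def wald_statistics(beta_vec, info_matrix):
    """beta_vec: flattened MLE coefficients; info_matrix: X^T W X."""
    cov = np.linalg.inv(info_matrix)      # asymptotic covariance of the MLE
    std_err = np.sqrt(np.diag(cov))       # standard errors
    z = beta_vec / std_err                # Wald z-statistics
    p_values = 2.0 * (1.0 - norm.cdf(np.abs(z)))
    return std_err, z, p_values
\end{verbatim}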
| |
| |
| \section{Implementation}\label{sec:implem} |
| |
| For the implementation, we plan to mimic the framework used for the current |
| implementation of the logistic regression. In this framework, the regression is |
| broken up into 3 steps, denoted as the \textit{transition} step, the |
| \textit{merge states} step, and the \textit{final} step. Much of the |
| computation is done in parallel, and each transition step operates on a small |
| number of rows in the data matrix to compute a portion of the calculation for |
| $\vec{X}^T W\vec{X}$, as well as the calculation for $\vec{X}^TW\vec{\alpha}$. |
| This is used in the Hessian calculation in equation |
| \ref{eq:second_derivative_vectorized}. |
| |
| The merge step aggregates these transition steps together. This consists of |
| summing the $X^T WX$ calculations, summing the $\vec{X}^TW\vec{\alpha}$ |
| calculations, and updating the bookkeeping. The final step computes the |
| current solution $\vec{\beta}$, and returns it and its associated statistics. |
| |
| The three steps in the framework communicate by passing around \textit{state} objects. Each state object carries with it several fields: |
| \begin{description} |
| \item [coef] This is the column vector containing the current solution. This is the $\beta$ part of $\beta^TX$ with the exception of $\beta_0$. |
| \item [beta] The scalar constant term $\beta_0$ in $\beta^TX$. |
| \item [widthOfX] The number of features in the data matrix. |
| \item [numRows] The number of data points being operated on. |
	\item [dir] The direction of the gradient in the $k$th step.
| \item [grad] The gradient vector in the $k$th step. |
	\item [gradNew] The gradient vector in the $(k+1)$th step.
| \item [X\_transp\_AX] The current sum of $\vec{X}^T W\vec{X}$. |
| \item [X\_transp\_Az] The current sum of $\vec{X}^TW\vec{\alpha}$. |
| \item [logLikelihood] The log likelihood of the solution. |
| \end{description} |
| |
| Each transition step is given an initial state, and returns a state with the updated X\_transp\_AX and X\_transp\_Az fields. These states are iteratively passed to a merge step, which combines them two at a time. The final product is then passed to the final step. We expect to use the same system, or something similar. |
| |
| We can formalize the API's for these three steps as: |
| \begin{verbatim} |
| multilogregr_irls_step_transition(AnyType *Args) |
| multilogregr_irls_step_merge(AnyType *Args) |
| multilogregr_irls_step_final(AnyType *Args) |
| \end{verbatim} |
| The \texttt{AnyType} object is a generic type used in madlib to retrieve and return values from the backend. Among other things, an \texttt{AnyType} object can be a \texttt{NULL} value, a scalar, or a state object. |
| |
The first step, \texttt{multilogregr\_irls\_step\_transition}, expects \texttt{*Args} to contain the following items in the following order: a state object for the current state, a vector $x$ containing a row of the design matrix, a vector $y$ containing the values of $\bS{Z}$ for this row, and a state object for the previous state. The return value for this function will be another state object.
| |
The second step, \texttt{multilogregr\_irls\_step\_merge}, expects \texttt{*Args} to contain two state objects, and returns a state object expressing the merging of the two input objects. The final step, \texttt{multilogregr\_irls\_step\_final}, expects a single state object and returns an \texttt{AnyType} object containing the solution's coefficients, the standard errors, and the solution's p-values.
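
As a rough illustration of this pattern (not the actual C++ \texttt{AnyType}
interface), the following Python sketch shows how the transition, merge, and final
steps could accumulate $\vec{X}^T W \vec{X}$ and $\vec{X}^T W \vec{\alpha}$; the
state fields and helper names are assumptions for the sketch only.
\begin{verbatim}
import numpy as np

def transition(state, x_row, weight, alpha_row):
    """Fold one row of the design matrix into the running sums."""
    state['XtWX'] += weight * np.outer(x_row, x_row)
    state['XtWalpha'] += weight * alpha_row * x_row
    state['numRows'] += 1
    return state

def merge(a, b):
    """Combine the partial aggregates from two data segments."""
    return {'XtWX': a['XtWX'] + b['XtWX'],
            'XtWalpha': a['XtWalpha'] + b['XtWalpha'],
            'numRows': a['numRows'] + b['numRows']}

def final(state):
    """Solve for the current iterate from the aggregated sums."""
    return np.linalg.solve(state['XtWX'], state['XtWalpha'])
\end{verbatim}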
| |
| |
| \section{Regularization} % (fold) |
| \label{sub:regularization} |
| |
Consider a linear system $A x = y$. Usually, $y$ is the result of measurements
contaminated by small errors (noise). Frequently, ill-conditioned or singular
systems also arise in the iterative solution of nonlinear systems or optimization
problems. In all such situations, the vector $x = A^{-1}y$ (or, in the full-rank
overdetermined case, $A^+ y$, with the pseudoinverse $A^+ = (A^T A)^{-1}A^T$),
if it exists at all, is usually a meaningless, bad approximation to the true $x$.
| |
| Regularization techniques are needed to obtain meaningful solution estimates |
| for such ill-posed problems, and in particular when the number of parameters |
| is larger than the number of available measurements, so that standard least |
| squares techniques break down. |
| |
| \subsection{Linear Ridge Regression} |
| Ridge regression is the most commonly used method of regularization of |
| ill-posed problems. Mathematically, it seeks to minimize |
| |
\begin{equation}
	Q\left(\vec{w},w_0;\lambda\right)\equiv \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - w_0 -
	  \vec{w} \cdot \vec{x}_i \right)^2
	+\frac{\lambda}{2}\|\vec{w}\|_2^2\ ,
\end{equation}
over $\vec{w}$ and $w_0$ for a given value of $\lambda$, where $\vec{w}$ and $w_0$ are the fitting coefficients, and $\lambda$
| is a non-negative regularization parameter. $\vec{w}$ is a vector in |
| $d$ dimensional space, and |
| \begin{equation} |
| \|\vec{w}\|_2^2 = \sum_{j=1}^{d}w_j^2 = \vec{w}^T\vec{w}\ . |
| \end{equation} |
| When $\lambda = 0$, $Q$ is |
| the mean squared error of the fitting. |
| |
| The intercept term $w_0$ is not regularized, because this term is |
| fully decided by the mean values of $y_i$ and $\vec{x}_i$ and the |
| values of $\vec{w}$, and does not affect the model's complexity. |
| |
| $Q\left(\vec{w},w_0;\lambda\right)$ is a quadratic function of $\vec{w}$ and |
| $w_0$, and thus can be solved analytically |
| \begin{equation} |
| \vec{w}_{ridge}=\left(\lambda\vec{I}_d + |
| \vec{X}^T\vec{X}\right)^{-1}\vec{X}^T\vec{y}\ . |
| \end{equation} |
By using the available Newton method (Sec. 6.2.4), the above quantity can be
calculated in a single Newton step.
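
The following NumPy sketch computes the closed-form ridge solution above; it is
illustrative only. Centering the data is one common way of leaving the intercept
$w_0$ unregularized, consistent with the discussion above.
\begin{verbatim}
import numpy as np

def ridge_fit(X, y, lam):
    """X: (N, d) feature matrix (no intercept column); y: (N,) targets."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean       # center so that w0 is not penalized
    d = Xc.shape[1]
    w = np.linalg.solve(lam * np.eye(d) + Xc.T @ Xc, Xc.T @ yc)
    w0 = y_mean - x_mean @ w              # recover the intercept
    return w, w0
\end{verbatim}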
| |
Many packages for ridge regularization actually regularize the fitting
coefficients not for the model fit to the original data, but for a model fit to
scaled data. MADlib also provides this option. When the normalization
parameter is set to True (it is False by default), the data are first
converted as follows before applying the ridge regularization.
| \begin{equation} |
| x_i' \leftarrow \frac{x_i - \langle x_i \rangle}{\langle (x_i - |
| \langle x_i \rangle)^2\rangle} \ , |
| \end{equation} |
| \begin{equation} |
| y_i \leftarrow y_i - \langle y_i \rangle \ , |
| \end{equation} |
| where $\langle \cdot \rangle = \sum_{i=1}^{N} \cdot / N$. |
| |
Note that ridge regression on scaled data is not equivalent to ridge regression on the original data.
| |
| \subsection{Elastic Net Regularization} % (fold) |
| \label{ssub:elastic_net_regularization} |
As a continuous shrinkage method, ridge regression achieves good prediction
performance through a bias-variance trade-off. However, ridge
| regression cannot produce a parsimonious model, for it always keeps all the |
| predictors in the model~\cite{zou2005}. Best subset selection in contrast |
| produces a sparse model, but it is extremely variable because of its inherent |
| discreteness. |
| |
| A promising technique called the lasso was proposed by Tibshirani (1996). The |
| lasso is a penalized least squares method imposing an L1-penalty on the |
| regression coefficients. Owing to the nature of the L1-penalty, the lasso does |
| both continuous shrinkage and automatic variable selection simultaneously. |
| |
| Although the lasso has shown success in many situations, it has some |
| limitations. Consider the following three scenarios: |
| \begin{enumerate} |
	\item In the `number of features' ($p$) $\gg$ `number of observations' ($n$) case,
| the lasso selects at most $n$ variables before it saturates, because of the |
| nature of the convex optimization problem. This seems to be a limiting |
| feature for a variable selection method. Moreover, the lasso is not well |
| defined unless the bound on the $L_1$-norm of the coefficients is smaller |
| than a certain value. |
| |
| \item If there is a group of variables among which the pairwise correlations are |
| very high, then the lasso tends to select only one variable from the group |
| and does not care which one is selected. |
| |
	\item For usual $n > p$ situations, if there are high correlations between predictors,
| it has been empirically observed that the prediction performance of the |
| lasso is dominated by ridge regression. |
| |
| \end{enumerate} |
| |
| These scenarios make lasso an inappropriate variable selection method in some situations. |
| |
Hui Zou and Trevor Hastie~\cite{zou2005} introduce a new regularization
| technique called the `elastic net'. Similar to the lasso, the elastic net |
| simultaneously does automatic variable selection and continuous |
| shrinkage, and it can select groups of correlated variables. It is like a |
| stretchable fishing net that retains `all the big fish'. |
| |
| The elastic net regularization minimizes the following target function |
| \begin{equation} \label{eq:target} |
	\min_{\vec{w} \in R^N}L(\vec{w}) + \lambda \left[\frac{1-\alpha}{2}\|\vec{w}\|_2^2 +
      \alpha \|\vec{w}\|_1\right]\ ,
| \end{equation} |
| where $\|\vec{w}\|_1 = \sum_{i=1}^N|w_i|$ and $N$ is the number of features. |
| |
| For the elastic net regularization on linear models, |
| \begin{equation} |
| L(\vec{w}) = \frac{1}{2M}\sum_{m=1}^M\left(y_m - w_0 - \vec{w} \cdot |
| \vec{x}_m\right)^2\ , |
| \end{equation} |
where the sum is over all observations and $M$ is the total number of
observations.
| |
| For the elastic net regularization on logistic models, |
| \begin{equation} |
| L(\vec{w}) = \sum_{m=1}^M\left[y_m \log\left(1 + e^{-(w_0 + |
| \vec{w}\cdot\vec{x}_m)}\right) + (1-y_m) \log\left(1 + e^{w_0 + |
| \vec{w}\cdot\vec{x}_m}\right)\right]\ , |
| \end{equation} |
| where $y_m \in \{0,1\}$. |
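
For concreteness, a short NumPy sketch of the elastic net objective with the two
losses defined above (illustrative only; the function names are our own):
\begin{verbatim}
import numpy as np

def enet_penalty(w, lam, alpha):
    return lam * ((1.0 - alpha) / 2.0 * np.dot(w, w) + alpha * np.abs(w).sum())

def linear_loss(w, w0, X, y):
    r = y - w0 - X @ w
    return 0.5 * np.mean(r ** 2)

def logistic_loss(w, w0, X, y):          # y in {0, 1}
    z = w0 + X @ w
    return np.sum(y * np.log1p(np.exp(-z)) + (1.0 - y) * np.log1p(np.exp(z)))

def objective(loss, w, w0, X, y, lam, alpha):
    return loss(w, w0, X, y) + enet_penalty(w, lam, alpha)
\end{verbatim}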
| |
| \subsubsection{Optimizer Algorithms} |
We currently support two optimizer algorithms. The default one is
FISTA, and the other is IGD.
| |
| \paragraph{FISTA} |
| |
| Fast Iterative Shrinkage Thresholding Algorithm (FISTA) with |
| \textbf{backtracking step size} [4]: |
| \vspace{0.2in} |
| \hrule |
| \vspace{0.2in} |
| \textbf{Step $0$}: Choose $\delta>0$ and $\eta > 1$, and |
| $\vec{w}^{(0)} \in R^N$. Set $\vec{v}^{(1)}=\vec{w}^{(0)}$ and |
| $t_1=1$. |
| |
\textbf{Step $k$}: ($k \ge 1$) Find the smallest nonnegative integer
$i_k$ such that, with $\bar{\delta} = \delta_{k-1}/\eta^{i_k-1}$,
| \begin{equation} |
    F(p_{\bar{\delta}}(\vec{v}^{(k)})) \le
    Q_{\bar{\delta}}(p_{\bar{\delta}}(\vec{v}^{(k)}), \vec{v}^{(k)})\ .
| \end{equation} |
| Set $\delta_k = \delta_{k-1}/\eta^{i_k-1}$ and compute |
| \begin{equation} |
| \vec{w}^{(k)} = p_{\delta_k}\left(\vec{v}^{(k)}\right)\ , |
| \end{equation} |
| \begin{equation} |
| t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}\ , |
| \end{equation} |
| \begin{equation} |
| \vec{v}^{(k+1)} = \vec{w}^{(k)} + |
| \frac{t_k-1}{t_{k+1}}\left(\vec{w}^{(k)} - \vec{w}^{(k-1)}\right)\ . |
| \end{equation} |
| \vspace{0.2in} |
| \hrule |
| \vspace{0.2in} |
| Here, |
| \begin{equation} |
| F(\vec{w}) = f(\vec{w}) + g(\vec{w})\ , |
| \end{equation} |
| where $f(\vec{w})$ is the differentiable part of |
| Eq. (\ref{eq:target}) and $g(\vec{w})$ is the non-differentiable part. |
| For linear models, |
| \begin{equation} |
| f(\vec{w}) = \frac{1}{2M}\sum_{m=1}^M\left(y_m - w_0 - \vec{w} \cdot |
| \vec{x}_m\right)^2 + \frac{\lambda(1-\alpha)}{2}\|\vec{w}\|_2^2\ , |
| \end{equation} |
| and for logistic models, |
| \begin{equation} |
| f(\vec{w}) = \sum_{m=1}^M\left[y_m \log\left(1 + e^{-(w_0 + |
| \vec{w}\cdot\vec{x}_m)}\right) + (1-y_m) \log\left(1 + e^{w_0 + |
| \vec{w}\cdot\vec{x}_m}\right)\right] + \frac{\lambda(1-\alpha)}{2}\|\vec{w}\|_2^2\ . |
| \end{equation} |
| And for both types of models, |
| \begin{equation} |
| g(\vec{w}) = \lambda\alpha\sum_{i=1}^N|w_i|\ . |
| \end{equation} |
| |
| And |
| \begin{equation} |
| Q_{\delta}(\vec{a}, \vec{b}) := f(\vec{b}) + \langle \vec{a} - |
| \vec{b}, \nabla f(\vec{b})\rangle + |
| \frac{1}{2\delta}\|\vec{a} - \vec{b}\|^2 + g(\vec{a})\ , |
| \end{equation} |
where $\langle\cdot,\cdot\rangle$ is the usual vector dot product.
| |
| And the proxy function is defined as |
| \begin{equation} |
| p_\delta (\vec{v}) := \underset{\vec{w}}{\operatorname{arg\,min}} \left[g(\vec{w}) + |
| \frac{1}{2\delta}\left\|\vec{w} - \left(\vec{v} - \delta\,\nabla f(\vec{v})\right)\right\|^2 \right] |
| \end{equation} |
| For our case, where $g(\vec{w}) = \lambda\alpha\sum_{i=1}^N|w_i|$, the |
| proxy function is simply equal to the soft-thresholding function |
| |
| \begin{equation} |
  p_\delta (v_i) = \left\{ \begin{array}{ll}
      u_i - \lambda\alpha\delta\ , \quad & \mbox{if } u_i > \lambda\alpha\delta \\
      0\ , \quad & \mbox{if } |u_i| \le \lambda\alpha\delta \\
      u_i + \lambda\alpha\delta\ , \quad & \mbox{if } u_i < -\lambda\alpha\delta
    \end{array}
  \right.
| \end{equation} |
| where |
| \begin{equation} |
| \vec{u} = \vec{v} - \delta\,\nabla f(\vec{v})\ . |
| \end{equation} |
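
The following NumPy sketch illustrates the FISTA iteration with the
soft-thresholding proxy above. For brevity it uses a fixed step size $\delta$
instead of the backtracking search of Step $k$; the function names are our own.
\begin{verbatim}
import numpy as np

def soft_threshold(u, thresh):
    """Proxy for g(w) = lambda*alpha*||w||_1: component-wise soft-thresholding."""
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)

def fista(grad_f, w0, delta, lam, alpha, n_iter=100):
    """Basic FISTA with a fixed step size delta (backtracking omitted)."""
    w_prev = w0.copy()
    v = w0.copy()
    t = 1.0
    for _ in range(n_iter):
        u = v - delta * grad_f(v)                    # gradient step on f
        w = soft_threshold(u, lam * alpha * delta)   # p_delta(v)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        v = w + (t - 1.0) / t_next * (w - w_prev)    # momentum step
        w_prev, t = w, t_next
    return w_prev
\end{verbatim}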
| |
\textbf{Active set method} is used in our implementation of FISTA to speed up the
computation. Considerable speedup is obtained by organizing the iterations
around the active set of features, namely those with nonzero coefficients. After a
complete cycle through all the variables, we iterate over only the active set until
convergence. If another complete cycle does not change the active set, we are
done; otherwise the process is repeated.
| |
\textbf{Warm-up method} is also used to speed up the computation. When the option
parameter \emph{warmup} is set to True, a strictly decreasing series of lambda
values, ending at the lambda value that the user wants to calculate, is used. Each
larger lambda gives a very sparse solution, and that sparse solution is used as the
initial guess for the next lambda's solution, which speeds up the computation for
the next lambda. For larger data sets, this can sometimes accelerate the whole
computation and might be faster than computing only the one lambda value directly.
| |
\textbf{Note:} Our implementation differs slightly from the original
FISTA. In the original FISTA, during the backtracking procedure the algorithm
searches for a non-negative integer $i_k$ and the new step size is $\delta_k
= \delta_{k-1}/\eta^{i_k}$, so the step size is non-increasing. Here, we allow
the step size to increase by using $\delta_k = \delta_{k-1}/\eta^{i_k-1}$, so
that larger step sizes can be tried by the algorithm. Tests show that this is
actually faster.
| |
| \paragraph{IGD} |
| |
The Incremental Gradient Descent (IGD) algorithm is stochastic by nature, so it
is difficult to obtain sparse solutions with it. What we implemented is the
Stochastic Mirror Descent Algorithm made Sparse (SMIDAS). The inverse $p$-norm
link function is used
| |
| \begin{equation} \label{eq:f} |
| h_j^{-1}(\vec{\theta}) = \frac{\mbox{sign}(\theta_j)\vert \theta_j |
| \vert^{p-1}}{\|\theta\|_p^{p-2}}\ , |
| \end{equation} |
| where |
| \begin{equation} |
	\|\theta\|_p = \left(\sum_j \vert \theta_j \vert ^p\right)^{1/p}\ ,
| \end{equation} |
| and $\mbox{sign}(0) = 0$. |
| \vspace{0.2in} |
| \hrule |
| \vspace{0.2in} |
| Choose step size $\delta > 0$. |
| |
Let $p = 2\log d$, where $d$ is the number of features, and let $h^{-1}$ be as in Eq. (\ref{eq:f})
| |
| Let $\vec{\theta} = \vec{0}$ |
| |
| \textbf{for} $k = 1,2,\dots$ |
| |
| \quad $\vec{w} = h^{-1}(\vec{\theta})$ |
| |
| \quad $\vec{v} = \nabla f(\vec{w})$ |
| |
| \quad $\tilde{\vec{\theta}} = \theta - \delta \, \vec{v}$ |
| |
| \quad $\forall j, \theta_j = \mbox{sign}(\tilde{\theta}_j)\max (0, |
| \vert \tilde{\theta}_j \vert- \lambda\alpha\delta)$ |
| \vspace{0.2in} |
| \hrule |
| \vspace{0.2in} |
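
The following NumPy sketch illustrates the SMIDAS update described above; it is
illustrative only, and \texttt{grad\_f}, the gradient of the smooth part of the
objective, is assumed to be supplied by the caller.
\begin{verbatim}
import numpy as np

def inverse_p_link(theta, p):
    """h^{-1}(theta)_j = sign(theta_j)|theta_j|^(p-1) / ||theta||_p^(p-2)."""
    norm_p = np.sum(np.abs(theta) ** p) ** (1.0 / p)
    if norm_p == 0.0:
        return np.zeros_like(theta)
    return np.sign(theta) * np.abs(theta) ** (p - 1) / norm_p ** (p - 2)

def smidas(grad_f, d, delta, lam, alpha, n_iter=100):
    """d: number of features; grad_f: gradient of the smooth part f."""
    p = 2.0 * np.log(d)                  # as prescribed above (assumes d > 1)
    theta = np.zeros(d)
    for _ in range(n_iter):
        w = inverse_p_link(theta, p)
        theta_tilde = theta - delta * grad_f(w)
        theta = np.sign(theta_tilde) * np.maximum(
            np.abs(theta_tilde) - lam * alpha * delta, 0.0)
    return inverse_p_link(theta, p)
\end{verbatim}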
| |
The resulting fitting coefficients of this algorithm are not truly
sparse, but the values are very small (usually $< 10^{-15}$), so they can
be safely set to zero after filtering with a threshold.
| |
This is done as follows: (1) multiply each coefficient by the
standard deviation of the corresponding feature, (2) compute the
average of the absolute values of the re-scaled coefficients, (3) divide each
re-scaled coefficient by that average, and if the resulting absolute
value is smaller than \emph{threshold}, set the original coefficient
to zero.
| |
IGD is inherently a sequential algorithm. When run in a
distributed fashion, each segment of the data runs its own SGD model, and
the models are averaged to obtain a model for each iteration. This
averaging might slow down convergence, although it gives us the
ability to process large data sets on multiple machines. Therefore this
algorithm provides an option \emph{parallel} to let the user choose
whether to do the computation in parallel.
| |
| IGD also implements the \textbf{warm-up method}. |
| |
| \textbf{Stopping Criteria} Both \textbf{FISTA} and \textbf{IGD} compute the average difference |
| between the coefficients of two consecutive iterations, and if the |
| difference is smaller than \emph{tolerance} or the iteration number |
| is larger than \emph{max\_iter}, the computation stops. |
| |
| |
| |
| We note that the Hessian and derivative both depend on $\beta$, so the variance estimates will also depend on $\beta$. Rather than allow the user to specify a $\beta$ value, the implementation computes the optimal $\beta$ by running the appropriate regression before computing the robust variance. In cases where the regression has parameters (regression tolerance, max iterations), the interface allows the user to specify those parameters. |
| |
| % subsubsection elastic_net_regularization (end) |
| % subsection regularization (end) |
| |
| \section{Robust Variance via Huber-White Sandwich Estimators} |
Given $N$ data points, the $i$th point is defined by a feature vector $x_i \in \mathbb{R}^M$ and a category scalar $y_i$, where $y_i \in \mathbb{R}$, $y_i \in \{0,1 \}$, and $y_i \in \{1,\dots, L \}$ for linear, logistic, and multi-logistic regression respectively. We assume that $y_i$ is drawn from an independent and identically distributed (i.i.d.) distribution determined by a $K$-dimensional parameter vector $\beta$ (which is $K\times L$ dimensional for multi-logistic regression).
| |
| Generally, we are interested in finding the values of $\beta$ that best predict $y_i$ from $x_i$, with \textit{best} being defined as the values that maximize some log-likelihood function $l(y,x,\beta)$. The maximization is typically solved using the derivative of the likelihood $\psi$ and the Hessian $H$. More formally, $\psi$ is defined as |
| \begin{align} |
| \psi(y,x, \beta) = \frac{\partial l(x,y,\beta)}{\partial \beta} |
| \end{align} |
| and $H$ is defined as |
| \begin{align} |
| H(y,x, \beta) = \frac{\partial^2 l(x,y,\beta)}{\partial \beta^2}. |
| \end{align} |
| Using these derivatives, one can solve the logistic or linear regression for the optimal solution, and compute the variance and other statistics of the regression. |
| |
| \subsection{Sandwich Operators} |
However, we may believe that the underlying distribution is not i.i.d. (in particular, the variance is not independent of $x_i$), in which case our variance estimates will be incorrect.
| The Huber sandwich estimator is used to get a robust estimate of the variance even if the i.i.d. assumption is wrong. The estimator is known as a sandwich estimator because the robust covariance matrix $S(\beta)$ of $\beta$ can be expressed in a \textit{sandwich formulation}, of the form |
| \begin{align} |
| S(\beta) = B(\beta) M(\beta) B(\beta). |
| \end{align} |
| The $B(\beta)$ matrix is commonly called the \textit{bread}, whereas the $M(\beta)$ matrix is the \textit{meat}. |
| |
| \paragraph{The Bread} |
| |
| Computing $B$ is relatively straightforward, |
| \begin{align} |
| B(\beta) = N\left(\sum_i^N -H(y_i, x_i, \beta) \right)^{-1} |
| \end{align} |
| |
| \paragraph{The Meat} |
| |
| There are several choices for the $M$ matrix, each with different robustness properties. In the Huber-White estimator, the matrix $M$ is defined as |
| \begin{align} |
| M_{H} =\frac{1}{N} \sum_i^N \psi(y_i,x_i, \beta)^T \psi(y_i,x_i, \beta). |
| \end{align} |
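
As a sketch, the sandwich covariance can be assembled from per-observation scores
$\psi(y_i, x_i, \beta)$ and Hessians $H(y_i, x_i, \beta)$ as follows (illustrative
only; the array shapes are assumptions):
\begin{verbatim}
import numpy as np

def sandwich_covariance(psi, hessians):
    """psi: (N, K) per-row score vectors; hessians: (N, K, K) per-row Hessians."""
    N = psi.shape[0]
    bread = N * np.linalg.inv(-hessians.sum(axis=0))   # B(beta)
    meat = (psi.T @ psi) / N                            # M_H(beta)
    return bread @ meat @ bread                         # S(beta) = B M B
\end{verbatim}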
| |
| |
| \subsection{Implementation} |
| |
| The Huber-White sandwich estimators implemented for linear, logistic, and multinomial-logistic regression mimic the same framework as the linear/logistic regression implementations. In these implementations, the gradient and Hessian are computed in parallel, and a final step operates on the aggregate result. |
| |
| This framework breaks the computation into three steps: \textit{transition} step, the \textit{merge states} step, and the \textit{final} step. The transition step computes the gradient and Hessian contribution from each row in the data matrix. To compute this step, we need to define the derivatives for each regression. |
| \subsubsection{Linear Regression Derivatives} |
| For linear regression, the derivative is |
| \begin{align} |
| \frac{\partial l(x,y,\beta)}{\partial \beta} = X^T(y - X \beta) |
| \end{align} |
| and the Hessian is |
| \begin{align} |
| \frac{\partial^2 l(x,y,\beta)}{\partial^2 \beta} = -X^TX. |
| \end{align} |
| |
| \subsubsection{Logistic Regression Derivatives} |
For logistic regression, with $y_i \in \{0, 1\}$, the derivative is
\begin{align}
\frac{\partial l(x,y,\beta)}{\partial \beta} = \sum_{i=1}^N \left( y_i - \frac{1}{1 + \exp(-\beta^Tx_i)} \right) x_i
\end{align}
and the Hessian is
\begin{align}
\frac{\partial^2 l(x,y,\beta)}{\partial^2 \beta} = -X^TAX = -\sum_{i=1}^N A_{ii} x_i x_i^T
\end{align}
where $A$ is a diagonal matrix with
\begin{align}
A_{ii} =  \left[ \left(1 + \exp(\beta^Tx_i) \right)\left(1 + \exp(-\beta^Tx_i)\right) \right]^{-1}.
\end{align}
| |
| \subsubsection{Multi-Logistic Regression Derivatives} |
| |
For multi-logistic regression, we replace $y_i$ with a vector $Y_i \in \{0,1\}^L$, where all entries of $Y_i$ are zero except the $y_i$th entry, which is set to 1. In addition, we choose a \textit{baseline} category against which the odds of all the other categories are measured. Let $J \in \{1, \dots, L \}$ be the baseline category.
| |
| We define the variables |
| \begin{align} |
| \pi_{i,j} &= \frac{\exp\left(\sum_{k=1}^K x_{i,k} \beta_{k,j} \right)}{1 + \sum_{j\ne J} \exp \left(\sum_{k=1}^K x_{i,k} \beta_{k,j} \right)}, \ \ \forall j \ne J\\ |
| \pi_{i,J} &= \frac{1}{1 + \sum_{j\ne J} \exp \left(\sum_{k=1}^K x_{i,k} \beta_{k,j} \right)}. |
| \end{align} |
| |
| The derivatives are then |
| \begin{equation}\label{eq:first_derivative2} |
| \frac{\partial l}{\partial \beta_{k,j}} = \sum_{i=1}^{N} Y_{i,j}x_{i,k} - \pi_{i,j}x_{i,k} \ \ \ \ \forall k \ \forall j |
| \end{equation} |
| The Hessian is then |
| \begin{align}\label{eq:second_derivative2} |
| \frac{\partial^2 l({\beta})}{\partial \beta_{k_2,j_2} \partial \beta_{k_1,j_1}} |
| &= \sum_{i=1}^{N} -\pi_{i,j_2}x_{i,k_2}(1-\pi_{i,j_1})x_{i,k_1} &&j_1 = j_2 \\ |
| &= \sum_{i=1}^{N} \pi_{i,j_2}x_{i,k_2}\pi_{i,j_1}x_{i,k_1} &&j_1 \neq j_2 |
| \end{align} |
| |
| \subsubsection{Merge Step and Final Step} |
| |
For the logistic and multi-logistic regressions, the derivative and Hessian are sums in
| which the terms can be computed in parallel, and thus in a distributed manner. |
| Each transition step computes a single term in the sum. The merge step sums two |
| or more terms computed by the transition steps, and the final step computes the |
| Hessian inverse, and the matrix product between the bread and meat matrices. |
| |
| |
| \section{Marginal Effects} % (fold) |
| \label{sub:marginal_effects} |
| \textit{Most of the notes below are based on~\cite{diekmann2008}} |
| |
| A \emph{marginal effect} (ME) or partial effect measures the effect on the |
| conditional mean of $y$ of a change in one of the regressors, say |
| $x_k$~\cite{cameron2009}. In the linear regression model (without any interaction |
| terms), the ME equals the relevant slope coefficient, greatly simplifying analysis. |
| For nonlinear models, this is no longer the case, leading to remarkably many |
| different methods for calculating MEs. |
| |
| Let $E(y_i | x_i)$ represent the expected value of a dependent variable $y_i$ |
| given a vector of independent variables $x_i$. In the case where $y$ is a |
| linear function of $(x_1, \dots, x_M) = \vec{x}$ and $y$ is a continuous variable, a |
| linear regression model (without any interaction effects) can be stated as follows: |
| \begin{align*} |
| & y = \vec{x}^T \vec{\beta} \\ |
| & \text{or} \\ |
| & y = \beta_1 x_1 + \ldots + \beta_M x_M. |
| \end{align*} |
| From the above equation it is straightforward to see that the marginal effect of |
| any variable $x_k$ on the dependent variable is $\partial y / \partial x_k = |
| \beta_k$. However, this is true only if there exists no interaction between the |
| variables. If the output expression contains interaction terms, the model would be |
| \begin{gather*} |
		y = \beta_1 f_1 + \ldots + \beta_N f_N,
| \end{gather*} |
| where $f_i$ is a function of the base variables $x_1, x_2, \ldots, x_M$ and |
| describes the interaction between the base variables. In the simple |
| (non-interaction) case, $f_i = x_i$ and $M = N$. |
| |
The standard approach to modeling dichotomous/binary variables
(so $y \in \{0, 1\}$) is to estimate a generalized linear model under the
assumption that $y$ follows some form of Bernoulli distribution. Thus the expected
value of $y$ becomes,
| \begin{equation*} |
| y = G(\vec{x}^T \vec{\beta}), |
| \end{equation*} |
where $G$ is the specified binomial distribution. Here we assume that logistic
regression is used, with $g$ referring to the inverse logit function. The same
approach is applied when $y$ is a discrete, multi-valued variable.
| |
| \subsection{Discrete change effect} % (fold) |
| \label{sub:discrete_change_effect} |
| Along with marginal effects we can also compute the following discrete change |
| effects. |
| |
| \begin{enumerate} |
| \item Unit change effect, which should not be confused with the |
| marginal effect: |
| \begin{align*} |
| \frac{\partial y}{\partial x_k} & = P(y=1|X,x_k+1) - P(y=1|X, x_k) \\ |
| & = g( \beta_1 x_1 + \dots + \beta_k (x_k+1) + \beta_l x_l) \\ |
| & \qquad - g( \beta_1 x_1 + \dots + \beta_k x_k + \beta_l x_l) |
| \end{align*} |
| \item Centered unit change: |
| \begin{gather*} |
| \frac{\partial y}{\partial x_k} = P(y=1|X,x_k+0.5) - P(y=1|X, x_k-0.5) \\ |
| \end{gather*} |
| \item Standard deviation change: |
| \begin{gather*} |
| \frac{\partial y}{\partial x_k} = P(y=1|X,x_k+0.5\delta_k) - P(y=1|X, x_k-0.5\delta_k) \\ |
| \end{gather*} |
| \item Min-max change: |
| \begin{gather*} |
| \frac{\partial y}{\partial x_k} = P(y=1|X,x_k=x_k^{max}) - P(y=1|X, x_k=x_k^{min}) \\ |
| \end{gather*} |
| \end{enumerate} |
| % subsection discrete_change_effect (end) |
| |
| \subsection{Categorical variables} % (fold) |
| \label{sub:categorical_variables} |
| Categorical variables are converted to dummy variables to use in the regression. |
| The regression method treats the dummy variable as a continuous variable and |
| returns a coefficient for each dummy variable. |
To compute the marginal effect for each such dummy variable, we could, as in the
regression, treat it as a continuous variable and
compute the partial derivative of the response variable with respect to it.
However, since these variables are discrete, using the partial derivative is inappropriate.
| |
An alternative method is to compute the discrete change for each dummy variable with
respect to the reference level of the categorical variable. Let $x_k$ be a dummy
variable corresponding to the value $v$ of a categorical variable, and let $x_l$
be another dummy variable representing value $w$ of the same categorical
variable. The discrete difference with respect to $x_k$ is defined as:
| \begin{align*} |
| \Delta_{x_k} y = y_{k}^{set} - y_{k}^{unset} |
| \end{align*} |
| where, |
| \begin{align*} |
| y_{k}^{set} &= \beta_1 f_1 + \ldots + \beta_k f_k(x_k=1) + \ldots + \beta_l f_l(x_l=0) + \ldots, \\ |
| & \text{and} \\ |
| y_{k}^{unset} &= \beta_1 f_1 + \ldots + \beta_k f_k(x_k=0) + \ldots + \beta_l f_l(x_l=0) + \ldots. |
| % \begin{cases} |
| % 1, & \text{if}\ a=1 \\ |
| % 0, & \text{otherwise} |
| % \end{cases} |
| \end{align*} |
| % \left.y\right\vert_{x_k^{(v)}=1, x_k^{(w)}=0\ (w\neq v)} \\ |
| % & \text{or} \\ |
| % - \left.y\right\vert_{x_k^{(0)}=1, x_k^{(w)}=0 \ (w \neq 0)}\ |
| % &= P_{set} - P_{unset}, |
| |
| If the $y$ expression contains dummy variable ($x_p$) corresponding to the |
| reference level of the categorical variable (represented as $r$), then that |
| variable will be unset ($x_p = 0$) in $y^{set}$ and set ($x_p = 1$) in $y^{unset}$. |
| Note that in many cases, the dummy variable for the reference level does not |
| appear in the regression model and unsetting the variable for value $w$ |
| when $w\neq r$ is enough in the second term ($y^{unset}$). |
| |
In MADlib, we only support the discrete difference method for dummy
variables. Let's walk through the computation of the marginal effect for \verb|color_blue|
(illustrated in the sketch following the list below) when the input for the regression is
| \begin{verbatim} |
| array[1, color_blue, color_green, degree_college, degree_college * color_blue, |
| degree_college * color_green, gpa, gpa^2, degree_college * gpa, |
| degree_college * gpa^2, weight] |
| \end{verbatim} |
| for a specific row of the data. Here \verb|color_blue| and \verb|color_green| |
| belong to the same categorical variable, with \verb|color_red| being the |
| reference variable for it (not included in the regressor list). |
| |
| \begin{enumerate} |
| \item The value of \verb|color_blue| would be set equal to 1 |
| \item The value of \verb|color_blue| in its interaction terms would also be set to 1 \\ |
| (i.e. \verb|degree_college * color_blue = degree_college|) |
| \item The value of \verb|color_green| would be set equal to 0 |
| \item The value of \verb|color_green| in its interaction terms would also be set to 0 \\ |
| (i.e. \verb|degree_college * color_green = 0|) |
| \item Compute resulting predicted values of response variable ($y^{set}$) |
| \item Set value of \verb|color_blue| equal to 0 |
| \item Set value of \verb|color_blue| in its interaction terms to 0 \\ |
| (i.e. \verb|degree_college * color_blue = 0|) |
| \item Set value of \verb|color_green| equal to 0 and also in its interaction terms \\ |
| (i.e. \verb|degree_college * color_green = 0|) |
| \item Compute resulting predicted values of response variable ($y^{unset}$) |
| \item Compute ($y^{set} - y^{unset}$) |
| \end{enumerate} |
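
The following sketch carries out the steps above for \verb|color_blue| under a
logistic model; it is illustrative only, and the logistic link and helper names
are assumptions for the example.
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regressors(color_blue, color_green, degree_college, gpa, weight):
    """The independent-variable array used in the regression above."""
    return np.array([1.0, color_blue, color_green, degree_college,
                     degree_college * color_blue, degree_college * color_green,
                     gpa, gpa ** 2, degree_college * gpa,
                     degree_college * gpa ** 2, weight])

def discrete_diff_color_blue(beta, degree_college, gpa, weight):
    # Steps 1-5: color_blue = 1, color_green = 0 (also in interaction terms)
    f_set = regressors(1.0, 0.0, degree_college, gpa, weight)
    # Steps 6-9: color_blue = 0, color_green = 0 (also in interaction terms)
    f_unset = regressors(0.0, 0.0, degree_college, gpa, weight)
    # Step 10: difference of the predicted responses
    return sigmoid(f_set @ beta) - sigmoid(f_unset @ beta)
\end{verbatim}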
| % subsection categorical_variables (end) |
| |
| \subsection{AME vs MEM} % (fold) |
| \label{sub:ame_vs_mem} |
| There are two main methods of calculating the marginal effects for dependent variables. |
| \begin{enumerate} |
| \item The first uses the average of the marginal effects at every sample |
| observation (AME). |
| % This is calculated as follows: |
| % \begin{gather*} |
| % \langle \frac{\partial P(y_i = 1)}{\partial x_k} \rangle = \frac{1}{n}\sum_{i=1}^{n} P(y_i = 1)(1-P(y_i = 1))\cdot\frac{\partial z_i}{\partial x_k}, \\ |
| % \text{where, } P(y_i=1) = \frac{1}{1 + e^{-z_i}}, \\ |
| % \text{and, } z_i = F(X_i)^T\beta \\ |
| % \end{gather*} |
| |
	\item The second approach calculates the marginal effect for $x_k$ by holding
	all regressors at their mean values and evaluating the marginal effect of the
	variable under consideration at that point (MEM).
| |
| % The derivation of this marginal effect is captured by the following: |
| % \begin{gather*} |
| % \left.\frac{\partial P(y=1)}{\partial |
| % x_k}\right\vert_{X=\bar{X}} = \quad |
| % P(y=1|\bar{X})(1-P(y=1|\bar{X})) |
| % \left.\frac{\partial z}{\partial x_k}\right\vert_{X=\bar{X}} \\ |
| % \text{where, } \bar{X} = \frac{\sum_{i=1}^{n}X_i}{n} |
| % \end{gather*} |
| \end{enumerate} |
| |
It is generally viewed as problematic to evaluate
marginal effects at means (MEM) of dummy variables, since means of dummies refer
to nonexistent observations. In MADlib, we currently only provide the option to
compute the average marginal effects (AME). In future versions, we aim to
provide various options, including marginal effects at means and marginal
effects at a representative value. We currently do
provide the option to compute the marginal effect on a separate dataset, which
could be used as a workaround by including just the mean value or the representative value.
| % subsection ame_vs_mem (end) |
| |
| \subsection{Marginal effects for regression methods} % (fold) |
| \label{sub:marginal_effects_for_regression_methods} |
| |
| Below we derive the marginal effects for the various regression methods in MADlib. |
| We assume a general formulation of the output $y$ as |
| $$ |
| y = g(\beta_1 f_1 + \beta_2 f_2 + \ldots + \beta_{N} f_{N}). |
| $$ |
| As stated above, each $f_i$ is a function of all the base variables $x_1, x_2, \ldots, x_M$ and |
| describes the interaction between the base variables. In the simple |
| (non-interaction) case, $f_i = x_i$. |
| |
| Let's represent the vector of coefficients as |
| $$ |
| \vec{\beta} = |
| \begin{bmatrix} |
| \beta_1 \\ |
| \vdots \\ |
| \beta_N \\ |
| \end{bmatrix}, |
| $$ |
| and the partial derivative of the $x$ terms in the output |
| with respect to each variable $x_i$ (Jacobian matrix) as |
| $$J = |
| \begin{bmatrix} |
| \pder[f_1]{x_1}, & \pder[f_1]{x_2}, & \pder[f_1]{x_3}, & \ldots, & \pder[f_1]{x_M}\\[7pt] |
| \pder[f_2]{x_1}, & \pder[f_2]{x_2}, & \pder[f_2]{x_3}, & \ldots, & \pder[f_2]{x_M}\\[7pt] |
| \pder[f_3]{x_1}, & \pder[f_3]{x_2}, & \pder[f_3]{x_3}, & \ldots, & \pder[f_3]{x_M}\\[7pt] |
| & & \vdots & & \\[7pt] |
| \pder[f_N]{x_1}, & \pder[f_N]{x_2}, & \pder[f_N]{x_3}, & \ldots, & \pder[f_N]{x_M}\\[7pt] |
| \end{bmatrix} |
| $$ |
| |
In the above equation, for each element $J_{n,m}$, the column
corresponds to a base variable ($x_m$) and the row corresponds to a term in
the output expression ($f_n$).
| Let $\nabla f_n$ denote the $n$-th row of $J$ (i.e. gradient of $f_n$), |
| and let $J(m) = \pder[\vec{f}]{x_m}$ denote the $m$-th column of $J$, where $\vec{f} = (f_1, \ldots f_N)^T$. |
| If $x_k$ is a categorical variable and if discrete differences are required |
| for it, then the column corresponding to $x_k$ can be replaced by, |
| \begin{align*} |
| \pder[\vec{f}]{x_k} &= \vec{f^{set_k}} - \vec{f^{unset_k}} \\ |
| &= \begin{bmatrix} |
| f_0^{set_k} , & f_1^{set_k}, & f_2^{set_k} , & \ldots, & f_{N - 1}^{set_k} |
| \end{bmatrix}^T - |
| \begin{bmatrix} |
| f_0^{unset_k} , & f_1^{unset_k}, & f_2^{unset_k} , & \ldots, & f_{N - 1}^{unset_k} |
| \end{bmatrix}^T \\ |
| &= \begin{bmatrix} |
| f_0^{set_k} - f_0^{unset_k}, & f_1^{set_k} - f_1^{unset_k}, & |
| f_2^{set_k} - f_2^{unset_k}, & \ldots, & f_{N - 1}^{set_k} - f_{N - 1}^{unset_k} |
| \end{bmatrix}^T, |
| \end{align*} |
| where |
| \begin{align*} |
| & f_i^{set_k} = f_i(\ldots, x_k=1, x_l=0, x_r=0) \\ |
| & \text{and} \\ |
| & f_i^{unset_k} = f_i(\ldots, x_k=0, x_l=0, x_r=1), |
| \end{align*} |
| |
| $\forall x_l \in$ (set of dummy variables related to $x_k$ excluding the |
| reference variable) and $x_r = $ reference variable of $x_k$ (if present). The |
| response probability corresponding to $f^{set_k}$ is denoted as $P^{set_k}$. |
| |
| |
| \subsubsection{Linear regression} % (fold) |
| \label{ssub:linear_regression} |
| For the linear regression equation of the form $y = \vec{f}^T \vec{\beta}$, |
| the marginal effect for each variable ($x_i$) is the same as the coefficient for |
| that variable ($\beta_i$). When interaction effects are present, the marginal effect |
| will be, |
| $$ |
| \mathit{ME} = J^T \vec{\beta} |
| $$ |
| % ssubsection linear_regression (end) |
| \subsubsection{Logistic regression} % (fold) |
| \label{ssub:logistic_regression} |
| In logistic regression: |
| \begin{align*} |
| P &= \frac{1}{1 + e^{-\vec{f}^T \vec{\beta}}} \\ |
| &= \frac{1}{1 + e^{-z}} |
| \end{align*} |
| \begin{align*} |
| \implies \frac{\partial P}{\partial X_k} &= P \cdot (1-P) \cdot |
| \frac{\partial z}{\partial x_k}, |
| \end{align*} |
where the partial derivative in the last equation equals $\beta_k$
if there are no interaction terms. Thus the marginal effects for all variables,
presented as a vector, will be,
| $$ |
| \mathit{ME} = P \cdot (1-P) \cdot J^T \vec{\beta} |
| $$ |
| |
| For categorical variables, we compute the discrete difference as described in |
| \ref{sub:categorical_variables}. |
| |
| For variable $x_k$: |
| $$ |
| \mathit{ME_k} = P^{set} - P^{unset} |
| $$ |
| |
| % subsubsection logistic_regression (end) |
| \subsubsection{Multilogistic regression} % (fold) |
| \label{ssub:multilogistic_regression} |
| The probabilities of different outcomes for multilogistic regression are expressed as, |
| \begin{gather*} |
| P^l = P(y=l | \vec{x}) |
| = \frac{e^{\vec{f}^T \vec{\beta^l}}}{1 + \displaystyle \sum_{q=1}^{L-1} e^{\vec{f}^T \vec{\beta^q}}} |
| = \frac{e^{\vec{f}^T \vec{\beta^l}}}{\displaystyle \sum_{q=1}^{L} e^{\vec{f}^T \vec{\beta^q}}}, |
| \end{gather*} |
| where $\vec{\beta^l}$ represents the coefficient vector for category $l$, with |
| $L$ being the total number of categories. The coefficients are set to zero for |
| one of the outcomes, called the ``base outcome'' or the ``reference category''. |
| Here, without loss of generality, we let $\vec{\beta^L} = \vec{0}$. |
| |
| % The odds of outcome $j$ verus outcome $m$ are |
| % \begin{align*} |
| % \frac{P(y=j | X)}{P(y=m | X)} & = \frac{\frac{e^{X\beta_j}}{\sum_{l=1}^{j} e^{X\beta_l}}} |
| % {\frac{e^{X\beta_m}}{\sum_{l=1}^{j} e^{X\beta_l}}} \\ |
| % & = \frac{e^{X\beta_j}}{e^{X\beta_m}} |
| % \end{align*} |
| % Thus the partial derivative would be, |
| % \begin{gather*} |
| % \frac{\partial ln(P_j/P_m)}{\partial x_k} = \beta_{kj} - \beta_{km} |
| % \end{gather*} |
| |
| % Thus, if $x_k$ is increased by one unit the odds of outcome $j$ against the |
| % base outcome changes by exp($\beta_{kj}$) (since $\beta_{km}=0$ for the base |
| % outcome). Expanding $P_m$ we finally get, |
| Thus, |
| \begin{align*} |
| \displaystyle |
| \pder[P^l]{x_m} &= |
| \frac{1}{\displaystyle \sum_{q=1}^{L} e^{\vec{f}^T \vec{\beta^q}}} |
| \pder[e^{\vec{f}^T \vec{\beta^l}}]{x_m} - |
| \frac{e^{\vec{f}^T \vec{\beta^l}} } |
| {\displaystyle \left(\sum_{p=1}^{L} e^{\vec{f}^T \vec{\beta^p}}\right)^2} \cdot |
| \pder{x_m} \displaystyle \sum_{q=1}^{L} e^{\vec{f}^T \vec{\beta^q}} \\[5pt] |
| &= \frac{e^{\vec{f}^T \vec{\beta^l}}} |
| {\displaystyle \sum_{q=1}^{L} e^{\vec{f}^T \vec{\beta^q}}} \pder[\vec{f}^T \vec{\beta^l}]{x_m} - |
| \frac{ e^{\vec{f}^T \vec{\beta^l}} } |
| {\displaystyle \sum_{p=1}^{L} e^{\vec{f}^T \vec{\beta^p}} } |
| \quad \displaystyle |
| \mathlarger{\sum_{q=1}^{L}} |
| \frac{ e^{\vec{f}^T \vec{\beta^q}} } |
| {\displaystyle \sum_{p=1}^{L} e^{\vec{f}^T \vec{\beta^p}}} |
| \pder[\vec{f}^T \vec{\beta^q}]{x_m} \\[5pt] |
| \end{align*} |
| |
| Hence, for every $m$-th variable $x_m$, we have the marginal effect for category |
| $l$ as, |
| \begin{gather*} |
| \mathit{ME}_m^l = P^l\left(\displaystyle J(m)^T \vec{\beta^l} - |
| \sum_{q=1}^{L-1} P^q \cdot J(m)^T \vec{\beta^q} \right). |
| \end{gather*} |
| Vectorizing the above equation, we get |
| $$ |
| \mathit{ME}^l = P^l \left(\displaystyle J^T \vec{\beta^l} - J^T B \vec{p} \right), |
| $$ |
where $\vec{p} = (P^1, \ldots, P^{L-1})^T$ is a column vector and $B = (\vec{\beta}^1, \dots, \vec{\beta}^{L-1})$ is an $N \times (L-1)$ matrix.
| Finally, we can simplify the computation of the marginal effects matrix as |
| $$ |
| \mathit{ME} = J^T B \mathit{diag}(\vec{p}) - J^T B \vec{p} \vec{p}^T. |
| $$ |
| |
| Once again, for categorical variables, we compute the discrete difference |
| as described in \ref{sub:categorical_variables}. |
| For categorical variable $x_k$, the $k$-th row of $\mathit{ME}$ will be, |
| $$ |
| \mathit{ME_k} = \left( \vec{p^{set_k}} - \vec{p^{unset_k}} \right)^T |
| $$ |
| |
| % subsubsection multilogistic_regression (end) |
| % subsection marginal_effects_for_regression_methods (end) |
| |
| \subsection{Standard Errors} % (fold) |
| \label{sub:standard_errors} |
| The delta method is a popular way to estimate standard errors of non-linear |
| functions of model parameters. While it is straightforward to calculate the |
| variance of a linear function of a random variable, it is not for a nonlinear |
| function. The delta method therefore relies on finding a linear approximation |
| of the function by using a first-order Taylor expansion. |
| |
| We can approximate a function $g(x)$ about a value $a$ as, |
| \[ g(x) \approx g(a) + (x-a)g'(a) \] |
| |
| Taking the variance and setting $a = \mu_x$, |
| \[ Var(g(X)) \approx \left[g'(\mu_x)\right]^2 Var(X) \] |
| |
| \subsubsection*{Linear Regression} % (fold) |
| \label{ssub:std_linear_regression} |
| Using this technique, to compute the variance of the marginal effects |
| at a given observation value in \emph{linear regression}, we obtain |
the standard error by first computing the marginal effect's derivative
over the coefficients, which is an $M\times N$ matrix $S_{mn}
= \frac{\partial \mathit{ME}_m}{\partial \beta_n}$
| |
| \begin{align*} |
| \vec{S} &= \pder[J^T \vec{\beta}]{\vec{\beta}} \\ |
| &= J^T |
| \end{align*} |
| |
| Using the delta method we can then compute the variance of the marginal effects |
| as, |
| \begin{align*} |
| Var(\mathit{ME}) = S \cdot Var(\beta)\cdot S^T\, |
| \end{align*} |
| where $Var(\beta)$ is a $N\times N$ matrix, $S$ is a $M\times N$ |
| matrix, $M$ is the number of base variables, and $N$ is the total number of |
| terms in independent variable list (i.e. length of $\vec{\beta}$). |
| |
Note: $Var(\vec{\beta})$ is computed using the training data
employed by the underlying regression, not the data used to compute the
marginal effects. The delta matrix ($\vec{S}$) is computed over the
data for which the marginal effects are desired (averaged over the data for AME).
| % subsubsection linear_regression (end) |
| |
| \subsubsection*{Logistic Regression} |
| Similar to linear regression, we obtain the variance matrix by first computing |
| the delta matrix, $\vec{S}$: |
| \begin{eqnarray*} |
| S_{mn} &=& \frac{\partial}{\partial\beta_n} \left[P (1- P) |
| \cdot \frac{\partial z}{\partial x_m}\right]\\ |
| &=& P (1- P) \cdot \frac{\partial}{\partial\beta_n}\left(\frac{\partial z}{\partial x_m}\right) |
| + \frac{\partial \left[P (1- P)\right]}{\partial\beta_n} \cdot \frac{\partial z}{\partial x_m}\\ |
| &=& P(1-P)\cdot\frac{\partial^2 z}{\partial x_m\partial\beta_n} |
| + P(1-P)(1-2P) \cdot \frac{\partial z}{\partial \beta_n} \cdot \frac{\partial z}{\partial x_m},\\ |
| \text{where } P &=& \frac{1}{1 + e^{-z}} \\ |
| \text{and } z &=& \vec{f}^T \vec{\beta}. \\ |
| \end{eqnarray*} |
Using the definition of $z$, we can simplify the entries of $S$:
| \begin{equation*} |
| S_{mn} = P(1-P)\left(\frac{\partial f_n}{\partial x_m} + (1-2P)\cdot f_n \cdot J(m)^T \vec{\beta} \right) |
| \end{equation*} |
| |
Thus, the $n$-th column of $S$ is
| $$ |
| c_n(S) = P(1-P)\left[\nabla f_n^T + (1-2P) \cdot f_n \cdot J^T \vec{\beta}\right]. |
| $$ |
| |
| Vectorizing this equation to express for the complete matrix, |
| \begin{equation*} |
| \vec{S} = P(1-P) \left(J^T + |
| (1-2P) \cdot (J^T \vec{\beta}) \vec{f}^T \right) |
| \end{equation*} |
| |
For categorical variables, replacing $P(1-P)\cdot(\partial z/\partial x_m)$
with $\Delta_{x_m}P$ in $S_{mn}$, we get
| \begin{eqnarray*} |
| S_{mn} &=& \frac{\partial(P^{set}-P^{unset})}{\partial\beta_n} \\ |
| &=& P^{set} (1 - P^{set}) \cdot f_n^{set} - P^{unset} (1 - P^{unset}) \cdot f_n^{unset} |
| \end{eqnarray*} |
| |
| Similar to linear regression, we can then compute the variance of the marginal effects as, |
| \begin{align*} |
| Var(\mathit{ME}) = S \cdot Var(\vec{\beta})\cdot S^T\, |
| \end{align*} |
where $Var(\vec{\beta})$ is computed on the original training dataset, while
$\vec{S}$ is averaged over the dataset for which the marginal effects are desired
(which could be just a single data point).
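
A corresponding NumPy sketch for the logistic case at a single observation is
shown below; again, the shapes and names are assumptions for exposition only:
\begin{verbatim}
import numpy as np

# Assumed inputs:
#   J        : (N, M) Jacobian d f_n / d x_m at the observation
#   f        : (N,)   term values at the observation
#   beta     : (N,)   fitted coefficients
#   var_beta : (N, N) variance-covariance matrix from the training data
def logistic_me_std_errors(J, f, beta, var_beta):
    P = 1.0 / (1.0 + np.exp(-(f @ beta)))        # P = 1 / (1 + e^{-z})
    # S = P(1-P) * (J^T + (1-2P) (J^T beta) f^T)
    S = P * (1 - P) * (J.T + (1 - 2 * P) * np.outer(J.T @ beta, f))
    var_me = S @ var_beta @ S.T                  # (M, M)
    return np.sqrt(np.diag(var_me))
\end{verbatim}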
| |
| \subsubsection*{Multinomial Logistic Regression} |
| |
| For multinomial logistic regression, the coefficients $\vec{\beta}$ form a matrix of |
| dimension $N \times (L - 1)$ where $L$ is the number of categories and $N$ is the |
| number of features (including interaction terms). In order to compute the |
| standard errors on the marginal effects of category $l$ for independent variable |
| $x_m$, we need to compute the term $\pder[\mathit{ME}_{m}^{l}]{\beta_{n}^{l'}}$ |
| for each $l' \in \{1 \ldots (L-1) \}$ and $n \in \{1 \ldots N \}$. Note that |
| $m$ here is restricted to be in the set $\{1 \ldots M\}$. |
| The result is a column vector of length $(L-1) \times N$ denoted by |
| $\pder[\mathit{ME}_{m}^{l}]{\vec{\beta}}$. Hence, for each category |
| $l \in \{1 \ldots (L-1)\}$ and independent variable $m \in \{1 \ldots M\}$, we |
| perform the following computation |
| |
| \begin{equation} |
| Var(\mathit{ME}_m^l) = \pder[\mathit{ME}_m^l]{\vec{\beta}}^T V \pder[\mathit{ME}_m^l]{\vec{\beta}}, |
| \end{equation} |
| |
| where $V$ is the variance-covariance matrix of the multinomial logistic |
| regression vectorized in the \textbf{same order} in which the partial derivative |
| is vectorized. |
| |
| From our earlier derivation, we know that the marginal effect for multinomial |
| logistic regression for the $m$-th index of data vector $\vec{x}$ is given as: |
| \begin{gather*} |
| \mathit{ME}_m^l = P^l \left[ J(m)^T \vec{\beta^l} - |
| \sum_{q=1}^{L-1} P^q J(m)^T \vec{\beta^q} \right] |
| \end{gather*} |
| where |
| \begin{gather*} |
| P^l = P(y=l| \vec{x}) = \frac{e^{\vec{f}\vec{\beta^l}}}{\displaystyle \sum_{q=1}^{L} e^{\vec{f}\vec{\beta^q}}} |
| \qquad \forall l \in \{ 1 \ldots (L-1) \}. |
| \end{gather*} |
| |
| We now compute the term $\pder[\mathit{ME}_m^l]{\vec{\beta}}$. First, |
| we define the indicator function, $\delta$, as: |
| |
| \begin{equation*} |
| \delta_{i,j} = \begin{cases} 1 & \mbox{if } i=j \\ |
| 0 & \mbox{otherwise} \end{cases}. |
| \end{equation*} |
| |
| We can show that for each $l' \in \{ 1 \ldots (L-1) \}$ and $n \in \{1 \ldots N\}$, |
| the partial derivative will be |
| \begin{align*} |
| \pder[\mathit{ME}_m^l]{\beta_n^{l'}} &= |
| \pder[P^l]{\beta_n^{l'}} |
| \left[ J(m)^T \vec{\beta^l} - |
| \sum_{q=1}^{L-1} P^q J(m)^T \vec{\beta^q} |
| \right] + |
| P^l \left[ \pder[]{\beta_n^{l'}}(J(m)^T \vec{\beta^l}) - |
        \pder[]{\beta_n^{l'}} \left( \sum_{q=1}^{L-1} P^q J(m)^T \vec{\beta^q} \right)
| \right] |
| \end{align*} |
| where |
| \begin{align*} |
| \pder[P^l]{\beta_n^{l'}} &= P^l f_n (\delta_{l,l'} - P^{l'}) |
| \end{align*} |
| |
| The expression above can be simplified to obtain |
| \begin{align*} |
| \pder[\mathit{ME}_m^l]{\beta_n^{l'}} &= |
| P^l f_n (\delta_{l,l'} - P^{l'}) |
| \left[ J(m)^T \vec{\beta^l} - |
| \sum_{q=1}^{L-1} P^q J(m)^T \vec{\beta^q} |
| \right] + \\ |
| & \qquad P^l \left[ \delta_{l,l'} \pder[f_n]{x_m} - P^{l'} f_n J(m)^T \vec{\beta^{l'}} + |
| P^{l'} f_n \sum_{q=1}^{L-1} P^q J(m)^T \vec{\beta^q} - |
| P^{l'} \pder[f_n]{x_m} |
| \right] \\ |
| &= f_n (\delta_{l,l'} - P^{l'}) \cdot P^l \left[J(m)^T\vec{\beta^l} - |
| \sum_{q=1}^{L-1} P^q J(m)^T \vec{\beta^q} \right] + \\ |
| & \qquad P^l \left[\delta_{l,l'} \pder[f_n]{x_m} - |
| f_n \cdot P^{l'} |
| \left( J(m)^T \vec{\beta^{l'}} - \sum_{q=1}^{L-1} P^q J(m)^T \vec{\beta^q} \right) |
| - P^{l'} \pder[f_n]{x_m} |
| \right] \\[7pt] |
| &= f_n (\delta_{l, l'} - P^{l'})\mathit{ME}_m^l + |
| P^l \left[ \delta_{l, l'} \pder[f_n]{x_m} - f_n \mathit{ME}_m^{l'} - P^{l'} \pder[f_n]{x_m} \right]. |
| \end{align*} |
| |
| Again, the above computation is performed for every $l \in \{1 \ldots (L-1)\}$ (base outcome is skipped) |
| and every $m \in \{1 \ldots M\}$, with each computation returning a column vector |
| of size $(L-1) \times N$. |
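
The following NumPy sketch computes this variance for one pair $(m, l)$; the
argument shapes, the category-major vectorization of $V$, and the function name
are assumptions made for illustration, not the actual implementation:
\begin{verbatim}
import numpy as np

# Assumed inputs:
#   J  : (N, M)    Jacobian d f_n / d x_m at the observation
#   B  : (N, L-1)  coefficients, one column per non-base category
#   f  : (N,)      term values at the observation
#   ME : (M, L-1)  marginal effects matrix from the earlier derivation
#   V  : ((L-1)*N, (L-1)*N) variance-covariance matrix, vectorized
#        category by category, matching the gradient built below
#   m, l : 0-based indices of the base variable and non-base category
def multinomial_me_variance(J, B, f, ME, V, m, l):
    scores = f @ B
    p = np.exp(scores) / (1.0 + np.exp(scores).sum())   # P^1 ... P^{L-1}
    Lm1, N = B.shape[1], B.shape[0]

    grad = np.empty((Lm1, N))        # grad[l', n] = d ME_m^l / d beta_n^{l'}
    for lp in range(Lm1):
        delta = 1.0 if lp == l else 0.0
        grad[lp] = (f * (delta - p[lp]) * ME[m, l]
                    + p[l] * (delta * J[:, m]
                              - f * ME[m, lp]
                              - p[lp] * J[:, m]))
    g = grad.reshape(-1)             # column vector of length (L-1)*N
    return g @ V @ g                 # Var(ME_m^l)
\end{verbatim}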
| |
| For categorical variables, we use the discrete difference value of $\mathit{ME}_m^l$ |
| to compute the standard error as, |
| |
| \begin{eqnarray*} |
| \pder[\mathit{ME}_m^l]{\beta_n^{l'}} &=& |
| \pder[P^{set_m, l}]{\beta_n^{l'}} - \pder[P^{unset_m, l}]{\beta_n^{l'}} \\[7pt] |
| &=& P^{set_m, l} f_n^{set_m} (\delta_{l,l'} - P^{set_m, l'}) - |
| P^{unset_m, l} f_n^{unset_m} (\delta_{l,l'} - P^{unset_m, l'}) \\[7pt] |
| &=& - \left(P^{set_m, l} f_n^{set_m} P^{set_m, l'} - |
| P^{unset_m, l} f_n^{unset_m} P^{unset_m, l'} \right) + \\ |
| && \qquad \delta_{l,l'} \left(P^{set_m, l} f_n^{set_m} - P^{unset_m, l} f_n^{unset_m} \right), |
| \end{eqnarray*} |
| where |
| \begin{align*} |
| & P^{set_m, l} = \frac{e^{\vec{f^{set_m}}\vec{\beta^l}}}{\displaystyle \sum_{q=1}^{L} e^{\vec{f^{set_m}}\vec{\beta^q}}} \\ |
| & \text{and} \\ |
| & P^{unset_m, l} = \frac{e^{\vec{f^{unset_m}}\vec{\beta^l}}}{\displaystyle \sum_{q=1}^{L} e^{\vec{f^{unset_m}}\vec{\beta^q}}} \qquad \qquad \forall l \in \{ 1 \ldots (L-1) \}. |
| \end{align*} |
| |
| |
| |
| % subsection standard_errors (end) |
| % section marginal_effects (end) |
| |
| |
| \section{Clustered Standard Errors} % (fold) |
| \label{sec:clustered_standard_errors} |
Adjusting standard errors for clustering can be important. For
example, replicating a dataset 100 times should not increase the
precision of parameter estimates, yet doing so under the IID
assumption does exactly that. Another example arises in the economics
of education, where it is reasonable to expect that the error terms
for children in the same class are not independent. Clustered
standard errors correct for this.
| |
| \subsection{Overview of Clustered Standard Errors} |
| |
Assume that the data can be separated into $m$ clusters. Usually this
can be done by grouping the data table according to one or more
columns.
| |
| The estimator has a similar form to the usual sandwich estimator |
| \begin{equation} |
| S(\vec{\beta}) = B(\vec{\beta}) M(\vec{\beta}) B(\vec{\beta}) |
| \end{equation} |
| |
The bread part is the same as in the usual sandwich estimator:
| \begin{eqnarray} |
| B(\vec{\beta}) & = & \left(-\sum_{i=1}^{n} H(y_i, \vec{x}_i, |
| \vec{\beta})\right)^{-1}\\ |
| & = & \left(-\sum_{i=1}^{n}\frac{\partial^2 l(y_i, \vec{x}_i, |
| \vec{\beta})}{\partial \beta_\alpha \partial \beta_\beta}\right)^{-1} |
| \end{eqnarray} |
where $H$ is the Hessian matrix, i.e., the second derivative of the
target function
| \begin{equation} |
| L(\vec{\beta}) = \sum_{i=1}^n l(y_i, \vec{x}_i, \vec{\beta})\ . |
| \end{equation} |
| |
| The meat part is different |
| \begin{equation} |
| M(\vec{\beta}) = {A}^T{A} |
| \end{equation} |
where the $k$-th row of ${A}$ is
\begin{equation}
  A_k = \sum_{i\in G_k}\frac{\partial
    l(y_i,\vec{x}_i,\vec{\beta})}{\partial \vec{\beta}}
\end{equation}
where $G_k$ is the set of rows that belong to the $k$-th cluster.
| |
| \subsection{Implementation} |
| |
We can compute the quantities $B$ and $A$ for each cluster during one scan
through the data table in an aggregate function. The per-cluster results are
then summed into the full $B$ and $A$ outside the aggregate function. Finally,
the matrix multiplications are done in a separate function on the master node.
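
A minimal NumPy sketch of this computation, operating on in-memory arrays
rather than as an aggregate over a database table (all names and shapes are
illustrative), is:
\begin{verbatim}
import numpy as np

# Assumed inputs:
#   grads    : (n, K)    per-row score vectors d l(y_i, x_i, beta) / d beta
#   hessians : (n, K, K) per-row Hessians of the log-likelihood terms
#   clusters : (n,)      cluster label of each row
def clustered_sandwich(grads, hessians, clusters):
    # bread: B(beta) = (-sum_i H_i)^{-1}
    bread = np.linalg.inv(-hessians.sum(axis=0))

    # meat: M(beta) = A^T A, with one row of A per cluster,
    # A_k = sum of the score vectors of the rows in cluster k
    A = np.vstack([grads[clusters == k].sum(axis=0)
                   for k in np.unique(clusters)])
    meat = A.T @ A

    return bread @ meat @ bread      # S(beta) = B M B
\end{verbatim}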
| % section clustered_standard_errors (end) |