<!-- HTML header for doxygen 1.8.4-->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.4"/>
<meta name="keywords" content="madlib,postgres,greenplum,machine learning,data mining,deep learning,ensemble methods,data science,market basket analysis,affinity analysis,pca,lda,regression,elastic net,huber white,proportional hazards,k-means,latent dirichlet allocation,bayes,support vector machines,svm"/>
<title>MADlib: Cox-Proportional Hazards Regression</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="navtree.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="resize.js"></script>
<script type="text/javascript" src="navtree.js"></script>
<script type="text/javascript">
$(document).ready(initResizable);
$(window).load(resizeHeight);
</script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { searchBox.OnSelectItem(0); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script src="../mathjax/MathJax.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
<link href="madlib_extra.css" rel="stylesheet" type="text/css"/>
<!-- google analytics -->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-45382226-1', 'auto');
ga('send', 'pageview');
</script>
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><a href="http://madlib.incubator.apache.org"><img alt="Logo" src="madlib.png" height="50" style="padding-left:0.5em;" border="0"/ ></a></td>
<td style="padding-left: 0.5em;">
<div id="projectname">
<span id="projectnumber">1.6</span> <span style="font-size:10pt; font-style:italic"><a href="../latest/./group__grp__cox__prop__hazards.html"> A newer version is available</a></span>
</div>
<div id="projectbrief">User Documentation</div>
</td>
<td> <div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.4 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
</div><!-- top -->
<div id="side-nav" class="ui-resizable side-nav-resizable">
<div id="nav-tree">
<div id="nav-tree-contents">
<div id="nav-sync" class="sync"></div>
</div>
</div>
<div id="splitbar" style="-moz-user-select:none;"
class="ui-resizable-handle">
</div>
</div>
<script type="text/javascript">
$(document).ready(function(){initNavTree('group__grp__cox__prop__hazards.html','');});
</script>
<div id="doc-content">
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
<a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(0)"><span class="SelectionMark">&#160;</span>All</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(1)"><span class="SelectionMark">&#160;</span>Files</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(2)"><span class="SelectionMark">&#160;</span>Functions</a><a class="SelectItem" href="javascript:void(0)" onclick="searchBox.OnSelectItem(3)"><span class="SelectionMark">&#160;</span>Groups</a></div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div class="header">
<div class="headertitle">
<div class="title">Cox-Proportional Hazards Regression<div class="ingroups"><a class="el" href="group__grp__glm.html">Generalized Linear Models</a></div></div> </div>
</div><!--header-->
<div class="contents">
<div class="toc"><b>Contents</b>
<ul>
<li class="level1">
<a href="#training">Training Function</a> </li>
<li class="level1">
<a href="#cox_zph">PHA Test Function</a> </li>
<li class="level1">
<a href="#examples">Examples</a> </li>
<li class="level1">
<a href="#background">Technical Background</a> </li>
<li class="level1">
<a href="#related">Related Topics</a> </li>
</ul>
</div><p>Proportional-hazards models enable the comparison of various survival models. These survival models are functions describing the probability of a one-time event (prototypically, death) with respect to time. The interval of time before death occurs is the survival time. Let T be a random variable representing the survival time, with cumulative distribution function P(t). Informally, P(t) is the probability that death has occurred by time t.</p>
<p><a class="anchor" id="training"></a></p>
<dl class="section user"><dt>Training Function</dt><dd></dd></dl>
<p>Following is the syntax for the <a class="el" href="cox__prop__hazards_8sql__in.html#a737450bbfe0f10204b0074a9d45b0cef" title="Compute cox-regression coefficients and diagnostic statistics. ">coxph_train()</a> training function: </p>
<pre class="syntax">
coxph_train( source_table,
output_table,
dependent_variable,
independent_variable,
right_censoring_status,
strata,
optimizer_params
)
</pre><p> <b>Arguments</b> </p>
<dl class="arglist">
<dt>source_table </dt>
<dd>TEXT. The name of the table containing input data. </dd>
<dt>output_table </dt>
<dd><p class="startdd">TEXT. The name of the table where the output model is saved. It has the following columns: </p>
<table class="output">
<tr>
<th>coef </th><td>FLOAT8[]. Vector of the coefficients. </td></tr>
<tr>
<th>loglikelihood </th><td>FLOAT8. Log-likelihood value of the MLE estimate. </td></tr>
<tr>
<th>std_err </th><td>FLOAT8[]. Vector of the standard error of the coefficients. </td></tr>
<tr>
<th>z_stats </th><td>FLOAT8[]. Vector of the z-statistics of the coefficients. </td></tr>
<tr>
<th>p_values </th><td>FLOAT8[]. Vector of the p-values of the coefficients. </td></tr>
<tr>
<th>hessian </th><td>FLOAT8[]. The Hessian matrix computed using the final solution. </td></tr>
<tr>
<th>num_iterations </th><td>INTEGER. The number of iterations performed by the optimizer. </td></tr>
</table>
<p>Additionally, a summary output table is generated that contains a summary of the parameters used for building the Cox model. It is stored in a table named &lt;output_table&gt;_summary. It has the following columns: </p>
<table class="output">
<tr>
<th>source_table </th><td>The source table name. </td></tr>
<tr>
<th>dependent_variable </th><td>The dependent variable name. </td></tr>
<tr>
<th>independent_variable </th><td>The independent variable name. </td></tr>
<tr>
<th>right_censoring_status </th><td>The right-censoring status. </td></tr>
<tr>
<th>strata </th><td>The stratification columns. </td></tr>
<tr>
<th>num_processed </th><td>The number of rows that were actually used in the computation. </td></tr>
<tr>
<th>num_missing_rows_skipped </th><td>The number of rows that were skipped in the computation due to NULL values in them. </td></tr>
</table>
<p class="enddd"></p>
</dd>
<dt>dependent_variable </dt>
<dd>TEXT. The name of the column containing the dependent variable, which is the time of death. There is no need to pre-sort the data. </dd>
<dt>independent_variable </dt>
<dd>TEXT. A string containing the name of a column that contains an array of numeric values, or a string expression in the format 'ARRAY[x1, x2, x3]', where <em>x1</em>, <em>x2</em> and <em>x3</em> are column names. Because the model has no constant term, the array should not include a column of 1s. </dd>
<dt>right_censoring_status (optional) </dt>
<dd>TEXT, default: TRUE for all observations. A string containing an expression that evaluates to the right-censoring status for the observation&mdash;TRUE if the observation is not censored and FALSE if the observation is censored. The string could contain the name of the column containing the right-censoring status, a fixed Boolean expression (i.e., 'true', 'false', '0', '1') that applies to all observations, or a Boolean expression such as 'column_name &lt; 10' that can be evaluated for each observation. </dd>
<dt>strata (optional) </dt>
<dd>VARCHAR, default: NULL, in which case no stratification is performed. A string of comma-separated column names that are the strata ID variables used for stratification. </dd>
<dt>optimizer_params (optional) </dt>
<dd>VARCHAR, default: NULL, which uses the default values of optimizer parameters: max_iter=100, optimizer=newton, tolerance=1e-8, array_agg_size=10000000, sample_size=1000000. It should be a string that contains 'key=value' pairs separated by commas. The meanings of these parameters are:<ul>
<li>max_iter &mdash; The maximum number of iterations. The computation stops if the number of iterations exceeds this, which usually means that there is no convergence.</li>
<li>optimizer &mdash; The optimization method. Right now, "newton" is the only one supported.</li>
<li>tolerance &mdash; The stopping criterion. When the difference between the log-likelihoods of two consecutive iterations is smaller than this number, the computation is considered to have converged and stops.</li>
<li>array_agg_size &mdash; To speed up the computation, the original data table is cut into multiple pieces, and each piece of the data is aggregated into one big row. During the computation, the whole big row is loaded into memory, which speeds up the computation. This parameter controls approximately how many numbers are put into one big row. A larger value of array_agg_size may speed up the computation further, but the size of the big row cannot exceed 1GB due to a restriction of PostgreSQL databases.</li>
<li>sample_size &mdash; To cut the data into approximately equal pieces, the data is first sampled, and the break points are then determined from the sample. A larger sample_size produces more accurate break points. </li>
</ul>
</dd>
</dl>
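<p>As a minimal sketch of how the optional arguments fit together, the call below stratifies by one column and overrides two optimizer parameters. It assumes the <em>sample_data</em> table from the Examples section. The output table name <em>sample_cox_strata</em> and the chosen parameter values are illustrative only. </p>
<pre class="example">
-- Hypothetical call: stratify by grp, use wbc as the only covariate,
-- and override two optimizer parameters (the others keep their defaults).
SELECT madlib.coxph_train( 'sample_data',
                           'sample_cox_strata',           -- illustrative output table name
                           'timedeath',
                           'ARRAY[wbc]',
                           'status',
                           'grp',                         -- strata: comma-separated column names
                           'max_iter=50,tolerance=1e-6'   -- 'key=value' pairs separated by commas
                         );
</pre>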
<p><a class="anchor" id="cox_zph"></a></p>
<dl class="section user"><dt>Proportional Hazards Assumption Test Function</dt><dd></dd></dl>
<p>The <a class="el" href="cox__prop__hazards_8sql__in.html#a682d95d5475ce33e47937067cadc2766" title="Test the proportional hazards assumption for a Cox regression model fit (coxph_train) ...">cox_zph()</a> function tests the proportional hazards assumption (PHA) of a Cox regression.</p>
<p>Proportional-hazards models enable the comparison of various survival models. These PH models, however, assume that the hazard for a given individual is a fixed proportion of the hazard for any other individual, and that the ratio of the hazards is constant across time. MADlib does not currently support applying a transformation to the time values before the correlation is computed.</p>
<p>The <a class="el" href="cox__prop__hazards_8sql__in.html#a682d95d5475ce33e47937067cadc2766" title="Test the proportional hazards assumption for a Cox regression model fit (coxph_train) ...">cox_zph()</a> function is used to test this assumption by computing the correlation of the residual of the <a class="el" href="cox__prop__hazards_8sql__in.html#a737450bbfe0f10204b0074a9d45b0cef" title="Compute cox-regression coefficients and diagnostic statistics. ">coxph_train()</a> model with time.</p>
<p>Following is the syntax for the <a class="el" href="cox__prop__hazards_8sql__in.html#a682d95d5475ce33e47937067cadc2766" title="Test the proportional hazards assumption for a Cox regression model fit (coxph_train) ...">cox_zph()</a> function: </p>
<pre class="syntax">
cox_zph(cox_model_table, output_table)
</pre><p> <b>Arguments</b> </p>
<dl class="arglist">
<dt>cox_model_table </dt>
<dd><p class="startdd">TEXT. The name of the table containing the Cox Proportional-Hazards model.</p>
<p class="enddd"></p>
</dd>
<dt>output_table </dt>
<dd>TEXT. The name of the table where the test statistics are saved. It has the following columns: <table class="output">
<tr>
<th>covariate </th><td>TEXT. The independent variables. </td></tr>
<tr>
<th>rho </th><td>FLOAT8[]. Vector of the correlation coefficients between survival time and the scaled Schoenfeld residuals. </td></tr>
<tr>
<th>chi_square </th><td>FLOAT8[]. Chi-square test statistic for the correlation analysis. </td></tr>
<tr>
<th>p_value </th><td>FLOAT8[]. Two-sided p-value for the chi-square statistic. </td></tr>
</table>
</dd>
</dl>
<p>Additionally, the residual values are written to the table named <em>output_table</em>_residual. The table contains the following columns: </p>
<table class="output">
<tr>
<th>&lt;dep_column_name&gt; </th><td>FLOAT8. Time values (dependent variable) present in the original source table. </td></tr>
<tr>
<th>residual </th><td>FLOAT8[]. Difference between the original covariate values and the expectation of the covariates obtained from the coxph_train model. </td></tr>
<tr>
<th>scaled_residual </th><td>Residual values scaled by the variance of the coefficients. </td></tr>
</table>
<p><a class="anchor" id="notes"></a></p>
<dl class="section user"><dt>Notes</dt><dd><ul>
<li>Table names can be optionally schema qualified (current_schemas() is used if a schema name is not provided) and table and column names should follow case-sensitivity and quoting rules per the database. For instance, 'mytable' and 'MyTable' both resolve to the same entity&mdash;'mytable'. If mixed-case or multi-byte characters are desired for entity names then the string should be double-quoted; in this case the input would be '"MyTable"'.</li>
<li>The <a class="el" href="cox__prop__hazards_8sql__in.html#a3310cf98478b7c1e400e8fb1b3965d30">cox_prop_hazards_regr()</a> and <a class="el" href="cox__prop__hazards_8sql__in.html#ad778b289eb19ae0bb2b7e02a89bab3bc" title="Cox regression training function. ">cox_prop_hazards()</a> functions have been deprecated; <a class="el" href="cox__prop__hazards_8sql__in.html#a737450bbfe0f10204b0074a9d45b0cef" title="Compute cox-regression coefficients and diagnostic statistics. ">coxph_train()</a> should be used instead.</li>
</ul>
</dd></dl>
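<p>The naming rules in the first note can be sketched as follows. The schema name <em>public</em>, the mixed-case table <em>SampleData</em> and the output table names are assumptions made for illustration. </p>
<pre class="example">
-- Schema-qualified source table (the schema name is an assumption).
SELECT madlib.coxph_train('public.sample_data', 'sample_cox_q1', 'timedeath', 'ARRAY[grp,wbc]', 'status');
-- Hypothetical mixed-case table: the inner double quotes preserve the case.
SELECT madlib.coxph_train('"SampleData"', 'sample_cox_q2', 'timedeath', 'ARRAY[grp,wbc]', 'status');
</pre>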
<p><a class="anchor" id="examples"></a></p>
<dl class="section user"><dt>Examples</dt><dd><ol type="1">
<li>View online help for the proportional hazards training method. <pre class="example">
SELECT madlib.coxph_train();
</pre></li>
<li>Create an input data set. <pre class="example">
DROP TABLE IF EXISTS sample_data;
CREATE TABLE sample_data (
id INTEGER NOT NULL,
grp DOUBLE PRECISION,
wbc DOUBLE PRECISION,
timedeath INTEGER,
status BOOLEAN
);
COPY sample_data FROM STDIN WITH DELIMITER '|';
0 | 0 | 1.45 | 35 | t
1 | 0 | 1.47 | 34 | t
3 | 0 | 2.2 | 32 | t
4 | 0 | 1.78 | 25 | t
5 | 0 | 2.57 | 23 | t
6 | 0 | 2.32 | 22 | t
7 | 0 | 2.01 | 20 | t
8 | 0 | 2.05 | 19 | t
9 | 0 | 2.16 | 17 | t
10 | 0 | 3.6 | 16 | t
11 | 1 | 2.3 | 15 | t
12 | 0 | 2.88 | 13 | t
13 | 1 | 1.5 | 12 | t
14 | 0 | 2.6 | 11 | t
15 | 0 | 2.7 | 10 | t
16 | 0 | 2.8 | 9 | t
17 | 1 | 2.32 | 8 | t
18 | 0 | 4.43 | 7 | t
19 | 0 | 2.31 | 6 | t
20 | 1 | 3.49 | 5 | t
21 | 1 | 2.42 | 4 | t
22 | 1 | 4.01 | 3 | t
23 | 1 | 4.91 | 2 | t
24 | 1 | 5 | 1 | t
\.
</pre></li>
<li>Run the Cox regression function. <pre class="example">
SELECT madlib.coxph_train( 'sample_data',
'sample_cox',
'timedeath',
'ARRAY[grp,wbc]',
'status'
);
</pre></li>
<li>View the results of the regression. <pre class="example">
\x on
SELECT * FROM sample_cox;
</pre> Results: <pre class="result">
-[ RECORD 1 ]--+----------------------------------------------------------------------------
coef | {2.54407073265254,1.67172094779487}
loglikelihood | -37.8532498733
std_err | {0.677180599294897,0.387195514577534}
z_stats | {3.7568570855419,4.31751114064138}
p_values | {0.000172060691513886,1.5779844638453e-05}
hessian | {{2.78043065745617,-2.25848560642414},{-2.25848560642414,8.50472838284472}}
num_iterations | 5
</pre></li>
<li>View online help for the function to test Proportional Hazards Assumption. <pre class="example">
SELECT madlib.cox_zph();
</pre></li>
<li>Run the test for Proportional Hazards assumption to obtain correlation between residuals and time. <pre class="example">
SELECT madlib.cox_zph( 'sample_cox',
'sample_zph_output'
);
</pre></li>
<li>View results of the PHA test. <pre class="example">
SELECT * FROM sample_zph_output;
</pre> Results: <pre class="result">
-[ RECORD 1 ]-----------------------------------------
covariate | ARRAY[grp,wbc]
rho | {0.00237308357328641,0.0375600568840431}
chi_square | {0.000100675718191977,0.0232317400546175}
p_value | {0.991994376850758,0.878855984657948}
</pre></li>
</ol>
</dd></dl>
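<p>The auxiliary tables described earlier can be inspected in the same way. The queries below are a sketch that assumes the examples above have been run, so that the summary table is named <em>sample_cox_summary</em> and the residual table is named <em>sample_zph_output_residual</em>, with its time column named after the dependent variable <em>timedeath</em>. </p>
<pre class="example">
\x on
-- Parameters used to build the model, plus row counts.
SELECT * FROM sample_cox_summary;
-- Residuals per time value (first few rows only).
SELECT timedeath, residual, scaled_residual
FROM sample_zph_output_residual
ORDER BY timedeath
LIMIT 5;
</pre>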
<p><a class="anchor" id="background"></a></p>
<dl class="section user"><dt>Technical Background</dt><dd></dd></dl>
<p>Generally, proportional-hazard models start with a list of \( \boldsymbol n \) observations, each with \( \boldsymbol m \) covariates and a time of death. From this \( \boldsymbol n \times m \) matrix, we would like to derive the correlation between the covariates and the hazard function. This amounts to finding the parameters \( \boldsymbol \beta \) that best fit the model described below.</p>
<p>Let:</p>
<ul>
<li>\( \boldsymbol t \in \mathbf R^{n} \) denote the vector of observed dependent variables (times of death), with \( n \) rows.</li>
<li>\( X \in \mathbf R^{n \times m} \) denote the design matrix with \( m \) columns and \( n \) rows, containing all observed vectors of independent variables \( \boldsymbol x_i \) as rows.</li>
<li>\( R(t_i) \) denote the set of observations still alive at time \( t_i \).</li>
</ul>
<p>Note that this model <b>does not</b> include a <b>constant term</b>, and the data cannot contain a column of 1s.</p>
<p>By definition, </p>
<p class="formulaDsp">
\[ P[T_k = t_i | \boldsymbol R(t_i)] = \frac{e^{\beta^T x_k} }{ \sum_{j \in R(t_i)} e^{\beta^T x_j}} \,. \]
</p>
<p>The <b>partial likelihood </b>function can now be generated as the product of conditional probabilities: </p>
<p class="formulaDsp">
\[ \mathcal L = \prod_{i = 1}^n \left( \frac{e^{\beta^T x_i}}{ \sum_{j \in R(t_i)} e^{\beta^T x_j}} \right). \]
</p>
<p>The log-likelihood form of this equation is </p>
<p class="formulaDsp">
\[ L = \sum_{i = 1}^n \left[ \beta^T x_i - \log\left(\sum_{j \in R(t_i)} e^{\beta^T x_j }\right) \right]. \]
</p>
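<p>For reference, differentiating the log-likelihood above with respect to \( \boldsymbol \beta \) gives the score function and the Hessian on which a Newton-Raphson step relies. This is a sketch of the standard derivation. Taking \( H \) to be the negative of the second-derivative matrix is an assumption made here so that \( H \) is positive semi-definite. </p>
<p class="formulaDsp">
\[ \nabla_\beta L = \sum_{i = 1}^n \left[ x_i - \frac{\sum_{j \in R(t_i)} e^{\beta^T x_j} x_j}{\sum_{j \in R(t_i)} e^{\beta^T x_j}} \right] \]
</p>
<p class="formulaDsp">
\[ H = -\nabla_\beta^2 L = \sum_{i = 1}^n \left[ \frac{\sum_{j \in R(t_i)} e^{\beta^T x_j} x_j x_j^T}{\sum_{j \in R(t_i)} e^{\beta^T x_j}} - \frac{\left(\sum_{j \in R(t_i)} e^{\beta^T x_j} x_j\right) \left(\sum_{j \in R(t_i)} e^{\beta^T x_j} x_j\right)^T}{\left(\sum_{j \in R(t_i)} e^{\beta^T x_j}\right)^2} \right] \]
</p>
<p>With this sign convention, one Newton-Raphson update reads \( \boldsymbol \beta \leftarrow \boldsymbol \beta + H^{-1} \nabla_\beta L \).</p>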
<p>Using the score function (the gradient of \( L \) with respect to \( \boldsymbol \beta \)) and the Hessian matrix \( H \) of \( L \), the partial likelihood can be maximized using the <b> Newton-Raphson algorithm</b>. <b>Breslow's method</b> is used to resolve tied times of death. The times of death of two records are considered "equal" if they differ by less than 1.0e-6.</p>
<p>The inverse of the Hessian matrix, evaluated at the estimate of \( \boldsymbol \beta \), can be used as an <b>approximate variance-covariance matrix </b> for the estimate, and used to produce approximate <b>standard errors</b> for the regression coefficients.</p>
<p class="formulaDsp">
\[ \mathit{se}(c_i) = \sqrt{ \left( H^{-1} \right)_{ii} } \,. \]
</p>
<p> The Wald z-statistic is </p>
<p class="formulaDsp">
\[ z_i = \frac{c_i}{\mathit{se}(c_i)} \,. \]
</p>
<p>The Wald \( p \)-value for coefficient \( i \) gives the probability (under the assumptions inherent in the Wald test) of seeing a value at least as extreme as the one observed, provided that the null hypothesis ( \( c_i = 0 \)) is true. Letting \( F \) denote the cumulative distribution function of a standard normal distribution, the Wald \( p \)-value for coefficient \( i \) is therefore </p>
<p class="formulaDsp">
\[ p_i = \Pr(|Z| \geq |z_i|) = 2 \cdot (1 - F( |z_i| )) \]
</p>
<p> where \( Z \) is a standard normally distributed random variable.</p>
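<p>As a worked check, these quantities can be recomputed by hand from the <em>hessian</em> and <em>coef</em> values shown in the Results of the Examples section. The arithmetic below is a sketch with rounded values. It assumes the standard error is the square root of the corresponding diagonal entry of \( H^{-1} \), which reproduces the <em>std_err</em>, <em>z_stats</em> and <em>p_values</em> columns of that output. </p>
<p class="formulaDsp">
\[ H \approx \begin{pmatrix} 2.7804 & -2.2585 \\ -2.2585 & 8.5047 \end{pmatrix}, \qquad H^{-1} \approx \begin{pmatrix} 0.4586 & 0.1218 \\ 0.1218 & 0.1499 \end{pmatrix} \]
</p>
<p class="formulaDsp">
\[ \mathit{se} \approx \left( \sqrt{0.4586}, \sqrt{0.1499} \right) \approx (0.6772, 0.3872), \qquad z \approx \left( \frac{2.5441}{0.6772}, \frac{1.6717}{0.3872} \right) \approx (3.76, 4.32) \]
</p>
<p class="formulaDsp">
\[ p \approx \left( 2 (1 - F(3.76)), \; 2 (1 - F(4.32)) \right) \approx \left( 1.7 \times 10^{-4}, \; 1.6 \times 10^{-5} \right) \]
</p>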
<p>The condition number is computed as \( \kappa(H) \) during the iteration immediately <em>preceding</em> convergence (i.e., \( H \) is computed using the coefficients of the previous iteration). A large condition number (say, more than 1000) indicates the presence of significant multicollinearity.</p>
<p><a class="anchor" id="Literature"></a></p>
<dl class="section user"><dt>Literature</dt><dd></dd></dl>
<p>A somewhat random selection of nice write-ups, with valuable pointers into further literature:</p>
<p>[1] John Fox: Cox Proportional-Hazards Regression for Survival Data, Appendix to An R and S-PLUS companion to Applied Regression Feb 2012, <a href="http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-cox-regression.pdf">http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-cox-regression.pdf</a></p>
<p>[2] Stephen J Walters: What is a Cox model? <a href="http://www.medicine.ox.ac.uk/bandolier/painres/download/whatis/cox_model.pdf">http://www.medicine.ox.ac.uk/bandolier/painres/download/whatis/cox_model.pdf</a></p>
<p><a class="anchor" id="related"></a></p>
<dl class="section user"><dt>Related Topics</dt><dd></dd></dl>
<p>File <a class="el" href="cox__prop__hazards_8sql__in.html" title="SQL functions for cox proportional hazards. ">cox_prop_hazards.sql_in</a> documenting the functions </p>
</div><!-- contents -->
</div><!-- doc-content -->
<!-- start footer part -->
<div id="nav-path" class="navpath"><!-- id is needed for treeview function! -->
<ul>
<li class="footer">Generated on Thu Jul 3 2014 17:38:00 for MADlib by
<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.8.4 </li>
</ul>
</div>
</body>
</html>