blob: 57cbae3283f466850770ac8ee84ea0e5a2a01941 [file] [log] [blame]
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Path function\n",
"\n",
"The goal of the MADlib path function is to perform regular pattern matching over a sequence of rows, and to extract useful information about the pattern matches. The useful information could be a simple count of matches or something more involved like aggregations or window functions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"%load_ext sql"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"u'Connected: fmcquillan@madlib'"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# %sql postgresql://gpdbchina@10.194.10.68:55000/madlib\n",
"%sql postgresql://fmcquillan@localhost:5432/madlib"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%sql select madlib.version();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data set describes shopper behavior on a notional web site that sells beer and wine. A beacon fires an event to a log file when the shopper visits different pages on the site: landing page, beer selection page, wine selection page, and checkout. Other pages on the site like help pages show up in the logs as well. Let’s assume that the log has been sessionized."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n",
"Done.\n",
"34 rows affected.\n",
"34 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>event_timestamp</th>\n",
" <th>user_id</th>\n",
" <th>session_id</th>\n",
" <th>page</th>\n",
" <th>revenue</th>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 01:03:00</td>\n",
" <td>100821</td>\n",
" <td>100</td>\n",
" <td>LANDING</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 01:04:00</td>\n",
" <td>100821</td>\n",
" <td>100</td>\n",
" <td>WINE</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 01:05:00</td>\n",
" <td>102201</td>\n",
" <td>106</td>\n",
" <td>LANDING</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 01:05:00</td>\n",
" <td>100821</td>\n",
" <td>100</td>\n",
" <td>CHECKOUT</td>\n",
" <td>39.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 01:06:00</td>\n",
" <td>102201</td>\n",
" <td>106</td>\n",
" <td>HELP</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 01:09:00</td>\n",
" <td>102201</td>\n",
" <td>106</td>\n",
" <td>LANDING</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 01:15:00</td>\n",
" <td>101121</td>\n",
" <td>102</td>\n",
" <td>LANDING</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 01:16:00</td>\n",
" <td>101121</td>\n",
" <td>102</td>\n",
" <td>WINE</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 01:17:00</td>\n",
" <td>101121</td>\n",
" <td>102</td>\n",
" <td>CHECKOUT</td>\n",
" <td>15.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 01:18:00</td>\n",
" <td>101121</td>\n",
" <td>102</td>\n",
" <td>LANDING</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 01:19:00</td>\n",
" <td>101121</td>\n",
" <td>102</td>\n",
" <td>HELP</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 01:21:00</td>\n",
" <td>101121</td>\n",
" <td>102</td>\n",
" <td>WINE</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 01:22:00</td>\n",
" <td>101121</td>\n",
" <td>102</td>\n",
" <td>CHECKOUT</td>\n",
" <td>23.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 02:06:00</td>\n",
" <td>100821</td>\n",
" <td>101</td>\n",
" <td>WINE</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 02:09:00</td>\n",
" <td>100821</td>\n",
" <td>101</td>\n",
" <td>WINE</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 02:15:00</td>\n",
" <td>102201</td>\n",
" <td>107</td>\n",
" <td>WINE</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 02:15:00</td>\n",
" <td>101331</td>\n",
" <td>103</td>\n",
" <td>LANDING</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 02:16:00</td>\n",
" <td>102201</td>\n",
" <td>107</td>\n",
" <td>BEER</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 02:16:00</td>\n",
" <td>101331</td>\n",
" <td>103</td>\n",
" <td>WINE</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 02:17:00</td>\n",
" <td>101331</td>\n",
" <td>103</td>\n",
" <td>HELP</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 02:17:00</td>\n",
" <td>103711</td>\n",
" <td>109</td>\n",
" <td>BEER</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 02:17:00</td>\n",
" <td>102201</td>\n",
" <td>107</td>\n",
" <td>WINE</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 02:18:00</td>\n",
" <td>102871</td>\n",
" <td>108</td>\n",
" <td>BEER</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 02:18:00</td>\n",
" <td>103711</td>\n",
" <td>109</td>\n",
" <td>LANDING</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 02:18:00</td>\n",
" <td>101331</td>\n",
" <td>103</td>\n",
" <td>WINE</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 02:19:00</td>\n",
" <td>103711</td>\n",
" <td>109</td>\n",
" <td>WINE</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 02:19:00</td>\n",
" <td>101331</td>\n",
" <td>103</td>\n",
" <td>CHECKOUT</td>\n",
" <td>16.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 02:19:00</td>\n",
" <td>102871</td>\n",
" <td>108</td>\n",
" <td>WINE</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 02:22:00</td>\n",
" <td>101443</td>\n",
" <td>104</td>\n",
" <td>BEER</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 02:22:00</td>\n",
" <td>102871</td>\n",
" <td>108</td>\n",
" <td>CHECKOUT</td>\n",
" <td>21.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 02:25:00</td>\n",
" <td>102871</td>\n",
" <td>108</td>\n",
" <td>LANDING</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 02:25:00</td>\n",
" <td>101443</td>\n",
" <td>104</td>\n",
" <td>CHECKOUT</td>\n",
" <td>12.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 02:29:00</td>\n",
" <td>101881</td>\n",
" <td>105</td>\n",
" <td>LANDING</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 02:30:00</td>\n",
" <td>101881</td>\n",
" <td>105</td>\n",
" <td>BEER</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(datetime.datetime(2015, 4, 15, 1, 3), 100821, 100, u'LANDING', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 1, 4), 100821, 100, u'WINE', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 1, 5), 102201, 106, u'LANDING', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 1, 5), 100821, 100, u'CHECKOUT', 39.0),\n",
" (datetime.datetime(2015, 4, 15, 1, 6), 102201, 106, u'HELP', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 1, 9), 102201, 106, u'LANDING', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 1, 15), 101121, 102, u'LANDING', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 1, 16), 101121, 102, u'WINE', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 1, 17), 101121, 102, u'CHECKOUT', 15.0),\n",
" (datetime.datetime(2015, 4, 15, 1, 18), 101121, 102, u'LANDING', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 1, 19), 101121, 102, u'HELP', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 1, 21), 101121, 102, u'WINE', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 1, 22), 101121, 102, u'CHECKOUT', 23.0),\n",
" (datetime.datetime(2015, 4, 15, 2, 6), 100821, 101, u'WINE', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 2, 9), 100821, 101, u'WINE', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 2, 15), 102201, 107, u'WINE', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 2, 15), 101331, 103, u'LANDING', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 2, 16), 102201, 107, u'BEER', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 2, 16), 101331, 103, u'WINE', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 2, 17), 101331, 103, u'HELP', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 2, 17), 103711, 109, u'BEER', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 2, 17), 102201, 107, u'WINE', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 2, 18), 102871, 108, u'BEER', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 2, 18), 103711, 109, u'LANDING', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 2, 18), 101331, 103, u'WINE', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 2, 19), 103711, 109, u'WINE', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 2, 19), 101331, 103, u'CHECKOUT', 16.0),\n",
" (datetime.datetime(2015, 4, 15, 2, 19), 102871, 108, u'WINE', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 2, 22), 101443, 104, u'BEER', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 2, 22), 102871, 108, u'CHECKOUT', 21.0),\n",
" (datetime.datetime(2015, 4, 15, 2, 25), 102871, 108, u'LANDING', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 2, 25), 101443, 104, u'CHECKOUT', 12.0),\n",
" (datetime.datetime(2015, 4, 15, 2, 29), 101881, 105, u'LANDING', 0.0),\n",
" (datetime.datetime(2015, 4, 15, 2, 30), 101881, 105, u'BEER', 0.0)]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql \n",
"DROP TABLE IF EXISTS eventlog, path_output, path_output_tuples CASCADE;\n",
"CREATE TABLE eventlog (event_timestamp TIMESTAMP,\n",
" user_id INT,\n",
" session_id INT,\n",
" page TEXT,\n",
" revenue FLOAT);\n",
"INSERT INTO eventlog VALUES\n",
"('04/15/2015 01:03:00', 100821, 100, 'LANDING', 0),\n",
"('04/15/2015 01:04:00', 100821, 100, 'WINE', 0),\n",
"('04/15/2015 01:05:00', 100821, 100, 'CHECKOUT', 39),\n",
"('04/15/2015 02:06:00', 100821, 101, 'WINE', 0),\n",
"('04/15/2015 02:09:00', 100821, 101, 'WINE', 0),\n",
"('04/15/2015 01:15:00', 101121, 102, 'LANDING', 0),\n",
"('04/15/2015 01:16:00', 101121, 102, 'WINE', 0),\n",
"('04/15/2015 01:17:00', 101121, 102, 'CHECKOUT', 15),\n",
"('04/15/2015 01:18:00', 101121, 102, 'LANDING', 0),\n",
"('04/15/2015 01:19:00', 101121, 102, 'HELP', 0),\n",
"('04/15/2015 01:21:00', 101121, 102, 'WINE', 0),\n",
"('04/15/2015 01:22:00', 101121, 102, 'CHECKOUT', 23),\n",
"('04/15/2015 02:15:00', 101331, 103, 'LANDING', 0),\n",
"('04/15/2015 02:16:00', 101331, 103, 'WINE', 0),\n",
"('04/15/2015 02:17:00', 101331, 103, 'HELP', 0),\n",
"('04/15/2015 02:18:00', 101331, 103, 'WINE', 0),\n",
"('04/15/2015 02:19:00', 101331, 103, 'CHECKOUT', 16),\n",
"('04/15/2015 02:22:00', 101443, 104, 'BEER', 0),\n",
"('04/15/2015 02:25:00', 101443, 104, 'CHECKOUT', 12),\n",
"('04/15/2015 02:29:00', 101881, 105, 'LANDING', 0),\n",
"('04/15/2015 02:30:00', 101881, 105, 'BEER', 0),\n",
"('04/15/2015 01:05:00', 102201, 106, 'LANDING', 0),\n",
"('04/15/2015 01:06:00', 102201, 106, 'HELP', 0),\n",
"('04/15/2015 01:09:00', 102201, 106, 'LANDING', 0),\n",
"('04/15/2015 02:15:00', 102201, 107, 'WINE', 0),\n",
"('04/15/2015 02:16:00', 102201, 107, 'BEER', 0),\n",
"('04/15/2015 02:17:00', 102201, 107, 'WINE', 0),\n",
"('04/15/2015 02:18:00', 102871, 108, 'BEER', 0),\n",
"('04/15/2015 02:19:00', 102871, 108, 'WINE', 0),\n",
"('04/15/2015 02:22:00', 102871, 108, 'CHECKOUT', 21),\n",
"('04/15/2015 02:25:00', 102871, 108, 'LANDING', 0),\n",
"('04/15/2015 02:17:00', 103711, 109, 'BEER', 0),\n",
"('04/15/2015 02:18:00', 103711, 109, 'LANDING', 0),\n",
"('04/15/2015 02:19:00', 103711, 109, 'WINE', 0);\n",
"\n",
"SELECT * FROM eventlog ORDER BY event_timestamp ASC;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calculate the revenue by checkout:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"%%sql\n",
"SELECT madlib.path(\n",
" 'eventlog', -- Name of input table\n",
" 'path_output', -- Table name to store path results\n",
" 'session_id', -- Partition input table by session\n",
" 'event_timestamp ASC', -- Order partitions in input table by time\n",
" 'buy:=page=''CHECKOUT''', -- Define a symbol for checkout events\n",
" '(buy)', -- Pattern search: purchase\n",
" 'sum(revenue) as checkout_rev', -- Aggregate: sum revenue by checkout\n",
" TRUE -- Persist matches\n",
" );\n",
"SELECT * FROM path_output ORDER BY session_id, match_id;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that there are 2 checkouts within session 102, which is apparent from the 'match_id' column. This serves to illustrate that the 'aggregate_func' operates on a per pattern match basis, not on a per partition basis. If in fact we wanted revenue by partition ('session_id' in this example), then we could do:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%sql\n",
"SELECT session_id, sum(checkout_rev) FROM path_output GROUP BY session_id ORDER BY session_id;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we set TRUE for 'persist_rows', we can view the associated pattern matches:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%sql\n",
"SELECT * FROM path_output_tuples ORDER BY session_id ASC, event_timestamp ASC;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the 'symbol' and 'match_id' columns are added to the right of the matched rows.\n",
"\n",
"We are interested in sessions with an order placed within 4 pages of entering the shopping site via the landing page. We represent this by the regular expression: '(land)[^(land)(buy)]{0,2}(buy)'. In other words, visit to the landing page followed by from 0 to 2 non-entry, non-sale pages, followed by a purchase. The SQL is as follows:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n",
"1 rows affected.\n",
"3 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>session_id</th>\n",
" <th>match_id</th>\n",
" <th>checkout_rev</th>\n",
" </tr>\n",
" <tr>\n",
" <td>100</td>\n",
" <td>1.0</td>\n",
" <td>39.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>102</td>\n",
" <td>1.0</td>\n",
" <td>15.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>102</td>\n",
" <td>2.0</td>\n",
" <td>23.0</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(100, 1.0, 39.0), (102, 1.0, 15.0), (102, 2.0, 23.0)]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"DROP TABLE IF EXISTS path_output, path_output_tuples;\n",
"SELECT madlib.path(\n",
" 'eventlog', -- Name of input table\n",
" 'path_output', -- Table name to store path results\n",
" 'session_id', -- Partition input table by session\n",
" 'event_timestamp ASC', -- Order partitions in input table by time\n",
" $$ land:=page='LANDING',\n",
" wine:=page='WINE',\n",
" beer:=page='BEER',\n",
" buy:=page='CHECKOUT',\n",
" other:=page<>'LANDING' AND page<>'WINE' AND page<>'BEER' AND page<>'CHECKOUT'\n",
" $$, -- Symbols for page types\n",
"\n",
" '(land)[^(land)(buy)]{0,2}(buy)', -- Purchase within 4 pages entering site\n",
" 'sum(revenue) as checkout_rev', -- Aggregate: sum revenue by checkout\n",
" TRUE -- Persist matches\n",
" );\n",
"SELECT * FROM path_output ORDER BY session_id, match_id;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now view the associated pattern matches:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"10 rows affected.\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <th>event_timestamp</th>\n",
" <th>user_id</th>\n",
" <th>session_id</th>\n",
" <th>page</th>\n",
" <th>revenue</th>\n",
" <th>symbol</th>\n",
" <th>match_id</th>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 01:03:00</td>\n",
" <td>100821</td>\n",
" <td>100</td>\n",
" <td>LANDING</td>\n",
" <td>0.0</td>\n",
" <td>land</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 01:04:00</td>\n",
" <td>100821</td>\n",
" <td>100</td>\n",
" <td>WINE</td>\n",
" <td>0.0</td>\n",
" <td>wine</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 01:05:00</td>\n",
" <td>100821</td>\n",
" <td>100</td>\n",
" <td>CHECKOUT</td>\n",
" <td>39.0</td>\n",
" <td>buy</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 01:15:00</td>\n",
" <td>101121</td>\n",
" <td>102</td>\n",
" <td>LANDING</td>\n",
" <td>0.0</td>\n",
" <td>land</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 01:16:00</td>\n",
" <td>101121</td>\n",
" <td>102</td>\n",
" <td>WINE</td>\n",
" <td>0.0</td>\n",
" <td>wine</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 01:17:00</td>\n",
" <td>101121</td>\n",
" <td>102</td>\n",
" <td>CHECKOUT</td>\n",
" <td>15.0</td>\n",
" <td>buy</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 01:18:00</td>\n",
" <td>101121</td>\n",
" <td>102</td>\n",
" <td>LANDING</td>\n",
" <td>0.0</td>\n",
" <td>land</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 01:19:00</td>\n",
" <td>101121</td>\n",
" <td>102</td>\n",
" <td>HELP</td>\n",
" <td>0.0</td>\n",
" <td>other</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 01:21:00</td>\n",
" <td>101121</td>\n",
" <td>102</td>\n",
" <td>WINE</td>\n",
" <td>0.0</td>\n",
" <td>wine</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-04-15 01:22:00</td>\n",
" <td>101121</td>\n",
" <td>102</td>\n",
" <td>CHECKOUT</td>\n",
" <td>23.0</td>\n",
" <td>buy</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
"</table>"
],
"text/plain": [
"[(datetime.datetime(2015, 4, 15, 1, 3), 100821, 100, u'LANDING', 0.0, u'land', 1.0),\n",
" (datetime.datetime(2015, 4, 15, 1, 4), 100821, 100, u'WINE', 0.0, u'wine', 1.0),\n",
" (datetime.datetime(2015, 4, 15, 1, 5), 100821, 100, u'CHECKOUT', 39.0, u'buy', 1.0),\n",
" (datetime.datetime(2015, 4, 15, 1, 15), 101121, 102, u'LANDING', 0.0, u'land', 1.0),\n",
" (datetime.datetime(2015, 4, 15, 1, 16), 101121, 102, u'WINE', 0.0, u'wine', 1.0),\n",
" (datetime.datetime(2015, 4, 15, 1, 17), 101121, 102, u'CHECKOUT', 15.0, u'buy', 1.0),\n",
" (datetime.datetime(2015, 4, 15, 1, 18), 101121, 102, u'LANDING', 0.0, u'land', 2.0),\n",
" (datetime.datetime(2015, 4, 15, 1, 19), 101121, 102, u'HELP', 0.0, u'other', 2.0),\n",
" (datetime.datetime(2015, 4, 15, 1, 21), 101121, 102, u'WINE', 0.0, u'wine', 2.0),\n",
" (datetime.datetime(2015, 4, 15, 1, 22), 101121, 102, u'CHECKOUT', 23.0, u'buy', 2.0)]"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"SELECT * FROM path_output_tuples ORDER BY session_id ASC, event_timestamp ASC;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instances where a purchase is made within 4 pages of entering a site, compute the elasped time to checkout:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%sql\n",
"DROP TABLE IF EXISTS path_output, path_output_tuples;\n",
"SELECT madlib.path(\n",
" 'eventlog', -- Name of input table\n",
" 'path_output', -- Table name to store path results\n",
" 'session_id', -- Partition input table by session\n",
" 'event_timestamp ASC', -- Order partitions in input table by time\n",
" $$ land:=page='LANDING',\n",
" wine:=page='WINE',\n",
" beer:=page='BEER',\n",
" buy:=page='CHECKOUT',\n",
" other:=page<>'LANDING' AND page<>'WINE' AND page<>'BEER' AND page<>'CHECKOUT'\n",
" $$, -- Symbols for page types\n",
" '(land)[^(land)(buy)]{0,2}(buy)', -- Purchase within 4 pages entering site\n",
" '(max(event_timestamp)-min(event_timestamp)) as elapsed_time', -- Aggregate: elapsed time\n",
" TRUE -- Persist matches\n",
" );\n",
"SELECT * FROM path_output ORDER BY session_id, match_id;\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We may want to use a window function instead of an aggregate. You can write window functions on the output tuples to achieve the desired result. Continuing the previous example, let’s say we want to compute average revenue for checkouts within 4 pages of entering the shopping site via the landing page:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%sql\n",
"SELECT DATE(event_timestamp), user_id, session_id, revenue,\n",
" avg(revenue) OVER (PARTITION BY DATE(event_timestamp)) as avg_checkout_rev\n",
" FROM path_output_tuples\n",
" WHERE page='CHECKOUT'\n",
" ORDER BY user_id, session_id;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we want to do a golden path analysis to find the most successful shopper paths through the site. Since our data set is small, we decide this means the most frequently viewed page just before a checkout is made:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%sql\n",
"DROP TABLE IF EXISTS path_output, path_output_tuples;\n",
"SELECT madlib.path(\n",
" 'eventlog', -- Name of input table\n",
" 'path_output', -- Table name to store path results\n",
" 'session_id', -- Partition input table by session\n",
" 'event_timestamp ASC', -- Order partitions in input table by time\n",
" $$ land:=page='LANDING',\n",
" wine:=page='WINE',\n",
" beer:=page='BEER',\n",
" buy:=page='CHECKOUT',\n",
" other:=page<>'LANDING' AND page<>'WINE' AND page<>'BEER' AND page<>'CHECKOUT'\n",
" $$, -- Symbols for page types\n",
" '[^(buy)](buy)', -- Pattern to match\n",
" 'array_agg(page ORDER BY session_id ASC, event_timestamp ASC) as page_path');\n",
" \n",
"SELECT count(*), page_path from\n",
" (SELECT * FROM path_output) q\n",
"GROUP BY page_path\n",
"ORDER BY count(*) DESC\n",
"LIMIT 10;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are only 2 different paths. The wine page is viewed more frequently than the beer page just before checkout.\n",
"\n",
"To demonstrate the use of 'overlapping_patterns', consider a pattern with at least one page followed by and ending with a checkout:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%sql\n",
"DROP TABLE IF EXISTS path_output, path_output_tuples;\n",
"SELECT madlib.path( \n",
" 'eventlog', -- Name of the table \n",
" 'path_output', -- Table name to store the path results \n",
" 'session_id', -- Partition by session \n",
" 'event_timestamp ASC', -- Order partitions in input table by time \n",
" $$ nobuy:=page<>'CHECKOUT',\n",
" buy:=page='CHECKOUT'\n",
" $$, -- Definition of symbols used in the pattern definition \n",
" '(nobuy)+(buy)', -- At least one page followed by and ending with a CHECKOUT.\n",
" 'array_agg(page ORDER BY session_id ASC, event_timestamp ASC) as page_path', \n",
" FALSE, -- Don't persist matches\n",
" TRUE -- Turn on overlapping patterns\n",
" );\n",
"SELECT * FROM path_output ORDER BY session_id, match_id;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With overlap turned off, the result is:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%sql\n",
"DROP TABLE IF EXISTS path_output, path_output_tuples;\n",
"SELECT madlib.path( \n",
" 'eventlog', -- Name of the table \n",
" 'path_output', -- Table name to store the path results \n",
" 'session_id', -- Partition by session \n",
" 'event_timestamp ASC', -- Order partitions in input table by time \n",
" $$ nobuy:=page<>'CHECKOUT',\n",
" buy:=page='CHECKOUT'\n",
" $$, -- Definition of symbols used in the pattern definition \n",
" '(nobuy)+(buy)', -- At least one page followed by and ending with a CHECKOUT.\n",
" 'array_agg(page ORDER BY session_id ASC, event_timestamp ASC) as page_path', \n",
" FALSE, -- Don't persist matches\n",
" FALSE -- Turn on overlapping patterns\n",
" );\n",
"SELECT * FROM path_output ORDER BY session_id, match_id;"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 0
}