{
"cells": [
{
"cell_type": "markdown",
"id": "d413184a",
"metadata": {},
"source": [
"# Hamilton - Time Series model\n",
"\n",
"#### Requirements:\n",
"\n",
"- Install dependencies (listed in `requirements.txt`)\n",
"- Download and decompress the data\n",
"\n",
"More details on how to set up your environment can be found [here](https://github.com/flaviassantos/hamilton/tree/main/examples/model_examples/time-series#set-up).\n",
"\n",
"***\n",
"\n",
"Uncomment and run the cell below if you are in a Google Colab environment. It will:\n",
"1. Mount Google Drive. You will be asked to authenticate and grant permissions.\n",
"2. Change directory to Google Drive.\n",
"3. Make a directory \"hamilton-tutorials\".\n",
"4. Change directory to it.\n",
"5. Clone this repository to your Google Drive.\n",
"6. Move your current directory to the hello_world example.\n",
"7. Install requirements.\n",
"\n",
"This means that any modifications will be saved, and you won't lose them if you close your browser."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "d7b66f17",
"metadata": {},
"outputs": [],
"source": [
"## 1. Mount google drive\n",
"# from google.colab import drive\n",
"# drive.mount('/content/drive')\n",
"## 2. Change directory to google drive.\n",
"# %cd /content/drive/MyDrive\n",
"## 3. Make a directory \"hamilton-tutorials\"\n",
"# !mkdir hamilton-tutorials\n",
"## 4. Change directory to it.\n",
"# %cd hamilton-tutorials\n",
"## 5. Clone this repository to your google drive\n",
"# !git clone https://github.com/DAGWorks-Inc/hamilton/\n",
"## 6. Move your current directory to the hello_world example\n",
"# %cd hamilton/examples/hello_world\n",
"## 7. Install requirements.\n",
"# %pip install -r requirements.txt\n",
"# from IPython.display import clear_output  # import required if you use clear_output below\n",
"# clear_output()  # optionally clear outputs\n",
"# To check your current working directory, type `!pwd` in a cell and run it."
]
},
{
"cell_type": "markdown",
"id": "a42e3dea",
"metadata": {},
"source": [
"***\n",
"This example shows how one might use Hamilton, using the M5 Forecasting Kaggle challenge as a case study.\n",
"\n",
">For demonstration purposes, the data used to train the model in this notebook has been reduced.\n",
"***"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "ab149af3",
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"import sys\n",
"import time\n",
"\n",
"import data_loaders\n",
"import model_pipeline\n",
"import pandas as pd\n",
"import transforms\n",
"\n",
"from hamilton import driver"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "61f0745a",
"metadata": {},
"outputs": [],
"source": [
"logger = logging.getLogger(__name__)\n",
"\n",
"\n",
"# These hyperparameters are hard-coded here, but they could be passed in or stored in a separate versioned file.\n",
"model_params = {\n",
" \"num_leaves\": 55,\n",
" \"min_child_weight\": 0.034,\n",
" \"feature_fraction\": 0.379,\n",
" \"bagging_fraction\": 0.418,\n",
" \"min_data_in_leaf\": 106,\n",
" \"objective\": \"regression\",\n",
" \"max_depth\": -1,\n",
" \"learning_rate\": 0.005,\n",
" \"boosting_type\": \"gbdt\",\n",
" \"bagging_seed\": 11,\n",
" \"metric\": \"rmse\",\n",
" \"verbosity\": -1,\n",
" \"reg_alpha\": 0.3899,\n",
" \"reg_lambda\": 0.648,\n",
" \"random_state\": 222,\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "896d8df9",
"metadata": {},
"outputs": [],
"source": [
"def main():\n",
" \"\"\"The main function to orchestrate everything.\"\"\"\n",
" start_time = time.time()\n",
" config = {\n",
" \"calendar_path\": \"m5-forecasting-accuracy/calendar.csv\",\n",
" \"sell_prices_path\": \"m5-forecasting-accuracy/sell_prices.csv\",\n",
" \"sales_train_validation_path\": \"m5-forecasting-accuracy/sales_train_validation.csv\",\n",
" \"submission_path\": \"m5-forecasting-accuracy/sample_submission.csv\",\n",
" \"load_test2\": \"False\",\n",
" \"n_fold\": 2,\n",
" \"model_params\": model_params,\n",
" \"num_rows_to_skip\": 2750000, # for training set\n",
" }\n",
" dr = driver.Driver(config, data_loaders, transforms, model_pipeline)\n",
" dr.display_all_functions(\"./all_functions.dot\", {\"format\": \"png\"})\n",
" dr.visualize_execution(\n",
" [\"kaggle_submission_df\"], \"./kaggle_submission_df.dot\", {\"format\": \"png\"}\n",
" )\n",
" kaggle_submission_df: pd.DataFrame = dr.execute([\"kaggle_submission_df\"])\n",
" duration = time.time() - start_time\n",
" logger.info(f\"Duration: {duration}\")\n",
" kaggle_submission_df.to_csv(\"kaggle_submission_df.csv\", index=False)\n",
" logger.info(f\"Shape of submission DF: {kaggle_submission_df.shape}\")\n",
" logger.info(kaggle_submission_df.head())"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "f996b4c9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"WARNING:hamilton.telemetry:Note: Hamilton collects completely anonymous data about usage. This will help us improve Hamilton over time. See https://github.com/dagworks-inc/hamilton#usage-analytics--data-privacy for details.\n",
"INFO:data_loaders:Loading from parquet.\n",
"INFO:data_loaders:submission has 60980 rows and 29 columns\n",
"INFO:data_loaders:Loading from parquet.\n",
"INFO:data_loaders:sales_train_validation has 3049 rows and 1919 columns\n",
"INFO:utils:sales_train_validation: Mem. usage decreased to 322.63 Mb (9.4% reduction)\n",
"INFO:data_loaders:Melted sales train validation has 5832737 rows and 8 columns\n",
"INFO:data_loaders:Loading from parquet.\n",
"INFO:utils:calendar: Mem. usage decreased to 0.12 Mb (41.9% reduction)\n",
"INFO:data_loaders:calendar has 1969 rows and 14 columns\n",
"INFO:data_loaders:Loading from parquet.\n",
"INFO:utils:sell_prices: Mem. usage decreased to 14.35 Mb (31.2% reduction)\n",
"INFO:data_loaders:sell_prices has 684112 rows and 4 columns\n",
"INFO:data_loaders:Our final dataset to train has 3936457 rows and 18 columns\n",
"INFO:model_pipeline:Fold: 1\n",
"Training until validation scores don't improve for 50 rounds\n",
"[100]\ttraining's rmse: 3.25014\tvalid_1's rmse: 2.51533\n",
"[200]\ttraining's rmse: 2.91563\tvalid_1's rmse: 2.25289\n",
"[300]\ttraining's rmse: 2.74582\tvalid_1's rmse: 2.13314\n",
"[400]\ttraining's rmse: 2.64634\tvalid_1's rmse: 2.07836\n",
"[500]\ttraining's rmse: 2.58538\tvalid_1's rmse: 2.05238\n",
"[600]\ttraining's rmse: 2.54025\tvalid_1's rmse: 2.03735\n",
"[700]\ttraining's rmse: 2.50289\tvalid_1's rmse: 2.02839\n",
"[800]\ttraining's rmse: 2.47336\tvalid_1's rmse: 2.02276\n",
"[900]\ttraining's rmse: 2.44919\tvalid_1's rmse: 2.01929\n",
"[1000]\ttraining's rmse: 2.42904\tvalid_1's rmse: 2.01676\n",
"[1100]\ttraining's rmse: 2.4128\tvalid_1's rmse: 2.01466\n",
"[1200]\ttraining's rmse: 2.39598\tvalid_1's rmse: 2.01299\n",
"[1300]\ttraining's rmse: 2.38076\tvalid_1's rmse: 2.01118\n",
"[1400]\ttraining's rmse: 2.36718\tvalid_1's rmse: 2.01036\n",
"[1500]\ttraining's rmse: 2.35452\tvalid_1's rmse: 2.0093\n",
"[1600]\ttraining's rmse: 2.34183\tvalid_1's rmse: 2.00807\n",
"[1700]\ttraining's rmse: 2.33171\tvalid_1's rmse: 2.00754\n",
"[1800]\ttraining's rmse: 2.32108\tvalid_1's rmse: 2.00679\n",
"[1900]\ttraining's rmse: 2.31152\tvalid_1's rmse: 2.00647\n",
"Early stopping, best iteration is:\n",
"[1904]\ttraining's rmse: 2.31109\tvalid_1's rmse: 2.00637\n",
"INFO:model_pipeline:val rmse score is 2.006370856623504\n",
"INFO:model_pipeline:Fold: 2\n",
"Training until validation scores don't improve for 50 rounds\n",
"[100]\ttraining's rmse: 2.90405\tvalid_1's rmse: 2.95735\n",
"[200]\ttraining's rmse: 2.59704\tvalid_1's rmse: 2.61429\n",
"[300]\ttraining's rmse: 2.4446\tvalid_1's rmse: 2.44886\n",
"[400]\ttraining's rmse: 2.36386\tvalid_1's rmse: 2.36853\n",
"[500]\ttraining's rmse: 2.31457\tvalid_1's rmse: 2.32917\n",
"[600]\ttraining's rmse: 2.28058\tvalid_1's rmse: 2.30873\n",
"[700]\ttraining's rmse: 2.25312\tvalid_1's rmse: 2.29866\n",
"[800]\ttraining's rmse: 2.23147\tvalid_1's rmse: 2.29243\n",
"[900]\ttraining's rmse: 2.21343\tvalid_1's rmse: 2.28946\n",
"[1000]\ttraining's rmse: 2.19836\tvalid_1's rmse: 2.28616\n",
"[1100]\ttraining's rmse: 2.18561\tvalid_1's rmse: 2.28446\n",
"[1200]\ttraining's rmse: 2.17414\tvalid_1's rmse: 2.28401\n",
"Early stopping, best iteration is:\n",
"[1173]\ttraining's rmse: 2.17722\tvalid_1's rmse: 2.28349\n",
"INFO:model_pipeline:val rmse score is 2.283491928626761\n",
"INFO:model_pipeline:mean rmse score over folds is 2.1449313926251325\n",
"INFO:__main__:Duration: 441.76528692245483\n",
"INFO:__main__:Shape of submission DF: (60980, 29)\n",
"INFO:__main__: id F1 F2 F3 F4 \\\n",
"0 HOBBIES_1_001_CA_1_validation 0.067195 0.066662 0.063192 0.063374 \n",
"1 HOBBIES_1_002_CA_1_validation 0.067195 0.066662 0.063192 0.063374 \n",
"2 HOBBIES_1_003_CA_1_validation 0.067195 0.066662 0.063192 0.063374 \n",
"3 HOBBIES_1_004_CA_1_validation 0.067195 0.066662 0.063192 0.063374 \n",
"4 HOBBIES_1_005_CA_1_validation 0.067195 0.066662 0.063192 0.063374 \n",
"\n",
" F5 F6 F7 F8 F9 ... F19 F20 \\\n",
"0 0.11702 0.119926 0.121935 0.0991 0.098581 ... 0.121182 0.12979 \n",
"1 0.11702 0.119926 0.121935 0.0991 0.098581 ... 0.121182 0.12979 \n",
"2 0.11702 0.119926 0.121935 0.0991 0.098581 ... 0.121182 0.12979 \n",
"3 0.11702 0.119926 0.121935 0.0991 0.098581 ... 0.121182 0.12979 \n",
"4 0.11702 0.119926 0.121935 0.0991 0.098581 ... 0.121182 0.12979 \n",
"\n",
" F21 F22 F23 F24 F25 F26 F27 \\\n",
"0 0.129668 0.08149 0.080942 0.078254 0.076383 0.124589 0.128385 \n",
"1 0.129668 0.08149 0.080942 0.078254 0.076383 0.124589 0.128385 \n",
"2 0.129668 0.08149 0.080942 0.078254 0.076383 0.124589 0.128385 \n",
"3 0.129668 0.08149 0.080942 0.078254 0.076383 0.124589 0.128385 \n",
"4 0.129668 0.08149 0.080942 0.078254 0.076383 0.124589 0.128385 \n",
"\n",
" F28 \n",
"0 0.128263 \n",
"1 0.128263 \n",
"2 0.128263 \n",
"3 0.128263 \n",
"4 0.128263 \n",
"\n",
"[5 rows x 29 columns]\n"
]
}
],
"source": [
"logging.basicConfig(level=logging.INFO, stream=sys.stdout)\n",
"main()"
]
},
{
"cell_type": "markdown",
"id": "ccc98425",
"metadata": {},
"source": [
"***\n",
"Here's the Kaggle Submission DAG that this code executes:\n",
"***\n",
"![DAG](kaggle_submission_df.dot.png)\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "time",
"language": "python",
"name": "time"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}