| { |
| "cells": [ |
| { |
| "cell_type": "markdown", |
| "id": "d413184a", |
| "metadata": {}, |
| "source": [ |
| "# Hamilton - Time Series model\n", |
| "\n", |
| "#### Requirements:\n", |
| "\n", |
| "- Install dependencies (listed in `requirements.txt`)\n", |
| "- Download and decompress the data\n", |
| "\n", |
| "More details on how to set up your environment can be found [here](https://github.com/flaviassantos/hamilton/tree/main/examples/model_examples/time-series#set-up).\n", |
| "\n", |
| "***\n", |
| "\n", |
| "Uncomment and run the cell below if you are in a Google Colab environment. It will:\n", |
| "1. Mount Google Drive. You will be asked to authenticate and grant permissions.\n", |
| "2. Change directory to Google Drive.\n", |
| "3. Make a directory named \"hamilton-tutorials\".\n", |
| "4. Change directory to it.\n", |
| "5. Clone this repository to your Google Drive.\n", |
| "6. Change directory to the time-series example.\n", |
| "7. Install the requirements.\n", |
| "\n", |
| "This means that any modifications will be saved, and you won't lose them if you close your browser." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 1, |
| "id": "d7b66f17", |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "## 1. Mount Google Drive\n", |
| "# from google.colab import drive\n", |
| "# drive.mount('/content/drive')\n", |
| "## 2. Change directory to Google Drive.\n", |
| "# %cd /content/drive/MyDrive\n", |
| "## 3. Make a directory \"hamilton-tutorials\"\n", |
| "# !mkdir hamilton-tutorials\n", |
| "## 4. Change directory to it.\n", |
| "# %cd hamilton-tutorials\n", |
| "## 5. Clone this repository to your Google Drive\n", |
| "# !git clone https://github.com/DAGWorks-Inc/hamilton/\n", |
| "## 6. Move your current directory to the time-series example\n", |
| "# %cd hamilton/examples/model_examples/time-series\n", |
| "## 7. Install requirements.\n", |
| "# %pip install -r requirements.txt\n", |
| "# from IPython.display import clear_output; clear_output()  # optionally clear outputs\n", |
| "# To check your current working directory you can type `!pwd` in a cell and run it." |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "a42e3dea", |
| "metadata": {}, |
| "source": [ |
| "***\n", |
| "This example shows how one might use Hamilton to build a time-series model for the M5 Forecasting Kaggle challenge.\n", |
| "\n", |
| ">For demonstration purposes, the data used to train the model in this notebook has been reduced.\n", |
| "***" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 3, |
| "id": "ab149af3", |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "import logging\n", |
| "import sys\n", |
| "import time\n", |
| "\n", |
| "import data_loaders\n", |
| "import model_pipeline\n", |
| "import pandas as pd\n", |
| "import transforms\n", |
| "\n", |
| "from hamilton import driver" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 4, |
| "id": "61f0745a", |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "logger = logging.getLogger(__name__)\n", |
| "\n", |
| "\n", |
| "# These parameters are hard-coded here, but they could be passed in or loaded from a versioned file.\n", |
| "model_params = {\n", |
| "    \"num_leaves\": 55,\n", |
| "    \"min_child_weight\": 0.034,\n", |
| "    \"feature_fraction\": 0.379,\n", |
| "    \"bagging_fraction\": 0.418,\n", |
| "    \"min_data_in_leaf\": 106,\n", |
| "    \"objective\": \"regression\",\n", |
| "    \"max_depth\": -1,\n", |
| "    \"learning_rate\": 0.005,\n", |
| "    \"boosting_type\": \"gbdt\",\n", |
| "    \"bagging_seed\": 11,\n", |
| "    \"metric\": \"rmse\",\n", |
| "    \"verbosity\": -1,\n", |
| "    \"reg_alpha\": 0.3899,\n", |
| "    \"reg_lambda\": 0.648,\n", |
| "    \"random_state\": 222,\n", |
| "}" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 5, |
| "id": "896d8df9", |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "def main():\n", |
| "    \"\"\"The main function to orchestrate everything.\"\"\"\n", |
| "    start_time = time.time()\n", |
| "    # Configuration and inputs passed to the Hamilton driver.\n", |
| "    config = {\n", |
| "        \"calendar_path\": \"m5-forecasting-accuracy/calendar.csv\",\n", |
| "        \"sell_prices_path\": \"m5-forecasting-accuracy/sell_prices.csv\",\n", |
| "        \"sales_train_validation_path\": \"m5-forecasting-accuracy/sales_train_validation.csv\",\n", |
| "        \"submission_path\": \"m5-forecasting-accuracy/sample_submission.csv\",\n", |
| "        \"load_test2\": \"False\",\n", |
| "        \"n_fold\": 2,\n", |
| "        \"model_params\": model_params,\n", |
| "        \"num_rows_to_skip\": 2750000,  # for training set\n", |
| "    }\n", |
| "    # Build the driver from the config and the modules whose functions define the DAG.\n", |
| "    dr = driver.Driver(config, data_loaders, transforms, model_pipeline)\n", |
| "    # Render the full DAG, then only the part needed to compute the requested output.\n", |
| "    dr.display_all_functions(\"./all_functions.dot\", {\"format\": \"png\"})\n", |
| "    dr.visualize_execution(\n", |
| "        [\"kaggle_submission_df\"], \"./kaggle_submission_df.dot\", {\"format\": \"png\"}\n", |
| "    )\n", |
| "    # Execute the DAG, requesting only the final submission output.\n", |
| "    kaggle_submission_df: pd.DataFrame = dr.execute([\"kaggle_submission_df\"])\n", |
| "    duration = time.time() - start_time  # elapsed wall-clock time in seconds\n", |
| "    logger.info(f\"Duration: {duration}\")\n", |
| "    kaggle_submission_df.to_csv(\"kaggle_submission_df.csv\", index=False)\n", |
| "    logger.info(f\"Shape of submission DF: {kaggle_submission_df.shape}\")\n", |
| "    logger.info(kaggle_submission_df.head())" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 6, |
| "id": "f996b4c9", |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "WARNING:hamilton.telemetry:Note: Hamilton collects completely anonymous data about usage. This will help us improve Hamilton over time. See https://github.com/dagworks-inc/hamilton#usage-analytics--data-privacy for details.\n", |
| "INFO:data_loaders:Loading from parquet.\n", |
| "INFO:data_loaders:submission has 60980 rows and 29 columns\n", |
| "INFO:data_loaders:Loading from parquet.\n", |
| "INFO:data_loaders:sales_train_validation has 3049 rows and 1919 columns\n", |
| "INFO:utils:sales_train_validation: Mem. usage decreased to 322.63 Mb (9.4% reduction)\n", |
| "INFO:data_loaders:Melted sales train validation has 5832737 rows and 8 columns\n", |
| "INFO:data_loaders:Loading from parquet.\n", |
| "INFO:utils:calendar: Mem. usage decreased to 0.12 Mb (41.9% reduction)\n", |
| "INFO:data_loaders:calendar has 1969 rows and 14 columns\n", |
| "INFO:data_loaders:Loading from parquet.\n", |
| "INFO:utils:sell_prices: Mem. usage decreased to 14.35 Mb (31.2% reduction)\n", |
| "INFO:data_loaders:sell_prices has 684112 rows and 4 columns\n", |
| "INFO:data_loaders:Our final dataset to train has 3936457 rows and 18 columns\n", |
| "INFO:model_pipeline:Fold: 1\n", |
| "Training until validation scores don't improve for 50 rounds\n", |
| "[100]\ttraining's rmse: 3.25014\tvalid_1's rmse: 2.51533\n", |
| "[200]\ttraining's rmse: 2.91563\tvalid_1's rmse: 2.25289\n", |
| "[300]\ttraining's rmse: 2.74582\tvalid_1's rmse: 2.13314\n", |
| "[400]\ttraining's rmse: 2.64634\tvalid_1's rmse: 2.07836\n", |
| "[500]\ttraining's rmse: 2.58538\tvalid_1's rmse: 2.05238\n", |
| "[600]\ttraining's rmse: 2.54025\tvalid_1's rmse: 2.03735\n", |
| "[700]\ttraining's rmse: 2.50289\tvalid_1's rmse: 2.02839\n", |
| "[800]\ttraining's rmse: 2.47336\tvalid_1's rmse: 2.02276\n", |
| "[900]\ttraining's rmse: 2.44919\tvalid_1's rmse: 2.01929\n", |
| "[1000]\ttraining's rmse: 2.42904\tvalid_1's rmse: 2.01676\n", |
| "[1100]\ttraining's rmse: 2.4128\tvalid_1's rmse: 2.01466\n", |
| "[1200]\ttraining's rmse: 2.39598\tvalid_1's rmse: 2.01299\n", |
| "[1300]\ttraining's rmse: 2.38076\tvalid_1's rmse: 2.01118\n", |
| "[1400]\ttraining's rmse: 2.36718\tvalid_1's rmse: 2.01036\n", |
| "[1500]\ttraining's rmse: 2.35452\tvalid_1's rmse: 2.0093\n", |
| "[1600]\ttraining's rmse: 2.34183\tvalid_1's rmse: 2.00807\n", |
| "[1700]\ttraining's rmse: 2.33171\tvalid_1's rmse: 2.00754\n", |
| "[1800]\ttraining's rmse: 2.32108\tvalid_1's rmse: 2.00679\n", |
| "[1900]\ttraining's rmse: 2.31152\tvalid_1's rmse: 2.00647\n", |
| "Early stopping, best iteration is:\n", |
| "[1904]\ttraining's rmse: 2.31109\tvalid_1's rmse: 2.00637\n", |
| "INFO:model_pipeline:val rmse score is 2.006370856623504\n", |
| "INFO:model_pipeline:Fold: 2\n", |
| "Training until validation scores don't improve for 50 rounds\n", |
| "[100]\ttraining's rmse: 2.90405\tvalid_1's rmse: 2.95735\n", |
| "[200]\ttraining's rmse: 2.59704\tvalid_1's rmse: 2.61429\n", |
| "[300]\ttraining's rmse: 2.4446\tvalid_1's rmse: 2.44886\n", |
| "[400]\ttraining's rmse: 2.36386\tvalid_1's rmse: 2.36853\n", |
| "[500]\ttraining's rmse: 2.31457\tvalid_1's rmse: 2.32917\n", |
| "[600]\ttraining's rmse: 2.28058\tvalid_1's rmse: 2.30873\n", |
| "[700]\ttraining's rmse: 2.25312\tvalid_1's rmse: 2.29866\n", |
| "[800]\ttraining's rmse: 2.23147\tvalid_1's rmse: 2.29243\n", |
| "[900]\ttraining's rmse: 2.21343\tvalid_1's rmse: 2.28946\n", |
| "[1000]\ttraining's rmse: 2.19836\tvalid_1's rmse: 2.28616\n", |
| "[1100]\ttraining's rmse: 2.18561\tvalid_1's rmse: 2.28446\n", |
| "[1200]\ttraining's rmse: 2.17414\tvalid_1's rmse: 2.28401\n", |
| "Early stopping, best iteration is:\n", |
| "[1173]\ttraining's rmse: 2.17722\tvalid_1's rmse: 2.28349\n", |
| "INFO:model_pipeline:val rmse score is 2.283491928626761\n", |
| "INFO:model_pipeline:mean rmse score over folds is 2.1449313926251325\n", |
| "INFO:__main__:Duration: 441.76528692245483\n", |
| "INFO:__main__:Shape of submission DF: (60980, 29)\n", |
| "INFO:__main__: id F1 F2 F3 F4 \\\n", |
| "0 HOBBIES_1_001_CA_1_validation 0.067195 0.066662 0.063192 0.063374 \n", |
| "1 HOBBIES_1_002_CA_1_validation 0.067195 0.066662 0.063192 0.063374 \n", |
| "2 HOBBIES_1_003_CA_1_validation 0.067195 0.066662 0.063192 0.063374 \n", |
| "3 HOBBIES_1_004_CA_1_validation 0.067195 0.066662 0.063192 0.063374 \n", |
| "4 HOBBIES_1_005_CA_1_validation 0.067195 0.066662 0.063192 0.063374 \n", |
| "\n", |
| " F5 F6 F7 F8 F9 ... F19 F20 \\\n", |
| "0 0.11702 0.119926 0.121935 0.0991 0.098581 ... 0.121182 0.12979 \n", |
| "1 0.11702 0.119926 0.121935 0.0991 0.098581 ... 0.121182 0.12979 \n", |
| "2 0.11702 0.119926 0.121935 0.0991 0.098581 ... 0.121182 0.12979 \n", |
| "3 0.11702 0.119926 0.121935 0.0991 0.098581 ... 0.121182 0.12979 \n", |
| "4 0.11702 0.119926 0.121935 0.0991 0.098581 ... 0.121182 0.12979 \n", |
| "\n", |
| " F21 F22 F23 F24 F25 F26 F27 \\\n", |
| "0 0.129668 0.08149 0.080942 0.078254 0.076383 0.124589 0.128385 \n", |
| "1 0.129668 0.08149 0.080942 0.078254 0.076383 0.124589 0.128385 \n", |
| "2 0.129668 0.08149 0.080942 0.078254 0.076383 0.124589 0.128385 \n", |
| "3 0.129668 0.08149 0.080942 0.078254 0.076383 0.124589 0.128385 \n", |
| "4 0.129668 0.08149 0.080942 0.078254 0.076383 0.124589 0.128385 \n", |
| "\n", |
| " F28 \n", |
| "0 0.128263 \n", |
| "1 0.128263 \n", |
| "2 0.128263 \n", |
| "3 0.128263 \n", |
| "4 0.128263 \n", |
| "\n", |
| "[5 rows x 29 columns]\n" |
| ] |
| } |
| ], |
| "source": [ |
| "logging.basicConfig(level=logging.INFO, stream=sys.stdout)\n", |
| "main()" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "ccc98425", |
| "metadata": {}, |
| "source": [ |
| "***\n", |
| "Here's the Kaggle Submission DAG that this code executes:\n", |
| "***\n", |
| "\n" |
| ] |
| } |
| ], |
| "metadata": { |
| "kernelspec": { |
| "display_name": "time", |
| "language": "python", |
| "name": "time" |
| }, |
| "language_info": { |
| "codemirror_mode": { |
| "name": "ipython", |
| "version": 3 |
| }, |
| "file_extension": ".py", |
| "mimetype": "text/x-python", |
| "name": "python", |
| "nbconvert_exporter": "python", |
| "pygments_lexer": "ipython3", |
| "version": "3.11.3" |
| } |
| }, |
| "nbformat": 4, |
| "nbformat_minor": 5 |
| } |