{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Neural Collaborative Filtering (NCF)\n",
"\n",
"This examples trains a neural network on the MovieLens data set using the concept of [Neural Collaborative Filtering (NCF)](https://dl.acm.org/doi/abs/10.1145/3038912.3052569) that is aimed at approaching recommendation problems using deep neural networks as opposed to common matrix factorization approaches."
]
},
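{
"cell_type": "markdown",
"metadata": {},
"source": [
"The core idea can be illustrated with a small NumPy sketch of an NCF-style forward pass, following the fused architecture described in the linked paper: a generalized matrix factorization (GMF) branch multiplies user and item embeddings element-wise, a multi-layer perceptron (MLP) branch processes their concatenation, and both are combined into a single interaction probability. All sizes, weights, and names below are hypothetical and randomly initialized; this is not the SystemDS implementation used later in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical sizes and untrained random weights, for illustration only\n",
"rng = np.random.default_rng(0)\n",
"n_users, n_items, dim = 60, 50, 8\n",
"\n",
"user_emb_gmf = rng.normal(scale=0.1, size=(n_users, dim))\n",
"item_emb_gmf = rng.normal(scale=0.1, size=(n_items, dim))\n",
"user_emb_mlp = rng.normal(scale=0.1, size=(n_users, dim))\n",
"item_emb_mlp = rng.normal(scale=0.1, size=(n_items, dim))\n",
"W1 = rng.normal(scale=0.1, size=(2 * dim, dim))\n",
"b1 = np.zeros(dim)\n",
"w_out = rng.normal(scale=0.1, size=2 * dim)\n",
"\n",
"def ncf_score(users, items):\n",
"    # GMF branch: element-wise product of user and item embeddings\n",
"    gmf = user_emb_gmf[users] * item_emb_gmf[items]\n",
"    # MLP branch: concatenated embeddings passed through one ReLU layer\n",
"    mlp_in = np.concatenate([user_emb_mlp[users], item_emb_mlp[items]], axis=1)\n",
"    mlp = np.maximum(mlp_in @ W1 + b1, 0.0)\n",
"    # Fuse both branches and squash to an interaction probability\n",
"    logits = np.concatenate([gmf, mlp], axis=1) @ w_out\n",
"    return 1.0 / (1.0 + np.exp(-logits))\n",
"\n",
"# one probability per (user, item) pair\n",
"print(ncf_score(np.array([0, 1, 2]), np.array([10, 20, 30])))"
]
},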
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup and Imports"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download Data - MovieLens"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The MovieLens data set is provided by the Unniversity of Minnesota and the GroupLens Research Group:\n",
"\n",
"> This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org/), a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.<br/>\n",
"Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.<br/>\n",
"The data are contained in the files links.csv, movies.csv, ratings.csv and tags.csv. More details about the contents and use of all these files follows.<br/>\n",
"This is a development dataset. As such, it may change over time and is not an appropriate dataset for shared research results. See available benchmark datasets if that is your intent.<br/>\n",
"This and other GroupLens data sets are publicly available for download at http://grouplens.org/datasets/."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Archive: ml-latest-small.zip\n",
" creating: ml-latest-small/\n",
" inflating: ml-latest-small/links.csv \n",
" inflating: ml-latest-small/tags.csv \n",
" inflating: ml-latest-small/ratings.csv \n",
" inflating: ml-latest-small/README.txt \n",
" inflating: ml-latest-small/movies.csv \n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" % Total % Received % Xferd Average Speed Time Time Time Current\n",
" Dload Upload Total Spent Left Speed\n",
"\r",
" 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r",
" 5 955k 5 50411 0 0 68679 0 0:00:14 --:--:-- 0:00:14 68586\r",
"100 955k 100 955k 0 0 640k 0 0:00:01 0:00:01 --:--:-- 640k\n"
]
}
],
"source": [
"%%sh\n",
"DATASET=ml-latest-small\n",
"\n",
"mkdir -p data/$DATASET/\n",
"cd data/$DATASET\n",
"curl -O http://files.grouplens.org/datasets/movielens/$DATASET.zip\n",
"unzip $DATASET.zip"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prepare Data"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"data_loc = \"data/ml-latest-small/ml-latest-small/\"\n",
"negative_split = 1.5 # how many negatives for one positive\n",
"\n",
"# load interactions from MovieLens\n",
"raw_ratings = pd.read_csv(data_loc + \"ratings.csv\")\n",
"positives = pd.DataFrame(raw_ratings, columns=['userId', 'movieId'])\n",
"\n",
"# sample negatives\n",
"negatives = pd.DataFrame(columns=[\"userId\", \"movieId\"])\n",
"\n",
"while len(negatives) < len(positives) * negative_split:\n",
" user = positives[\"userId\"].sample().values[0]\n",
" movie = positives[\"movieId\"].sample().values[0]\n",
" if len(positives.loc[(positives[\"userId\"] == user) & (positives[\"movieId\"] == movie)]) == 0:\n",
" negatives = negatives.append({\"userId\": user, \"movieId\": movie}, ignore_index=True)\n",
"\n",
"# write out final data\n",
"targets = np.hstack([np.ones(len(positives)), np.zeros(len(negatives))])\n",
"all_ratings = np.vstack([positives, negatives])\n",
"\n",
"user_item_targets = np.hstack([all_ratings, targets[:, np.newaxis]])\n",
"\n",
"np.random.shuffle(user_item_targets)\n",
"\n",
"split = train_test_split(user_item_targets, train_size=0.8)\n",
"\n",
"np.savetxt(data_loc + \"sampled-train.csv\", split[0], delimiter=\",\")\n",
"np.savetxt(data_loc + \"sampled-test.csv\", split[1], delimiter=\",\")"
]
},
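{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check on the files written above, the following sketch reloads the training split and reports the share of positive samples. It assumes the previous cell has run and that `data_loc` and `negative_split` are unchanged. The expected share of `1 / (1 + negative_split)` is only approximate, because the train/test split is random."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Assumes the previous cell wrote sampled-train.csv with columns userId, movieId, target\n",
"train = np.loadtxt(data_loc + \"sampled-train.csv\", delimiter=\",\")\n",
"positive_share = train[:, 2].mean()\n",
"\n",
"print(\"training samples:\", train.shape[0])\n",
"print(\"positive share: %.3f (expected about %.3f)\" % (positive_share, 1 / (1 + negative_split)))"
]
},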
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## SystemDS NCF implementation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Train"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### with synthetic dummy data"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Using user supplied systemds jar file target/SystemDS.jar\n",
"###############################################################################\n",
"# SYSTEMDS_ROOT= .\n",
"# SYSTEMDS_JAR_FILE= target/SystemDS.jar\n",
"# CONFIG_FILE= --config ./target/testTemp/org/apache/sysds/api/mlcontext/MLContext/SystemDS-config.xml\n",
"# LOG4JPROP= -Dlog4j.configuration=file:conf/log4j-silent.properties\n",
"# CLASSPATH= target/SystemDS.jar:./lib/*:./target/lib/*\n",
"# HADOOP_HOME= /Users/patrick/Uni Offline/Architectures of Machine Learning Systems (AMLS)/systemml/target/lib/hadoop\n",
"#\n",
"# Running script scripts/nn/examples/ncf-dummy-data.dml locally with opts: \n",
"###############################################################################\n",
"Executing command: java -Xmx4g -Xms4g -Xmn400m -cp target/SystemDS.jar:./lib/*:./target/lib/* -Dlog4j.configuration=file:conf/log4j-silent.properties org.apache.sysds.api.DMLScript -f scripts/nn/examples/ncf-dummy-data.dml -exec singlenode --config ./target/testTemp/org/apache/sysds/api/mlcontext/MLContext/SystemDS-config.xml \n",
"\n",
"NCF training starting with 1000 training samples, 100 validation samples, 50 items and 60 users...\n",
"Epoch: 1, Iter: 1, Train Loss: 0.6953457411615849, Train Accuracy: 0.5, Val Loss: 0.6995101788248107, Val Accuracy: 0.47\n",
"Epoch: 2, Iter: 1, Train Loss: 0.6667911468574823, Train Accuracy: 0.6875, Val Loss: 0.6992050630414124, Val Accuracy: 0.47\n",
"Epoch: 3, Iter: 1, Train Loss: 0.6570450250431727, Train Accuracy: 0.6875, Val Loss: 0.7014387912966833, Val Accuracy: 0.47\n",
"Epoch: 4, Iter: 1, Train Loss: 0.6521926651745862, Train Accuracy: 0.6875, Val Loss: 0.7053126102214489, Val Accuracy: 0.43999999999999995\n",
"Epoch: 5, Iter: 1, Train Loss: 0.6431405119563119, Train Accuracy: 0.6875, Val Loss: 0.7115121778198469, Val Accuracy: 0.43999999999999995\n",
"Epoch: 6, Iter: 1, Train Loss: 0.6353498336109219, Train Accuracy: 0.6875, Val Loss: 0.7193490066131873, Val Accuracy: 0.44999999999999996\n",
"Epoch: 7, Iter: 1, Train Loss: 0.6308046978859394, Train Accuracy: 0.6875, Val Loss: 0.7306240107462888, Val Accuracy: 0.48\n",
"Epoch: 8, Iter: 1, Train Loss: 0.6260145322748087, Train Accuracy: 0.75, Val Loss: 0.7435853055111923, Val Accuracy: 0.49\n",
"Epoch: 9, Iter: 1, Train Loss: 0.6163475345953953, Train Accuracy: 0.6875, Val Loss: 0.757023909929672, Val Accuracy: 0.5\n",
"Epoch: 10, Iter: 1, Train Loss: 0.6029424406867099, Train Accuracy: 0.6875, Val Loss: 0.7749021987872134, Val Accuracy: 0.51\n",
"Epoch: 11, Iter: 1, Train Loss: 0.5791958103856243, Train Accuracy: 0.8125, Val Loss: 0.7921418272873325, Val Accuracy: 0.51\n",
"Epoch: 12, Iter: 1, Train Loss: 0.5543597535155846, Train Accuracy: 0.8125, Val Loss: 0.8131440342665028, Val Accuracy: 0.5\n",
"Epoch: 13, Iter: 1, Train Loss: 0.5342062981571314, Train Accuracy: 0.8125, Val Loss: 0.8340415360672659, Val Accuracy: 0.45999999999999996\n",
"Epoch: 14, Iter: 1, Train Loss: 0.5156903349054259, Train Accuracy: 0.875, Val Loss: 0.8534000391024407, Val Accuracy: 0.47\n",
"Epoch: 15, Iter: 1, Train Loss: 0.5042912981017884, Train Accuracy: 0.8125, Val Loss: 0.873901869293276, Val Accuracy: 0.44999999999999996\n",
"Epoch: 16, Iter: 1, Train Loss: 0.48722704019844537, Train Accuracy: 0.8125, Val Loss: 0.898510539121238, Val Accuracy: 0.47\n",
"Epoch: 17, Iter: 1, Train Loss: 0.47048381704431463, Train Accuracy: 0.875, Val Loss: 0.9284775525937294, Val Accuracy: 0.48\n",
"Epoch: 18, Iter: 1, Train Loss: 0.45151030675588855, Train Accuracy: 0.875, Val Loss: 0.9574504971357228, Val Accuracy: 0.47\n",
"Epoch: 19, Iter: 1, Train Loss: 0.43940495503523824, Train Accuracy: 0.875, Val Loss: 0.9937811553464448, Val Accuracy: 0.45999999999999996\n",
"Epoch: 20, Iter: 1, Train Loss: 0.42553379542786246, Train Accuracy: 0.875, Val Loss: 1.0231502880025147, Val Accuracy: 0.43999999999999995\n",
"Epoch: 21, Iter: 1, Train Loss: 0.4163223594480222, Train Accuracy: 0.875, Val Loss: 1.0595479122098816, Val Accuracy: 0.45999999999999996\n",
"Epoch: 22, Iter: 1, Train Loss: 0.4050461773338017, Train Accuracy: 0.875, Val Loss: 1.0944624240337406, Val Accuracy: 0.48\n",
"Epoch: 23, Iter: 1, Train Loss: 0.3957080838041942, Train Accuracy: 0.875, Val Loss: 1.1315613394576827, Val Accuracy: 0.47\n",
"Epoch: 24, Iter: 1, Train Loss: 0.39252816032717697, Train Accuracy: 0.8125, Val Loss: 1.1608315131205158, Val Accuracy: 0.47\n",
"Epoch: 25, Iter: 1, Train Loss: 0.38656611677400526, Train Accuracy: 0.8125, Val Loss: 1.2010764396137235, Val Accuracy: 0.45999999999999996\n",
"Epoch: 26, Iter: 1, Train Loss: 0.3910140006546419, Train Accuracy: 0.8125, Val Loss: 1.2394434665872176, Val Accuracy: 0.44999999999999996\n",
"Epoch: 27, Iter: 1, Train Loss: 0.39012809759646405, Train Accuracy: 0.8125, Val Loss: 1.267704284952889, Val Accuracy: 0.43999999999999995\n",
"Epoch: 28, Iter: 1, Train Loss: 0.3986668930898999, Train Accuracy: 0.8125, Val Loss: 1.3134788291583197, Val Accuracy: 0.44999999999999996\n",
"Epoch: 29, Iter: 1, Train Loss: 0.39096586484137014, Train Accuracy: 0.8125, Val Loss: 1.3457368548231847, Val Accuracy: 0.44999999999999996\n",
"Epoch: 30, Iter: 1, Train Loss: 0.3913665786483714, Train Accuracy: 0.8125, Val Loss: 1.395200160764677, Val Accuracy: 0.44999999999999996\n",
"Epoch: 31, Iter: 1, Train Loss: 0.39306020872450564, Train Accuracy: 0.8125, Val Loss: 1.4547617764166234, Val Accuracy: 0.44999999999999996\n",
"Epoch: 32, Iter: 1, Train Loss: 0.3961123079325197, Train Accuracy: 0.8125, Val Loss: 1.4988918781732432, Val Accuracy: 0.45999999999999996\n",
"Epoch: 33, Iter: 1, Train Loss: 0.39167597788728836, Train Accuracy: 0.875, Val Loss: 1.5580225154760752, Val Accuracy: 0.44999999999999996\n",
"Epoch: 34, Iter: 1, Train Loss: 0.3936826951721131, Train Accuracy: 0.875, Val Loss: 1.592168642509798, Val Accuracy: 0.43999999999999995\n",
"Epoch: 35, Iter: 1, Train Loss: 0.39446093556125095, Train Accuracy: 0.8125, Val Loss: 1.6504423270813886, Val Accuracy: 0.43000000000000005\n",
"Epoch: 36, Iter: 1, Train Loss: 0.3917767876760818, Train Accuracy: 0.8125, Val Loss: 1.6894229810333048, Val Accuracy: 0.43000000000000005\n",
"Epoch: 37, Iter: 1, Train Loss: 0.3936299068718723, Train Accuracy: 0.8125, Val Loss: 1.7342536990495687, Val Accuracy: 0.43000000000000005\n",
"Epoch: 38, Iter: 1, Train Loss: 0.4086856463043926, Train Accuracy: 0.8125, Val Loss: 1.7709575584324264, Val Accuracy: 0.44999999999999996\n",
"Epoch: 39, Iter: 1, Train Loss: 0.3946728895715752, Train Accuracy: 0.8125, Val Loss: 1.8323990419424212, Val Accuracy: 0.43000000000000005\n",
"Epoch: 40, Iter: 1, Train Loss: 0.4092882424416999, Train Accuracy: 0.8125, Val Loss: 1.8647938002160964, Val Accuracy: 0.44999999999999996\n",
"Epoch: 41, Iter: 1, Train Loss: 0.4050641439255627, Train Accuracy: 0.8125, Val Loss: 1.891264442380163, Val Accuracy: 0.44999999999999996\n",
"Epoch: 42, Iter: 1, Train Loss: 0.4170644006779869, Train Accuracy: 0.8125, Val Loss: 1.9423174900115594, Val Accuracy: 0.44999999999999996\n",
"Epoch: 43, Iter: 1, Train Loss: 0.3923480753991977, Train Accuracy: 0.8125, Val Loss: 1.9731695043639572, Val Accuracy: 0.44999999999999996\n",
"Epoch: 44, Iter: 1, Train Loss: 0.40490676281916327, Train Accuracy: 0.8125, Val Loss: 2.010804834458905, Val Accuracy: 0.44999999999999996\n",
"Epoch: 45, Iter: 1, Train Loss: 0.40181821707001014, Train Accuracy: 0.8125, Val Loss: 2.051962004205519, Val Accuracy: 0.44999999999999996\n",
"Epoch: 46, Iter: 1, Train Loss: 0.40355348381441153, Train Accuracy: 0.8125, Val Loss: 2.0891022279849456, Val Accuracy: 0.44999999999999996\n",
"Epoch: 47, Iter: 1, Train Loss: 0.38715605504077866, Train Accuracy: 0.8125, Val Loss: 2.117280026954698, Val Accuracy: 0.44999999999999996\n",
"Epoch: 48, Iter: 1, Train Loss: 0.39836973023268446, Train Accuracy: 0.8125, Val Loss: 2.141835697116999, Val Accuracy: 0.43999999999999995\n",
"Epoch: 49, Iter: 1, Train Loss: 0.3901144594871556, Train Accuracy: 0.8125, Val Loss: 2.176511579483428, Val Accuracy: 0.43999999999999995\n",
"Epoch: 50, Iter: 1, Train Loss: 0.3917649057215277, Train Accuracy: 0.8125, Val Loss: 2.2288326304130806, Val Accuracy: 0.43999999999999995\n",
"NCF training completed after 50 epochs\n",
"SystemDS Statistics:\n",
"Total execution time:\t\t9.206 sec.\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"20/05/28 15:04:03 INFO api.DMLScript: BEGIN DML run 05/28/2020 15:04:03\n",
"20/05/28 15:04:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n",
"20/05/28 15:04:13 INFO api.DMLScript: END DML run 05/28/2020 15:04:13\n"
]
}
],
"source": [
"%%bash\n",
"cd ../../..\n",
"bin/systemds target/SystemDS.jar scripts/nn/examples/ncf-dummy-data.dml > scripts/nn/examples/run_log.txt && cat scripts/nn/examples/run_log.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### with real data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"cd ../../..\n",
"bin/systemds target/SystemDS.jar scripts/nn/examples/ncf-real-data.dml > scripts/nn/examples/run_log.txt && cat scripts/nn/examples/run_log.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" ### Plot training results"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"log_name = \"run_log\"\n",
"txt_name = log_name + \".txt\"\n",
"csv_name = log_name + \".csv\"\n",
"\n",
"# convert to CSV\n",
"with open(txt_name, \"r\") as txt_file:\n",
" data = txt_file.readlines()\n",
" csv_lines = list(map(lambda x: x.replace(\"Epoch: \", \"\")\n",
" .replace(\", Iter: \", \",\")\n",
" .replace(\", Train Loss: \", \",\")\n",
" .replace(\", Train Accuracy: \", \",\")\n",
" .replace(\", Val Loss: \", \",\")\n",
" .replace(\", Val Accuracy: \", \",\"),\n",
" filter(lambda x: \"Epoch: \" in x, data)))\n",
" with open(csv_name, \"w\") as csv_file:\n",
" csv_file.write(\"epoch,iter,train_loss,train_acc,val_loss,val_acc\\n\")\n",
" for item in csv_lines:\n",
" csv_file.write(\"%s\" % item)\n",
"\n",
"# plot\n",
"log = pd.read_csv(csv_name)\n",
"plot_log = log[log[\"iter\"] == 1]\n",
"\n",
"for val in [\"train_loss\", \"train_acc\", \"val_loss\", \"val_acc\"]:\n",
" plt.plot(plot_log[\"epoch\"], plot_log[val], label=val)\n",
"\n",
"plt.legend()\n",
"plt.show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}