public-engines/sms-spam-engine/notebooks/.ipynb_checkpoints/kaggle_solution-checkpoint.ipynb - incubator-marvin - Git at Google

 {
  "cells": [
   {
    "cell_type": "markdown",
    "metadata": {
     "_cell_guid": "0e236465-31d2-7cb7-e30c-534829e74e67",
     "_uuid": "2f70b948a40cdd1c0509f11a31b7a3efedaa04c3"
    },
    "source": [
     "# 1. Import libraries"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "8dc9f4a5-a803-262d-0a35-ec5b91f72e43",
     "_uuid": "5ab23e26dcda47d655562a8b95d5a01385140033",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "import pandas as pd\n",
     "import numpy as np\n",
     "import matplotlib.pyplot as plt\n",
     "import seaborn as sns\n",
     "%matplotlib inline\n",
     "import warnings\n",
     "warnings.filterwarnings('ignore')\n",
     "%config InlineBackend.figure_format = 'retina'"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "_cell_guid": "81b672d5-0878-f7da-0df3-ee123232c67d",
     "_uuid": "e157b186bbe4993607de3af64274eb85025028a5"
    },
    "source": [
     "# 2. Import data"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "add7a891-d611-71ac-3281-dfa60de1f2dc",
     "_uuid": "f3f86ca9e6d8058f9796ea523aad814277c7fc21",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "data = pd.read_csv(\"../input/spam.csv\",encoding='latin-1')"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "6c93ebf5-0a0e-b9b6-6796-204c159ba70b",
     "_uuid": "b8e4ed78bcf8b52dcf69d500f9d2da5255cc8d3c",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "data.head()"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "_cell_guid": "871e252b-7ffb-90c4-269c-d4e7bbf035ad",
     "_uuid": "0ab087b694383f25aae4ffc45c2cbbba96535adf"
    },
    "source": [
     "Let's drop the unwanted columns, and rename the column name appropriately."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "e2bd3a08-1baf-a5fe-6972-28aafe69cf9f",
     "_uuid": "54cba7d73b971a7ec6c784b56f4f6a450b839e3d",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "#Drop column and name change\n",
     "data = data.drop([\"Unnamed: 2\", \"Unnamed: 3\", \"Unnamed: 4\"], axis=1)\n",
     "data = data.rename(columns={\"v1\":\"label\", \"v2\":\"text\"})"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "6daf143d-4159-7400-5d87-a19915a17c94",
     "_uuid": "68ccb0599f9e1bd83db3a5f58c19d43b9f24473e",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "data.tail()"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "9847a57c-6fd9-2399-0cdf-bd4825784c50",
     "_uuid": "13ccb713737b352e106221b44509dfa62cb41c1b",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "#Count observations in each label\n",
     "data.label.value_counts()"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "6255fc1c-d6d2-b010-ca60-48ec68e99b57",
     "_uuid": "28aed93db300c2198d1b3ea440fc3a2f333f71f3",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "# convert label to a numerical variable\n",
     "data['label_num'] = data.label.map({'ham':0, 'spam':1})"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "46af4c23-1576-a0c8-13bd-e0bf1e9af787",
     "_uuid": "398b0cc4f42b561940f9a80bb659d657ab870c02",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "data.head()"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "_cell_guid": "2317c5a3-550f-9298-2531-341c8b2c02fb",
     "_uuid": "146531a417febe4f37838cffa0cf3ec047b5ade7"
    },
    "source": [
     "# 3. Train Test Split\n",
     "Before performing text transformation, let us do train test split. Infact, we can perform k-Fold cross validation. However, due to simplicity, I am doing train test split."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "6965c489-3bff-84ff-33ba-e207c33ab672",
     "_uuid": "587d93e28457b4272f4923900af5dfca28a8c25c",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "from sklearn.model_selection import train_test_split"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "fc78d145-1f91-8f53-38e7-f47d9eacdfea",
     "_uuid": "c56deb78dfaf73b57c33c2ccd959986f5d85b5e8",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "X_train,X_test,y_train,y_test = train_test_split(data[\"text\"],data[\"label\"], test_size = 0.2, random_state = 10)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "390e158f-ded7-eaf9-36e2-7846b94c61e1",
     "_uuid": "f1a7ae3f590e8364cfcd13566caa5b4b34e30d80",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "print(X_train.shape)\n",
     "print(X_test.shape)\n",
     "print(y_train.shape)\n",
     "print(y_test.shape)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "_cell_guid": "b63d1a80-fd4f-1a28-f3bf-d81749997e65",
     "_uuid": "e5c8a80fe3ea9789a20d9e9f8300fc9cd33ec09e"
    },
    "source": [
     "# 4.Text Transformation\n",
     "Various text transformation techniques such as stop word removal, lowering the texts, tfidf transformations, prunning, stemming can be performed using sklearn.feature_extraction libraries. Then, the data can be convereted into bag-of-words. <br> <br>\n",
     "For this problem, Let us see how our model performs without removing stop words."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "ee2c40c7-3705-9c62-9053-f87334019640",
     "_uuid": "f1130c8eb1319ed5079b35edc554fae433065c24",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "from sklearn.feature_extraction.text import CountVectorizer"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "8a8b7de5-e44e-83f9-3a00-b5e1c77aeb35",
     "_uuid": "a7a60123c2f38a21bb390504ace62ec9e999e419",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "vect = CountVectorizer()"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "_cell_guid": "744e6470-7e8f-7a6d-3ca1-f79729d109c2",
     "_uuid": "949cb9da8b209e54c84832b65f0c269c7616b908"
    },
    "source": [
     "Note : We can also perform tfidf transformation."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "4ed91daa-6e64-1334-7892-c53fe0cd5cde",
     "_uuid": "762d74b7a38c6aefde4f96b121a64b70bcda7b63",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "vect.fit(X_train)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "_cell_guid": "0a60d23b-3411-17db-b81f-573de71a3195",
     "_uuid": "4bf7198edef13a7d44f184f17eda7472d6263837"
    },
    "source": [
     "vect.fit function learns the vocabulary. We can get all the feature names from vect.get_feature_names( ). <br> <br> Let us print first and last twenty features"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "f594890e-494b-ba8c-bc63-9464ea73c2ba",
     "_uuid": "beec5fe5b93defb25bdf0d4613eeebf467c30cc2",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "print(vect.get_feature_names()[0:20])\n",
     "print(vect.get_feature_names()[-20:])"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "21355ab0-b085-5828-d232-702112f05257",
     "_uuid": "80720038b2da42d799b9a7eac6ab7a38814c2db5",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "X_train_df = vect.transform(X_train)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "_cell_guid": "8d814e33-4b00-4162-5910-860cb80a4b73",
     "_uuid": "6012f57d63853e11b5190d76f651263216644dc8"
    },
    "source": [
     "Now, let's transform the Test data."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "72255bad-b2de-8e61-a9d1-fb30a471316c",
     "_uuid": "f21218291d18ce7a8a82c06a5bc83e5cfab751ff",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "X_test_df = vect.transform(X_test)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "21e70bf5-212f-f0cc-55f0-5ebf3002b52c",
     "_uuid": "a21ed2dbd7b904992ce277fb83fa6c17614f1350",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "type(X_test_df)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "_cell_guid": "f4579653-146b-d8a3-2020-ab5c2352ee43",
     "_uuid": "092c5771d38252190a72daa384d58e8351234252"
    },
    "source": [
     "# 5. Visualisations "
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "b259ad8d-3f2c-6b73-e3d1-dccc7a7c3cbc",
     "_uuid": "83f5a78bbc4660052522697539d666e2b1adb8d8",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "ham_words = ''\n",
     "spam_words = ''\n",
     "spam = data[data.label_num == 1]\n",
     "ham = data[data.label_num ==0]"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "09abe602-9963-208a-5e02-445556c69cd1",
     "_uuid": "07d085003ff7f43fc95420aca6a6ff8e673469ce",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "import nltk\n",
     "from nltk.corpus import stopwords"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "9d866a8d-5dcd-27ac-e9cf-6c6344922bb5",
     "_uuid": "a0952347a50be6c72cf9a1ebd00336e353babeca",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "for val in spam.text:\n",
     "    text = val.lower()\n",
     "    tokens = nltk.word_tokenize(text)\n",
     "    #tokens = [word for word in tokens if word not in stopwords.words('english')]\n",
     "    for words in tokens:\n",
     "        spam_words = spam_words + words + ' '\n",
     "        \n",
     "for val in ham.text:\n",
     "    text = val.lower()\n",
     "    tokens = nltk.word_tokenize(text)\n",
     "    for words in tokens:\n",
     "        ham_words = ham_words + words + ' '"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "4671e710-7166-7137-e9dc-a93812f8c3e6",
     "_uuid": "95db20d73c184cba527e9172abe363b356f9b854",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "from wordcloud import WordCloud"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "72c88a67-41ab-8d3f-e467-d3f1bb1251cd",
     "_uuid": "c996a2412ce0c5bdb56ce20556c1748a0d183a54",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "# Generate a word cloud image\n",
     "spam_wordcloud = WordCloud(width=600, height=400).generate(spam_words)\n",
     "ham_wordcloud = WordCloud(width=600, height=400).generate(ham_words)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "5c3b5411-d662-7715-2840-d3305ae129b5",
     "_uuid": "1435692a6ffe5762e268e08c8b696927a22b6184",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "#Spam Word cloud\n",
     "plt.figure( figsize=(10,8), facecolor='k')\n",
     "plt.imshow(spam_wordcloud)\n",
     "plt.axis(\"off\")\n",
     "plt.tight_layout(pad=0)\n",
     "plt.show()"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "9f90f7e2-f4b4-3509-6866-3ce7e9b856db",
     "_uuid": "0d6f1f6e663c1359fb336f5755ce4983b6a0a356",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "#Ham word cloud\n",
     "plt.figure( figsize=(10,8), facecolor='k')\n",
     "plt.imshow(ham_wordcloud)\n",
     "plt.axis(\"off\")\n",
     "plt.tight_layout(pad=0)\n",
     "plt.show()"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "_cell_guid": "e879d59e-df84-8de5-51f4-7bb2c01cba6c",
     "_uuid": "b4cc63a28db28ea87b028196523f6e582440111d"
    },
    "source": [
     "# 6. Machine Learning models:"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "_cell_guid": "8f9074dc-58bb-e988-4e73-a08acf80f37e",
     "_uuid": "9eff04eb8124668b687dfb293f3231e4f22ce687"
    },
    "source": [
     "### 6.1 Multinomial Naive Bayes\n",
     "Generally, Naive Bayes works well on text data. Multinomail Naive bayes is best suited for classification with discrete features. "
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "aadc8cdb-2abd-689d-2b2c-59a9a7933b75",
     "_uuid": "80068e9d09410a608151f2c546dc42e50c543e4e",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "prediction = dict()\n",
     "from sklearn.naive_bayes import MultinomialNB\n",
     "model = MultinomialNB()\n",
     "model.fit(X_train_df,y_train)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "9522cfb7-c336-56a6-d792-857f9c8f959d",
     "_uuid": "91861728aeb7fa76a09916fa10c470921c537a18",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "prediction[\"Multinomial\"] = model.predict(X_test_df)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "2afd9f38-ad59-7883-0d90-cbc04b76026b",
     "_uuid": "e38f8d32877bc097f7124433410b2d095b465c69",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "from sklearn.metrics import accuracy_score,confusion_matrix,classification_report"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "fb5f8959-c147-8279-7046-f09fbd1576ca",
     "_uuid": "ffecdca5e8bc7101822a643058e7ac32a4d3f2fc",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "accuracy_score(y_test,prediction[\"Multinomial\"])"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "_cell_guid": "d149acba-bd73-7ed3-f121-ab8c113401b5",
     "_uuid": "cd37b86161040f1a57b9b2e4ce1697d2b8957717"
    },
    "source": [
     "### 6.2 Logistic Regression"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "e0111bd9-eda4-790a-9bf7-a47b0ce80804",
     "_uuid": "19c739ca58c7db3dd4ce1eede978f06e6c1f510f",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "from sklearn.linear_model import LogisticRegression\n",
     "model = LogisticRegression()\n",
     "model.fit(X_train_df,y_train)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "094c403c-8fc4-5358-7a3b-b1b7a7428e72",
     "_uuid": "1f45c83eedc578d35e3032c5838f639c62f0885d",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "prediction[\"Logistic\"] = model.predict(X_test_df)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "fa7cc0bd-8a38-d560-37bd-c5fcbd2e4596",
     "_uuid": "c9cfc924232273a227d239eb04b2069c02f988af",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "accuracy_score(y_test,prediction[\"Logistic\"])"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "_cell_guid": "9d14463a-65b4-c019-82b8-d055a942feef",
     "_uuid": "7ac7398d59923ad8aea47446c60a94eb153fc6ff"
    },
    "source": [
     "### 6.3 $k$-NN classifier"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "01cbf471-292c-cb94-cfe0-96cd0c44a6e3",
     "_uuid": "bbae3006a875ee5abb9ad608a5b80641c5cba5aa",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "from sklearn.neighbors import KNeighborsClassifier\n",
     "model = KNeighborsClassifier(n_neighbors=5)\n",
     "model.fit(X_train_df,y_train)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "ff308d10-f0aa-f829-d072-b94c4bd58849",
     "_uuid": "87d58a5b91db55f5068d94f90bae200b69497c00",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "prediction[\"knn\"] = model.predict(X_test_df)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "d71f742a-753c-3b92-18a9-f12487d8b3e3",
     "_uuid": "c2c1614ddf62a888353c4a7a1f70639e9dd10555",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "accuracy_score(y_test,prediction[\"knn\"])"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "_cell_guid": "507f25c8-331c-b20e-e8a1-502681b2367a",
     "_uuid": "c4f3bb84f3b9e78f292e0a54da17bb26866c5453"
    },
    "source": [
     "### 6.4 Ensemble classifier"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "23d25098-f81d-1e00-1cf2-3f93d2bee68a",
     "_uuid": "cb5b0203a75af5cf40e118759c3aa124f3d3f459",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "from sklearn.ensemble import RandomForestClassifier\n",
     "model = RandomForestClassifier()\n",
     "model.fit(X_train_df,y_train)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "a968555e-a4e5-00af-453d-484424e15cda",
     "_uuid": "edbdb0ab03ee313d56e315b11fe9fde53c5cc524",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "prediction[\"random_forest\"] = model.predict(X_test_df)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "047d881b-31b3-f454-9bac-ca9a1c8b21dc",
     "_uuid": "17d6302e0690fdd20f6f322a6cb76909bf2501b9",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "accuracy_score(y_test,prediction[\"random_forest\"])"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "a81dac84-49ff-1059-8575-d86d20535042",
     "_uuid": "eb1ad5de054c7bbfe84376243230b4812311898e",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "from sklearn.ensemble import AdaBoostClassifier\n",
     "model = AdaBoostClassifier()\n",
     "model.fit(X_train_df,y_train)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "95d4ca05-b491-a494-9270-7f649bc1576f",
     "_uuid": "18356400fa9beb7930f966862207f6b2b25dce1b",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "prediction[\"adaboost\"] = model.predict(X_test_df)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "6ab20ad0-5e7f-6036-587d-14d29464dba1",
     "_uuid": "c651ee1d1f6ca35979e7c7257f91dc9580f98c36",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "accuracy_score(y_test,prediction[\"adaboost\"])"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "_cell_guid": "a5b41c54-202e-ee82-46c8-7062ec8f39f6",
     "_uuid": "6f9c702a4ca8bea2116879ba95f98d12ea375bce"
    },
    "source": [
     "# 7. Parameter Tuning using GridSearchCV"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "_cell_guid": "f319b02c-d073-2782-a781-db4bd5bbac62",
     "_uuid": "82f676629eaa017ceca8403a4f59e5e29c4bf4bd"
    },
    "source": [
     "Based, on the above four ML models, Naive Bayes has given the best accuracy. However, Let's try to tune the parameters of $k$-NN using GridSearchCV"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "5da3218e-9506-59ea-0c54-c4c179651c84",
     "_uuid": "cafe0cdc50af08e2355fbef315243f0c1738268e",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "from sklearn.model_selection import GridSearchCV"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "940406a0-5e5e-3b16-975e-860762606b50",
     "_uuid": "e6bf5ae4e0d7724521273f420935e590a8e1309e",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "k_range = np.arange(1,30)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "b343564d-17a7-933b-b571-b3985f7243f7",
     "_uuid": "2582e2f4bcc415d26b6658ea0bfec47dc9820774",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "k_range"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "44d33a5a-d0cf-1bcf-e9ed-721e419ebad7",
     "_uuid": "0f2c9349af2195a7cb188e11e49b659297ce8e8f",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "param_grid = dict(n_neighbors=k_range)\n",
     "print(param_grid)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "cd25df1b-55fd-e5e6-b955-8cf8e6120082",
     "_uuid": "a83be47466de6c0b7d92cfa75ebb2f37fbf4e488",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "model = KNeighborsClassifier()\n",
     "grid = GridSearchCV(model,param_grid)\n",
     "grid.fit(X_train_df,y_train)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "873d5aa6-4a01-c581-7f0b-231cd3d9cd10",
     "_uuid": "7c034512f2f49d084588e3010673707d9ca63c46",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "grid.best_estimator_"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "443e4e3b-fb0b-0222-aab7-1b3c58b41431",
     "_uuid": "796568a1153e5373d3c152b208f83ce793e23d4e",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "grid.best_params_"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "18b22a0e-020a-3433-651b-abb10e11b025",
     "_uuid": "a9eb14f34dad56f4e0ea3b918f8dd3a612cc7caa",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "grid.best_score_"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "cec407ca-e4ec-b0d2-015a-d2fe9ab3a5dd",
     "_uuid": "f17d12e12fbcbf82e9ce8d6b38bab05a09abb16c",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "grid.grid_scores_"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "_cell_guid": "7ed75f56-aa3b-0bf0-76e7-1eaea1cc1e6a",
     "_uuid": "9495cbc4b58574e79af940024d5d8c8b8b2e9f79"
    },
    "source": [
     "# 8. Model Evaluation"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "ec8655f6-af1f-bedf-2a7b-bceb1475da94",
     "_uuid": "26679dca236c4c51596f1fa3dcce210d693cc6fc",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "print(classification_report(y_test, prediction['Multinomial'], target_names = [\"Ham\", \"Spam\"]))"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "1add69ef-9258-6fd0-696a-cb41b8dafd15",
     "_uuid": "fd413c8606a62bb7f3e05b1b634ef33608e3cc01",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "conf_mat = confusion_matrix(y_test, prediction['Multinomial'])\n",
     "conf_mat_normalized = conf_mat.astype('float') / conf_mat.sum(axis=1)[:, np.newaxis]"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "16646402-2915-7e85-f5ac-2c48ad83b056",
     "_uuid": "b5bb0af95b9002ea4930987af46f2e357101b8dd",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "sns.heatmap(conf_mat_normalized)\n",
     "plt.ylabel('True label')\n",
     "plt.xlabel('Predicted label')"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "_cell_guid": "c25278ec-d981-14f7-cc4e-c5e7c4f9f150",
     "_uuid": "ee88015ea8de01b8456f99fd7bc8ae7c4dafc09d"
    },
    "source": [
     "# 9. Future works"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "639ac90d-dd9d-fbae-a4af-c34c7f6185a7",
     "_uuid": "5c2c416fbf21e7aa0ac7a55352ebd6147fd30be9",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "print(conf_mat)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "_cell_guid": "95893968-407f-da85-dd49-9c6d2009f8d0",
     "_uuid": "2a5c9244a0196cc6fd5a13eca80ec3e9e940274e"
    },
    "source": [
     "By seeing the above confusion matrix, it is clear that 5 Ham are mis classified as Spam, and 8 Spam are misclassified as Ham. Let'see what are those misclassified text messages. Looking those messages may help us to come up with more advanced feature engineering."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "4e8e1fd3-718d-77bf-57f6-4773530a43ce",
     "_uuid": "46644025c8daf2b8a4ca37dd8175cd2ae8f44e18",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "pd.set_option('display.max_colwidth', -1)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "_cell_guid": "07206435-e5c8-4baa-a27e-de171143867e",
     "_uuid": "ead1dbbc7748afb0976946fd26c3263d8136334a"
    },
    "source": [
     "I increased the pandas dataframe width to display the misclassified texts in full width. "
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "_cell_guid": "71d65dcf-1ca7-e954-9f19-c75b227e5266",
     "_uuid": "f0ac9c328da1e382535e31169f8f7da5a0135aa2"
    },
    "source": [
     "### 9.1 Misclassified as Spam"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "909266b9-0610-bd47-6163-f59586ef7f99",
     "_uuid": "44122bdd296504c391690644db0de23c26c992be",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "X_test[y_test < prediction[\"Multinomial\"] ]"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "_cell_guid": "dcea4ad9-894b-794c-897c-b34ff64cdb9d",
     "_uuid": "d0c3aaafb48fbfcc369211b340bd85b7cda3f954"
    },
    "source": [
     "### 9.2 Misclassfied as Ham"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "9a8ea5f0-c67c-6864-7407-fd8a24985793",
     "_uuid": "d97d93d90003beceba56b2023043dd5828944844",
     "collapsed": true
    },
    "outputs": [],
    "source": [
     "X_test[y_test > prediction[\"Multinomial\"] ]"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
     "_cell_guid": "67ccd623-3ff8-0b0b-0c58-561534d4e336",
     "_uuid": "1c8da036b1e7b1ae6501148d2e4e1cac22b7f5d3"
    },
    "source": [
     "It seems length of the spam text is much higher than the ham. Maybe we can include length as a feature.  In addition to unigram, we can also try bigram features. "
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "_cell_guid": "c9728d83-d619-8663-31a7-af7a45ad6a3e",
     "_uuid": "5e40a5ad87d67facdf6d15d5b4f99b936984bb73",
     "collapsed": true
    },
    "outputs": [],
    "source": []
   }
  ],
  "metadata": {
   "_change_revision": 0,
   "_is_fork": false,
   "kernelspec": {
    "display_name": "Python 3",
    "language": "python",
    "name": "python3"
   },
   "language_info": {
    "codemirror_mode": {
     "name": "ipython",
     "version": 3
    },
    "file_extension": ".py",
    "mimetype": "text/x-python",
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
    "version": "3.6.4"
   }
  },
  "nbformat": 4,
  "nbformat_minor": 1
 }