| { |
| "cells": [ |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "_cell_guid": "0e236465-31d2-7cb7-e30c-534829e74e67", |
| "_uuid": "2f70b948a40cdd1c0509f11a31b7a3efedaa04c3" |
| }, |
| "source": [ |
| "# 1. Import libraries" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "8dc9f4a5-a803-262d-0a35-ec5b91f72e43", |
| "_uuid": "5ab23e26dcda47d655562a8b95d5a01385140033", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "import pandas as pd\n", |
| "import numpy as np\n", |
| "import matplotlib.pyplot as plt\n", |
| "import seaborn as sns\n", |
| "%matplotlib inline\n", |
| "import warnings\n", |
| "warnings.filterwarnings('ignore')\n", |
| "%config InlineBackend.figure_format = 'retina'" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "_cell_guid": "81b672d5-0878-f7da-0df3-ee123232c67d", |
| "_uuid": "e157b186bbe4993607de3af64274eb85025028a5" |
| }, |
| "source": [ |
| "# 2. Import data" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "add7a891-d611-71ac-3281-dfa60de1f2dc", |
| "_uuid": "f3f86ca9e6d8058f9796ea523aad814277c7fc21", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "data = pd.read_csv(\"../input/spam.csv\",encoding='latin-1')" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "6c93ebf5-0a0e-b9b6-6796-204c159ba70b", |
| "_uuid": "b8e4ed78bcf8b52dcf69d500f9d2da5255cc8d3c", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "data.head()" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "_cell_guid": "871e252b-7ffb-90c4-269c-d4e7bbf035ad", |
| "_uuid": "0ab087b694383f25aae4ffc45c2cbbba96535adf" |
| }, |
| "source": [ |
| "Let's drop the unwanted columns, and rename the column name appropriately." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "e2bd3a08-1baf-a5fe-6972-28aafe69cf9f", |
| "_uuid": "54cba7d73b971a7ec6c784b56f4f6a450b839e3d", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "#Drop column and name change\n", |
| "data = data.drop([\"Unnamed: 2\", \"Unnamed: 3\", \"Unnamed: 4\"], axis=1)\n", |
| "data = data.rename(columns={\"v1\":\"label\", \"v2\":\"text\"})" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "6daf143d-4159-7400-5d87-a19915a17c94", |
| "_uuid": "68ccb0599f9e1bd83db3a5f58c19d43b9f24473e", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "data.tail()" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "9847a57c-6fd9-2399-0cdf-bd4825784c50", |
| "_uuid": "13ccb713737b352e106221b44509dfa62cb41c1b", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "#Count observations in each label\n", |
| "data.label.value_counts()" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "6255fc1c-d6d2-b010-ca60-48ec68e99b57", |
| "_uuid": "28aed93db300c2198d1b3ea440fc3a2f333f71f3", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "# convert label to a numerical variable\n", |
| "data['label_num'] = data.label.map({'ham':0, 'spam':1})" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "46af4c23-1576-a0c8-13bd-e0bf1e9af787", |
| "_uuid": "398b0cc4f42b561940f9a80bb659d657ab870c02", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "data.head()" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "_cell_guid": "2317c5a3-550f-9298-2531-341c8b2c02fb", |
| "_uuid": "146531a417febe4f37838cffa0cf3ec047b5ade7" |
| }, |
| "source": [ |
| "# 3. Train Test Split\n", |
| "Before performing text transformation, let us do train test split. Infact, we can perform k-Fold cross validation. However, due to simplicity, I am doing train test split." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "6965c489-3bff-84ff-33ba-e207c33ab672", |
| "_uuid": "587d93e28457b4272f4923900af5dfca28a8c25c", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "from sklearn.model_selection import train_test_split" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "fc78d145-1f91-8f53-38e7-f47d9eacdfea", |
| "_uuid": "c56deb78dfaf73b57c33c2ccd959986f5d85b5e8", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "X_train,X_test,y_train,y_test = train_test_split(data[\"text\"],data[\"label\"], test_size = 0.2, random_state = 10)" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "390e158f-ded7-eaf9-36e2-7846b94c61e1", |
| "_uuid": "f1a7ae3f590e8364cfcd13566caa5b4b34e30d80", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "print(X_train.shape)\n", |
| "print(X_test.shape)\n", |
| "print(y_train.shape)\n", |
| "print(y_test.shape)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "_cell_guid": "b63d1a80-fd4f-1a28-f3bf-d81749997e65", |
| "_uuid": "e5c8a80fe3ea9789a20d9e9f8300fc9cd33ec09e" |
| }, |
| "source": [ |
| "# 4.Text Transformation\n", |
| "Various text transformation techniques such as stop word removal, lowering the texts, tfidf transformations, prunning, stemming can be performed using sklearn.feature_extraction libraries. Then, the data can be convereted into bag-of-words. <br> <br>\n", |
| "For this problem, Let us see how our model performs without removing stop words." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "ee2c40c7-3705-9c62-9053-f87334019640", |
| "_uuid": "f1130c8eb1319ed5079b35edc554fae433065c24", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "from sklearn.feature_extraction.text import CountVectorizer" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "8a8b7de5-e44e-83f9-3a00-b5e1c77aeb35", |
| "_uuid": "a7a60123c2f38a21bb390504ace62ec9e999e419", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "vect = CountVectorizer()" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "_cell_guid": "744e6470-7e8f-7a6d-3ca1-f79729d109c2", |
| "_uuid": "949cb9da8b209e54c84832b65f0c269c7616b908" |
| }, |
| "source": [ |
| "Note : We can also perform tfidf transformation." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "4ed91daa-6e64-1334-7892-c53fe0cd5cde", |
| "_uuid": "762d74b7a38c6aefde4f96b121a64b70bcda7b63", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "vect.fit(X_train)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "_cell_guid": "0a60d23b-3411-17db-b81f-573de71a3195", |
| "_uuid": "4bf7198edef13a7d44f184f17eda7472d6263837" |
| }, |
| "source": [ |
| "vect.fit function learns the vocabulary. We can get all the feature names from vect.get_feature_names( ). <br> <br> Let us print first and last twenty features" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "f594890e-494b-ba8c-bc63-9464ea73c2ba", |
| "_uuid": "beec5fe5b93defb25bdf0d4613eeebf467c30cc2", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "print(vect.get_feature_names()[0:20])\n", |
| "print(vect.get_feature_names()[-20:])" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "21355ab0-b085-5828-d232-702112f05257", |
| "_uuid": "80720038b2da42d799b9a7eac6ab7a38814c2db5", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "X_train_df = vect.transform(X_train)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "_cell_guid": "8d814e33-4b00-4162-5910-860cb80a4b73", |
| "_uuid": "6012f57d63853e11b5190d76f651263216644dc8" |
| }, |
| "source": [ |
| "Now, let's transform the Test data." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "72255bad-b2de-8e61-a9d1-fb30a471316c", |
| "_uuid": "f21218291d18ce7a8a82c06a5bc83e5cfab751ff", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "X_test_df = vect.transform(X_test)" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "21e70bf5-212f-f0cc-55f0-5ebf3002b52c", |
| "_uuid": "a21ed2dbd7b904992ce277fb83fa6c17614f1350", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "type(X_test_df)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "_cell_guid": "f4579653-146b-d8a3-2020-ab5c2352ee43", |
| "_uuid": "092c5771d38252190a72daa384d58e8351234252" |
| }, |
| "source": [ |
| "# 5. Visualisations " |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "b259ad8d-3f2c-6b73-e3d1-dccc7a7c3cbc", |
| "_uuid": "83f5a78bbc4660052522697539d666e2b1adb8d8", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "ham_words = ''\n", |
| "spam_words = ''\n", |
| "spam = data[data.label_num == 1]\n", |
| "ham = data[data.label_num ==0]" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "09abe602-9963-208a-5e02-445556c69cd1", |
| "_uuid": "07d085003ff7f43fc95420aca6a6ff8e673469ce", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "import nltk\n", |
| "from nltk.corpus import stopwords" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "9d866a8d-5dcd-27ac-e9cf-6c6344922bb5", |
| "_uuid": "a0952347a50be6c72cf9a1ebd00336e353babeca", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "for val in spam.text:\n", |
| " text = val.lower()\n", |
| " tokens = nltk.word_tokenize(text)\n", |
| " #tokens = [word for word in tokens if word not in stopwords.words('english')]\n", |
| " for words in tokens:\n", |
| " spam_words = spam_words + words + ' '\n", |
| " \n", |
| "for val in ham.text:\n", |
| " text = val.lower()\n", |
| " tokens = nltk.word_tokenize(text)\n", |
| " for words in tokens:\n", |
| " ham_words = ham_words + words + ' '" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "4671e710-7166-7137-e9dc-a93812f8c3e6", |
| "_uuid": "95db20d73c184cba527e9172abe363b356f9b854", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "from wordcloud import WordCloud" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "72c88a67-41ab-8d3f-e467-d3f1bb1251cd", |
| "_uuid": "c996a2412ce0c5bdb56ce20556c1748a0d183a54", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "# Generate a word cloud image\n", |
| "spam_wordcloud = WordCloud(width=600, height=400).generate(spam_words)\n", |
| "ham_wordcloud = WordCloud(width=600, height=400).generate(ham_words)" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "5c3b5411-d662-7715-2840-d3305ae129b5", |
| "_uuid": "1435692a6ffe5762e268e08c8b696927a22b6184", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "#Spam Word cloud\n", |
| "plt.figure( figsize=(10,8), facecolor='k')\n", |
| "plt.imshow(spam_wordcloud)\n", |
| "plt.axis(\"off\")\n", |
| "plt.tight_layout(pad=0)\n", |
| "plt.show()" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "9f90f7e2-f4b4-3509-6866-3ce7e9b856db", |
| "_uuid": "0d6f1f6e663c1359fb336f5755ce4983b6a0a356", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "#Ham word cloud\n", |
| "plt.figure( figsize=(10,8), facecolor='k')\n", |
| "plt.imshow(ham_wordcloud)\n", |
| "plt.axis(\"off\")\n", |
| "plt.tight_layout(pad=0)\n", |
| "plt.show()" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "_cell_guid": "e879d59e-df84-8de5-51f4-7bb2c01cba6c", |
| "_uuid": "b4cc63a28db28ea87b028196523f6e582440111d" |
| }, |
| "source": [ |
| "# 6. Machine Learning models:" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "_cell_guid": "8f9074dc-58bb-e988-4e73-a08acf80f37e", |
| "_uuid": "9eff04eb8124668b687dfb293f3231e4f22ce687" |
| }, |
| "source": [ |
| "### 6.1 Multinomial Naive Bayes\n", |
| "Generally, Naive Bayes works well on text data. Multinomail Naive bayes is best suited for classification with discrete features. " |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "aadc8cdb-2abd-689d-2b2c-59a9a7933b75", |
| "_uuid": "80068e9d09410a608151f2c546dc42e50c543e4e", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "prediction = dict()\n", |
| "from sklearn.naive_bayes import MultinomialNB\n", |
| "model = MultinomialNB()\n", |
| "model.fit(X_train_df,y_train)" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "9522cfb7-c336-56a6-d792-857f9c8f959d", |
| "_uuid": "91861728aeb7fa76a09916fa10c470921c537a18", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "prediction[\"Multinomial\"] = model.predict(X_test_df)" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "2afd9f38-ad59-7883-0d90-cbc04b76026b", |
| "_uuid": "e38f8d32877bc097f7124433410b2d095b465c69", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "from sklearn.metrics import accuracy_score,confusion_matrix,classification_report" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "fb5f8959-c147-8279-7046-f09fbd1576ca", |
| "_uuid": "ffecdca5e8bc7101822a643058e7ac32a4d3f2fc", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "accuracy_score(y_test,prediction[\"Multinomial\"])" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "_cell_guid": "d149acba-bd73-7ed3-f121-ab8c113401b5", |
| "_uuid": "cd37b86161040f1a57b9b2e4ce1697d2b8957717" |
| }, |
| "source": [ |
| "### 6.2 Logistic Regression" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "e0111bd9-eda4-790a-9bf7-a47b0ce80804", |
| "_uuid": "19c739ca58c7db3dd4ce1eede978f06e6c1f510f", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "from sklearn.linear_model import LogisticRegression\n", |
| "model = LogisticRegression()\n", |
| "model.fit(X_train_df,y_train)" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "094c403c-8fc4-5358-7a3b-b1b7a7428e72", |
| "_uuid": "1f45c83eedc578d35e3032c5838f639c62f0885d", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "prediction[\"Logistic\"] = model.predict(X_test_df)" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "fa7cc0bd-8a38-d560-37bd-c5fcbd2e4596", |
| "_uuid": "c9cfc924232273a227d239eb04b2069c02f988af", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "accuracy_score(y_test,prediction[\"Logistic\"])" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "_cell_guid": "9d14463a-65b4-c019-82b8-d055a942feef", |
| "_uuid": "7ac7398d59923ad8aea47446c60a94eb153fc6ff" |
| }, |
| "source": [ |
| "### 6.3 $k$-NN classifier" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "01cbf471-292c-cb94-cfe0-96cd0c44a6e3", |
| "_uuid": "bbae3006a875ee5abb9ad608a5b80641c5cba5aa", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "from sklearn.neighbors import KNeighborsClassifier\n", |
| "model = KNeighborsClassifier(n_neighbors=5)\n", |
| "model.fit(X_train_df,y_train)" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "ff308d10-f0aa-f829-d072-b94c4bd58849", |
| "_uuid": "87d58a5b91db55f5068d94f90bae200b69497c00", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "prediction[\"knn\"] = model.predict(X_test_df)" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "d71f742a-753c-3b92-18a9-f12487d8b3e3", |
| "_uuid": "c2c1614ddf62a888353c4a7a1f70639e9dd10555", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "accuracy_score(y_test,prediction[\"knn\"])" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "_cell_guid": "507f25c8-331c-b20e-e8a1-502681b2367a", |
| "_uuid": "c4f3bb84f3b9e78f292e0a54da17bb26866c5453" |
| }, |
| "source": [ |
| "### 6.4 Ensemble classifier" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "23d25098-f81d-1e00-1cf2-3f93d2bee68a", |
| "_uuid": "cb5b0203a75af5cf40e118759c3aa124f3d3f459", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "from sklearn.ensemble import RandomForestClassifier\n", |
| "model = RandomForestClassifier()\n", |
| "model.fit(X_train_df,y_train)" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "a968555e-a4e5-00af-453d-484424e15cda", |
| "_uuid": "edbdb0ab03ee313d56e315b11fe9fde53c5cc524", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "prediction[\"random_forest\"] = model.predict(X_test_df)" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "047d881b-31b3-f454-9bac-ca9a1c8b21dc", |
| "_uuid": "17d6302e0690fdd20f6f322a6cb76909bf2501b9", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "accuracy_score(y_test,prediction[\"random_forest\"])" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "a81dac84-49ff-1059-8575-d86d20535042", |
| "_uuid": "eb1ad5de054c7bbfe84376243230b4812311898e", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "from sklearn.ensemble import AdaBoostClassifier\n", |
| "model = AdaBoostClassifier()\n", |
| "model.fit(X_train_df,y_train)" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "95d4ca05-b491-a494-9270-7f649bc1576f", |
| "_uuid": "18356400fa9beb7930f966862207f6b2b25dce1b", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "prediction[\"adaboost\"] = model.predict(X_test_df)" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "6ab20ad0-5e7f-6036-587d-14d29464dba1", |
| "_uuid": "c651ee1d1f6ca35979e7c7257f91dc9580f98c36", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "accuracy_score(y_test,prediction[\"adaboost\"])" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "_cell_guid": "a5b41c54-202e-ee82-46c8-7062ec8f39f6", |
| "_uuid": "6f9c702a4ca8bea2116879ba95f98d12ea375bce" |
| }, |
| "source": [ |
| "# 7. Parameter Tuning using GridSearchCV" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "_cell_guid": "f319b02c-d073-2782-a781-db4bd5bbac62", |
| "_uuid": "82f676629eaa017ceca8403a4f59e5e29c4bf4bd" |
| }, |
| "source": [ |
| "Based, on the above four ML models, Naive Bayes has given the best accuracy. However, Let's try to tune the parameters of $k$-NN using GridSearchCV" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "5da3218e-9506-59ea-0c54-c4c179651c84", |
| "_uuid": "cafe0cdc50af08e2355fbef315243f0c1738268e", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "from sklearn.model_selection import GridSearchCV" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "940406a0-5e5e-3b16-975e-860762606b50", |
| "_uuid": "e6bf5ae4e0d7724521273f420935e590a8e1309e", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "k_range = np.arange(1,30)" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "b343564d-17a7-933b-b571-b3985f7243f7", |
| "_uuid": "2582e2f4bcc415d26b6658ea0bfec47dc9820774", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "k_range" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "44d33a5a-d0cf-1bcf-e9ed-721e419ebad7", |
| "_uuid": "0f2c9349af2195a7cb188e11e49b659297ce8e8f", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "param_grid = dict(n_neighbors=k_range)\n", |
| "print(param_grid)" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "cd25df1b-55fd-e5e6-b955-8cf8e6120082", |
| "_uuid": "a83be47466de6c0b7d92cfa75ebb2f37fbf4e488", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "model = KNeighborsClassifier()\n", |
| "grid = GridSearchCV(model,param_grid)\n", |
| "grid.fit(X_train_df,y_train)" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "873d5aa6-4a01-c581-7f0b-231cd3d9cd10", |
| "_uuid": "7c034512f2f49d084588e3010673707d9ca63c46", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "grid.best_estimator_" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "443e4e3b-fb0b-0222-aab7-1b3c58b41431", |
| "_uuid": "796568a1153e5373d3c152b208f83ce793e23d4e", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "grid.best_params_" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "18b22a0e-020a-3433-651b-abb10e11b025", |
| "_uuid": "a9eb14f34dad56f4e0ea3b918f8dd3a612cc7caa", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "grid.best_score_" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "cec407ca-e4ec-b0d2-015a-d2fe9ab3a5dd", |
| "_uuid": "f17d12e12fbcbf82e9ce8d6b38bab05a09abb16c", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "grid.grid_scores_" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "_cell_guid": "7ed75f56-aa3b-0bf0-76e7-1eaea1cc1e6a", |
| "_uuid": "9495cbc4b58574e79af940024d5d8c8b8b2e9f79" |
| }, |
| "source": [ |
| "# 8. Model Evaluation" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "ec8655f6-af1f-bedf-2a7b-bceb1475da94", |
| "_uuid": "26679dca236c4c51596f1fa3dcce210d693cc6fc", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "print(classification_report(y_test, prediction['Multinomial'], target_names = [\"Ham\", \"Spam\"]))" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "1add69ef-9258-6fd0-696a-cb41b8dafd15", |
| "_uuid": "fd413c8606a62bb7f3e05b1b634ef33608e3cc01", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "conf_mat = confusion_matrix(y_test, prediction['Multinomial'])\n", |
| "conf_mat_normalized = conf_mat.astype('float') / conf_mat.sum(axis=1)[:, np.newaxis]" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "16646402-2915-7e85-f5ac-2c48ad83b056", |
| "_uuid": "b5bb0af95b9002ea4930987af46f2e357101b8dd", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "sns.heatmap(conf_mat_normalized)\n", |
| "plt.ylabel('True label')\n", |
| "plt.xlabel('Predicted label')" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "_cell_guid": "c25278ec-d981-14f7-cc4e-c5e7c4f9f150", |
| "_uuid": "ee88015ea8de01b8456f99fd7bc8ae7c4dafc09d" |
| }, |
| "source": [ |
| "# 9. Future works" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "639ac90d-dd9d-fbae-a4af-c34c7f6185a7", |
| "_uuid": "5c2c416fbf21e7aa0ac7a55352ebd6147fd30be9", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "print(conf_mat)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "_cell_guid": "95893968-407f-da85-dd49-9c6d2009f8d0", |
| "_uuid": "2a5c9244a0196cc6fd5a13eca80ec3e9e940274e" |
| }, |
| "source": [ |
| "By seeing the above confusion matrix, it is clear that 5 Ham are mis classified as Spam, and 8 Spam are misclassified as Ham. Let'see what are those misclassified text messages. Looking those messages may help us to come up with more advanced feature engineering." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "4e8e1fd3-718d-77bf-57f6-4773530a43ce", |
| "_uuid": "46644025c8daf2b8a4ca37dd8175cd2ae8f44e18", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "pd.set_option('display.max_colwidth', -1)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "_cell_guid": "07206435-e5c8-4baa-a27e-de171143867e", |
| "_uuid": "ead1dbbc7748afb0976946fd26c3263d8136334a" |
| }, |
| "source": [ |
| "I increased the pandas dataframe width to display the misclassified texts in full width. " |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "_cell_guid": "71d65dcf-1ca7-e954-9f19-c75b227e5266", |
| "_uuid": "f0ac9c328da1e382535e31169f8f7da5a0135aa2" |
| }, |
| "source": [ |
| "### 9.1 Misclassified as Spam" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "909266b9-0610-bd47-6163-f59586ef7f99", |
| "_uuid": "44122bdd296504c391690644db0de23c26c992be", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "X_test[y_test < prediction[\"Multinomial\"] ]" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "_cell_guid": "dcea4ad9-894b-794c-897c-b34ff64cdb9d", |
| "_uuid": "d0c3aaafb48fbfcc369211b340bd85b7cda3f954" |
| }, |
| "source": [ |
| "### 9.2 Misclassfied as Ham" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "9a8ea5f0-c67c-6864-7407-fd8a24985793", |
| "_uuid": "d97d93d90003beceba56b2023043dd5828944844", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [ |
| "X_test[y_test > prediction[\"Multinomial\"] ]" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": { |
| "_cell_guid": "67ccd623-3ff8-0b0b-0c58-561534d4e336", |
| "_uuid": "1c8da036b1e7b1ae6501148d2e4e1cac22b7f5d3" |
| }, |
| "source": [ |
| "It seems length of the spam text is much higher than the ham. Maybe we can include length as a feature. In addition to unigram, we can also try bigram features. " |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "metadata": { |
| "_cell_guid": "c9728d83-d619-8663-31a7-af7a45ad6a3e", |
| "_uuid": "5e40a5ad87d67facdf6d15d5b4f99b936984bb73", |
| "collapsed": true |
| }, |
| "outputs": [], |
| "source": [] |
| } |
| ], |
| "metadata": { |
| "_change_revision": 0, |
| "_is_fork": false, |
| "kernelspec": { |
| "display_name": "Python 3", |
| "language": "python", |
| "name": "python3" |
| }, |
| "language_info": { |
| "codemirror_mode": { |
| "name": "ipython", |
| "version": 3 |
| }, |
| "file_extension": ".py", |
| "mimetype": "text/x-python", |
| "name": "python", |
| "nbconvert_exporter": "python", |
| "pygments_lexer": "ipython3", |
| "version": "3.6.4" |
| } |
| }, |
| "nbformat": 4, |
| "nbformat_minor": 1 |
| } |