{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Application of Bayes' Theorem: Classifying Spam Email\n", "\n", "## INFO 2301, Spring 2019" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "### Loading the probabilities\n", "\n", "The following block of code reads in probabilities that have already been calculated. These were calculated from the SMS spam classification dataset in the [UCI machine learning repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).\n", "\n", "The first part of the code reads in the conditional probability tables from the file, `naive_bayes_probs.tsv`. You should place this file in the same directory as this notebook, in order for it to be accessible by the code below.\n", "\n", "The conditional probabilities are stored in a two-dimensional dictionary called `conditional_probabilities`. The first key is the category (either \"spam\" or \"ham\"), and the second key is the word. The value is the probability of the word given the category. For example:\n", "\n", "* `conditional_probabilities['spam']['claim']` contains the probability, $P(word=\\textrm{\"claim\"} | category=\\textrm{\"spam\"})$\n", "* `conditional_probabilities['ham']['said']` contains the probability, $P(word=\\textrm{\"said\"} | category=\\textrm{\"ham\"})$\n", "\n", "There is additional code to define the marginal probabilities of the two categories, $P(category=\\textrm{\"spam\"})$ and $P(category=\\textrm{\"ham\"})$. Since there were only two classes and they do not change, these values were simply written directly into the Python code rather than being loaded from a file. These probabilities are stored in a dictionary called `marginal_probabilities`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "conditional_probabilities = {}\n", "conditional_probabilities['ham'] = {}\n", "conditional_probabilities['spam'] = {}\n", "\n", "marginal_probabilities = {}\n", "marginal_probabilities['ham'] = 0.866\n", "marginal_probabilities['spam'] = 0.134\n", "\n", "words = set()\n", "\n", "for line in open('naive_bayes_probs.tsv'): # tab-separated values file\n", " cols = line.strip().split('\\t')\n", " \n", " label = cols[0]\n", " word = cols[1]\n", " prob = float(cols[2])\n", " \n", " words.add(word)\n", " conditional_probabilities[label][word] = prob" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bayes' theorem calculation\n", "\n", "When you run the cell below, it asks you to enter a word as input and then prints the probability that the word belongs to spam. The probability is calculated using Bayes' theorem: the probabilities loaded above correspond to $P(word|category)$, while the code below calculates $P(category|word)$." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "word = input()\n", "\n", "if word in words:\n", " numerator = conditional_probabilities['spam'][word] * marginal_probabilities['spam']\n", " denominator = numerator + (conditional_probabilities['ham'][word] * marginal_probabilities['ham'])\n", " p = numerator / denominator\n", " \n", " print(p, 'probability of spam')\n", "else:\n", " print(word, 'not in dataset, probability unknown')" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### Extra Credit [up to 6 points]\n", "\n", "Write code to determine the 10 words with the highest probability of being spam $P(word|category=\\textrm{\"spam\"})$ and the 10 words with the highest probability of being ham $P(word|category=\\textrm{\"ham\"})$.\n", "\n", "If you wish to do this optional problem for extra credit, submit this notebook with your solution by **5pm Tuesday April 16.** Late submissions will not be graded. You may work with up to 3 collaborators, and you must include their names. It is fine to use resources like StackOverflow to help you with this problem (for example, you may need to look up how to sort a dictionary in order to get the top 10 values), but you should cite any resources you use. Partial credit be given to incorrect solutions that make partial progress. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Solution goes here" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 2 }