{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"How to win at assignment 3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I'm going to show you one way of going about creating an \"is vampire\" function. For starters, when I looked at the histograms of the data, they all looked pretty Gaussian (normal).\n",
"\n",
"If you've taken a stats class, you may have seen *regression*. It works like this:\n",
"\n",
"- I have some variables, let's say two, that each contain a set of data values (like an array). Maybe they look like this:\n",
" - `garlic=[0.1,0.4,0.2,0.1,0.9,0.8]`\n",
" - `shiny=[0.2,0.5,0.6,0.1,0.4,0.2]`\n",
"- I have some other variable, like \"probability of being a vampire\", that I don't know the value of... but I *think* it's a function of `garlic` and `shiny`. Mathematically, I might write:\n",
"\n",
"$$ p_{vampire} = \\beta_{g} \\cdot g + \\beta_{s} \\cdot s $$\n",
"\n",
"The betas are the *weights* I assign to each of `garlic` and `shiny`.\n",
"\n",
"So my regression problem is now this:\n",
"\n",
"- Given the data in `garlic` and `shiny`, can I find values for $\\beta_g$ and $\\beta_s$ that will make my predictions of $p_{vampire}$ as good as possible on my training dataset?\n",
"\n",
"Depending on what you know about your data, there are different ways to do this. We're assuming here that our variables are Gaussian (which looks pretty justified from our plotting in Part 2) and that they are all *independent* (is *this* justified?). We've also got a special case here in that the underlying truth is binary: either you *are* a vampire, or you're *not*. So we're going to use something called *logistic regression* to find our betas. The details of this algorithm are beyond the background we assume for this class, so we'll let sklearn do the heavy lifting for us.\n"
]
},
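{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of what logistic regression does with those weights (the intercept $\\beta_0$ and the symbol $\\sigma$ are notation I'm adding here, not something from the assignment): instead of using the weighted sum directly as a probability, it squashes the sum through the logistic function, so the result always lands between 0 and 1:\n",
"\n",
"$$ p_{vampire} = \\sigma(\\beta_0 + \\beta_{g} \\cdot g + \\beta_{s} \\cdot s), \\qquad \\sigma(z) = \\frac{1}{1 + e^{-z}} $$\n",
"\n",
"For example, a weighted sum of $z = 0$ gives $\\sigma(0) = 0.5$, and large positive or negative sums push the probability toward 1 or 0.\n"
]
},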
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll start by loading the data. Instead of our function from asn3, I'll show you another handy way of loading CSVs using the wonderful [Pandas](http://pandas.pydata.org/) data analysis library:"
]
},
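{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"import numpy as np\n",
"# header=False is rejected by current pandas. header=0 uses the first row as column names;\n",
"# use header=None instead if your CSV has no header row.\n",
"data = pd.read_csv('big_data.csv', header=0)\n"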
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's make sure that worked:\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data.shape"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 4,
"text": [
"(19999, 8)"
]
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looks good. Pandas loads things into a very powerful structure called a `DataFrame`, but we just want a plain NumPy array, so let's take care of that. First, let's pull out the three columns that I've decided are relevant: garlic aversion, stake aversion and reflectance (you might have made other choices, and that's OK too!):"
]
},
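{
"cell_type": "code",
"collapsed": false,
"input": [
"# as_matrix() was removed in pandas 1.0; select columns by position, then convert to a float array\n",
"X = data.iloc[:, [3, 4, 5]].to_numpy(dtype=float)"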
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's load the vampire/not vampire column:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Column 7 holds the vampire / not-vampire label\n",
"y = data.iloc[:, 7].to_numpy(dtype=float)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is standard notation in regression problems: your feature matrix is called $X$ and your ground truth labels are called $y$. Let's import sklearn and create a logistic regression object:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import sklearn.linear_model\n",
"logreg = sklearn.linear_model.LogisticRegression()\n"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Can you guess how we fit the regression model? Just like we fit everything with sklearn:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"logreg.fit(X, y)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 8,
"text": [
"LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
" intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)"
]
}
],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Done! What I really want now is the value of those $\\beta$s... I want to know how I should weight the various columns to do a good job of guessing who's a vampire:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print(logreg.coef_)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[[ 7.65579245 -4.97821847 -13.83041574]]\n"
]
}
],
"prompt_number": 9
},
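{
"cell_type": "markdown",
"metadata": {},
"source": [
"Written out with the numbers above, and assuming the weights come back in the order we loaded the columns (garlic aversion, stake aversion, reflectance; the names $x_{garlic}$, $x_{stake}$ and $x_{reflect}$ are just labels I'm using for illustration), the weighted sum is roughly:\n",
"\n",
"$$ z = 7.66 \\cdot x_{garlic} - 4.98 \\cdot x_{stake} - 13.83 \\cdot x_{reflect} $$\n",
"\n",
"The fitted model also has an intercept (stored in `logreg.intercept_`), which isn't printed here. If a label of 1 means \"vampire\" (an assumption about how that column is coded), a positive weight pushes the prediction toward vampire and a negative weight pushes it away.\n"
]
},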
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now I just take those values and plug them into the regression equation (which I've formulated as a dot product here)... and let that be my model! "
]
},
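{
"cell_type": "code",
"collapsed": false,
"input": [
"def is_vampire(row):\n",
"    # Weights copied from logreg.coef_ above (the fitted intercept is not used here)\n",
"    coeffs = np.array([7.65579245, -4.97821847, -13.83041574])\n",
"    # Training used columns 3, 4, 5 of `data`; check that positions 2, 3, 4 of `row` hold the same features\n",
"    return np.clip(np.dot(coeffs, row[[2, 3, 4]]), 0.0001, 0.9999)\n"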
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This classifier still has a lot of problems, but it gets a log likelihood of -35 when tested on the small dataset (notice I *trained* on the big dataset and tested on the small one... no double dipping!), which is pretty good."
]
},
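{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, here's the usual way a log likelihood is computed for yes/no labels. I'm assuming the assignment's scorer uses this standard Bernoulli form (check the grading code to be sure), with $y_i$ the true label (1 or 0) and $p_i$ the value `is_vampire` returns for row $i$:\n",
"\n",
"$$ \\log L = \\sum_{i} \\left[ y_i \\log p_i + (1 - y_i) \\log (1 - p_i) \\right] $$\n",
"\n",
"The best possible score is 0, so values closer to 0 are better. This is presumably also why the predictions are clipped away from exactly 0 and 1: a confident wrong answer would otherwise contribute $\\log 0 = -\\infty$.\n"
]
},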
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There is a whole zoo of regression and regression-like methods for data of different types. We assumed normality and independence here, and those assumptions often don't hold. If you want a *completely general* way of finding parameters for a probabilistic model, you can use what's called a [Markov chain Monte Carlo](http://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo) approach, which sounds quite a bit scarier than it actually is. As you might expect by now, Python has a nice package for doing this: [PyMC](https://github.com/pymc-devs/pymc).\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}