How to win at assignment 3

I'm going to show you one way of going about creating an "is vampire" function. For starters, I noticed that when I looked at the histograms of the data... they all looked pretty Gaussian (normal).

If you've taken a stats class, you may have seen regression. It works like this:

\[ p_{vampire} = \beta_{g} \cdot g + \beta_{s} \cdot s \]

The betas are the weights I assign to each of garlic and shiny.

So my regression problem is now this:

Depending on what you know about your data, there are different ways to do this. We're assuming here that our variables are Gaussian (looks pretty justifed from our plotting in Part 2) and that they are all independent (is this justified?). We've also got a special case here in that the underlying truth is binary: either you are a vampire, or you're not. So we're going to use something called logistic regression to find our betas. The details of this algorithm are beyond the background we assume for this class, so we'll let sklearn do the heavy lifting for us.

We'll start by loading the data. Instead of our function from asn3, I'll show you another handy way of loading CSVs using the wonderful Pandas data analysis library:

In []:
import pandas as pd
import numpy as np
data = pd.read_csv('big_data.csv', header=False)

Let's make sure that worked:

In [4]:
data.shape
Out[4]:
(19999, 8)

Looks good. Pandas loads things into a very powerful structure called a 'dataframe', but we just want a plain numpy matrix... so let's take care of that. First, let's load the three columns that I've decided are relevant: garlic aversion, stake aversion and reflectance (you might have made other choices -- that's ok, too!):

In [5]:
X=data.as_matrix()[:,[3,4,5]].astype(float)

Now let's load the vampire/not vampire column:

In [6]:
y=data.as_matrix()[:,7].astype(float)

This is standard notation in regression problems: your feature matrix is called \(X\) and your ground truth labels are called \(y\). Let's import sklearn and create a logistic regression object:

In [7]:
import sklearn.linear_model
logreg = sklearn.linear_model.LogisticRegression()

Can you guess how we fit the regression model? Just like we fit everything with sklearn:

In [8]:
logreg.fit(X, y)
Out[8]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)

Done! What I really want now is the value of those \(\beta\)s... I want to know how I should weight the various columns to do a good job of guessing who's a vampire:

In [9]:
print logreg.coef_
[[  7.65579245  -4.97821847 -13.83041574]]

Now I just take those values and plug them into the regression equation (which I've formulated as a dot product here)... and let that be my model!

In [10]:
def is_vampire(row):
	coeffs=numpy.array([[  7.65579245,  -4.97821847, -13.83041574]])
	return numpy.clip( numpy.dot(coeffs,row[[2,3,4]])[0],0.0001,0.9999)

This classifer still has a lot of problems, but it gets a log likelihood of -35 when testing on the small dataset (notice I trained with the big dataset and tested on the small... no double dipping!), which is pretty good.

There are a zoo of regression and regression-like methods for data of different types. We assumed normality and independence here, which often aren't true. If you want a completely general way of finding parameters for a probabilistic model, you can do that with what's called a Markov Chain Monte Carlo approach, which sounds quite a bit scarier than it actually is. As you might expect by now, Python has a nice package for doing this: PyMC.

In []: