
Enron was one of the world's major natural gas, electricity, communications, and pulp and paper companies, which according to [Forbes] claimed a revenue of \$111 billion during 2000, up from \$13.8 billion in 1996 [Enron Red Flags]. The rise of Enron was largely a lie, and the company declared bankruptcy in 2001. According to analyses of Enron's financial statements, much of their reported revenue came from what is known as "mark-to-market" or MTM accounting, which essentially means that they reported projected earnings as realized earnings, creating the illusion that Enron was one of the most profitable energy and communications companies in the world.
What follows is an analysis of the Enron dataset - freely available at https://www.cs.cmu.edu/~./enron/ - containing a couple of hundred thousand emails between Enron employees and financial records of income, both payments and bonuses, as well as a list of people marked as persons of interest (POI) in the Enron scandal. It should be noted that, according to an article by Julia La Roche in [Business Insider] on Andrew Fastow, the activities that led to the rise and fall of Enron aren't unique. According to Fastow, there are plenty of companies, large and small, that use the same shady/unethical/immoral financial techniques, most notably Apple.
The analysis of the Enron dataset, both financial and email, will be the foundation for choosing input features for a machine learning algorithm whose sole purpose is to predict whether a person was part of the scam.
To make the POI predictions, various machine learning (ML) algorithms (MLAs) will be tested on a subset of the data with a variable set of features, and the pros and cons of each algorithm will be described briefly. The reason a normal exploratory analysis wouldn't be as efficient on this data is its complexity: I doubt any analysis will find a split through one or two features that effectively separates POIs from non-POIs. This is where ML really shines - some MLAs in python's [Scikit-Learn] even report feature importances.
MLAs work by minimizing an error function through iterative tuning of the algorithm's parameters. Some MLAs work better for classification, others for regression. In many cases tuning the MLA parameters is difficult, and depending on the algorithm, it can be very sensitive to the input features and types of data. Sometimes one needs to scale the data, other times one shouldn't, and finding the best MLA, input features, and initial parameters is almost always an iterative process. Therefore the solutions provided by a trained MLA will be the best possible predictions available for that specific algorithm, with those specific parameters, and with the training data and features available.
For this reason, choosing the right algorithm and finding the minimum set of important features is vital to the prediction rate of any MLA. I will try my best to explain the selection of the various MLAs and features, but I assume that the reader has basic knowledge of python.
This analysis is done as a part of my Data Scientist Nanodegree at Udacity.
from IPython.core.display import HTML
from IPython.display import display

def css_styling():
    # styles = open("C:/Users/rogvid/.jupyter/custom/custom.css", "r").read()
    styles = open("custom.css", "r").read()  # or edit path to custom.css
    return HTML(styles)

css_styling()
The data was obtained by cloning the Udacity git repository using the following command:
>> git clone https://github.com/udacity/ud120-projects.git
Through the Introduction to Machine Learning course on Udacity, we created and used a number of functions, some of which were provided as testing functions for this final project. To get an initial feel for the data, we can start by running a Naive Bayes classifier on "only" the financial data, with a bare minimum of features and no data wrangling or cleaning.
#!/usr/bin/python
### The following imports and settings are configurations for the notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
###
import sys
import pickle
sys.path.append("../tools/")
plt.rcParams['figure.figsize'] = (14, 10)
### python scripts for formatting the financial data and splitting it into
### Target and Features, as well as a "tester" which evaluates the
### trained MLA
from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data
### Task 1: Select what features to use.
### features_list is a list of strings, each of which is a feature name.
### The first feature must be "poi".
features_list = ['poi','salary']
### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "r") as data_file:
data_dict = pickle.load(data_file)
### Task 2: Remove outliers
### Task 3: Create new feature(s)
### Store to my_dataset for easy export below.
my_dataset = data_dict
### Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)
### Task 4: Try a varity of classifiers
### Please name your classifier clf for easy export below.
### Note that if you want to do PCA or other multi-stage operations,
### you'll need to use Pipelines. For more info:
### http://scikit-learn.org/stable/modules/pipeline.html
# Provided to give you a starting point. Try a variety of classifiers.
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
### Task 5: Tune your classifier to achieve better than .3 precision and recall
### using our testing script. Check the tester.py script in the final project
### folder for details on the evaluation method, especially the test_classifier
### function. Because of the small size of the dataset, the script uses
### stratified shuffle split cross validation. For more info:
### http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedShuffleSplit.html
# Example starting point. Try investigating other evaluation techniques!
from sklearn.cross_validation import train_test_split
features_train, features_test, labels_train, labels_test = \
train_test_split(features, labels, test_size=0.3, random_state=42, stratify=labels)
### Task 6: Dump your classifier, dataset, and features_list so anyone can
### check your results. You do not need to change anything below, but make sure
### that the version of poi_id.py that you submit can be run on its own and
### generates the necessary .pkl files for validating your results.
dump_classifier_and_data(clf, my_dataset, features_list)
%run tester.py
As we can see, the tester prints out some information on how well the specific classifier did on the data. The definitions of these scores can be found in many places, but simply put, the $F_\beta$ score is a weighted harmonic mean of precision and recall:
$$ F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}} $$
where $\beta < 1$ weights precision higher, and $\beta > 1$ weights recall higher. A measure that isn't represented here is Cohen's Kappa. Cohen's Kappa is a statistic which measures the inter-rater agreement for qualitative items. Our label variable is itself an uncertain classification - the poi label stamps a person as interesting for a given investigation, and since we cannot be sure that every stamp is objectively correct, it is likely that the investigators have labeled some people incorrectly - so Cohen's Kappa is a good measure for this particular problem. In using Cohen's Kappa, one rater would be the actual poi labels and the other rater would be the machine learning classifier. Cohen's Kappa is mathematically defined as:
$$ \kappa = \frac{p_o - p_e}{1 - p_e} $$
where $p_o$ is the relative observed agreement between raters, and $p_e$ is the hypothetical probability of chance agreement.
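To make the definition concrete, here is a minimal sketch (using hypothetical toy labels, not the Enron data) that computes $\kappa$ by hand and checks it against sklearn's cohen_kappa_score:

import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical toy labels: "rater 1" is the true poi flag, "rater 2" the classifier
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 1, 0, 1, 0, 0])

p_o = np.mean(y_true == y_pred)                      # observed agreement
p_pos = np.mean(y_true == 1) * np.mean(y_pred == 1)  # chance agreement on the positive class
p_neg = np.mean(y_true == 0) * np.mean(y_pred == 0)  # chance agreement on the negative class
p_e = p_pos + p_neg

print((p_o - p_e) / (1 - p_e))             # manual kappa
print(cohen_kappa_score(y_true, y_pred))   # should match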
NOTE: For the project 6 hand-in for the Udacity Data Analyst Nanodegree, we are asked to maximize precision and recall, or more precisely to optimize our predictions so that both precision and recall are greater than $0.3$. To get a better understanding of these metrics, I'll give a short description of each. Precision and recall are mathematically defined as:
$$ \begin{align} Precision &= \frac{\sum{\text{TRUE POSITIVES}}}{\sum{\text{TRUE POSITIVES}} + \sum{\text{FALSE POSITIVES}}}\\ Recall &= \frac{\sum{\text{TRUE POSITIVES}}}{\sum{\text{TRUE POSITIVES}} + \sum{\text{FALSE NEGATIVES}}} \end{align} $$
The intuitive explanation is that the precision score tells you how good the predictor is at classifying events correctly. In terms of this dataset, it tells us how often a person predicted to be a POI actually is a POI. It doesn't, however, tell us anything about the POIs the predictor misses: if the dataset contains 130 people and 10 POIs, but the predictor flags only one person and that person is a POI, the precision is 1, even though the predictor has only found 1 out of 10. So precision in itself is not a sufficient measure, as it neglects the false negatives. To supplement precision, we can use the recall score, which tells us how many of the actual POIs the predictor manages to find. Using the same example as above, the recall score would be $1/10 = 0.1$.
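The toy numbers above can be reproduced directly with sklearn's metric functions; this is just a sanity check of the definitions on a made-up example, not part of the Enron analysis:

import numpy as np
from sklearn.metrics import precision_score, recall_score

# 130 people, 10 of them POIs; the predictor flags exactly one person, and that one is a real POI
y_true = np.array([1] * 10 + [0] * 120)
y_pred = np.array([1] + [0] * 129)

print(precision_score(y_true, y_pred))  # 1.0 -> no false positives
print(recall_score(y_true, y_pred))     # 0.1 -> only 1 of the 10 POIs found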
data_array = np.array([[v for k, v in y.items()] for x, y in my_dataset.items()])
header = np.array([[k for k, v in y.items()] for x, y in my_dataset.items()])[0, :]
rows = np.array([x for x, y in my_dataset.items()])
data_df = pd.DataFrame(data_array, index=rows, columns=header)
data_df = data_df.convert_objects(convert_numeric=True)
display(HTML("<center>{0}</center>".format(data_df.dtypes.to_frame("dtypes").to_html())))
So in the financial data we have two features that aren't numeric. The poi feature is a boolean describing whether or not a given row/person is a person of interest, and email_address is, of course, the person's Enron email address.
Let's get a feel for the financial data by printing out a description of it.
display(HTML("<center>{0}<br/>{1}</center>".format("Shape of dataframe: (rows, columns) = ({0},{1})".format(data_df.shape[0], data_df.shape[1]), data_df.describe().T.to_html())))
So as we can see, the dataset contains financial data for 146 people, and from the 'count' column and the percentile columns we can see that the dataset contains many NaN (Not a Number) elements. Furthermore, the poi and email_address features weren't included in this description. Let's see how many POIs there actually are in this financial dataset.
display(HTML("<center>Number of POIs = {0}</center>".format(np.sum([i == "True" for i in data_df['poi'].values]))))
To dig a bit deeper into the amount of missing data, we can look at the sorted 'count' column divided by the number of rows in the dataframe to get the percentage of actual data points per feature:
counts = data_df.describe().T[['count']].sort_values('count') / float(len(data_df)) * 100
display(
HTML(
"<center>{0}</center>".format(
pd.concat([counts.head(5),pd.DataFrame(data={"count":".."}, index=[".."]), counts.tail(5)]).to_html()
)
)
)
So we have as little as 2.7% data for loan advances, while the feature with the most data has 86.3% non-empty data points. Given that the dataset only contains 146 rows, 2.7% corresponds to 4 data points. We can check whether this feature has any predictive power by checking if the 4 people who got loan advances were POIs.
display(HTML("<center>{0}</center>".format(data_df.loc[data_df['loan_advances'].dropna().index, 'poi'].to_frame("poi").to_html())))
So 1 out of the 4 people who got loan advances was a POI. Since the number of actual data points is so small, only 1 in 4 was a POI, and one of the rows is TOTAL, this feature can be disregarded. We can do the same check for director fees.
display(HTML("<center>{0}</center>".format(data_df.loc[data_df['director_fees'].dropna().index, 'poi'].to_frame("poi").to_html())))
Besides the fact that none of these are POIs, one of these names is TOTAL. Not to disrespect anyone, but I believe the TOTAL "person" is actually just the sum over all the people in the financial data set. This row can therefore be removed, leaving us with 145 rows of data. To make sure this makes sense, we can check whether the values of all other people in the data set sum up to the TOTAL row.
header = " Feature | SUM($) | TOTAL($) | DIFF "
print(header)
print("="*len(header))
for s, t in zip(data_df.columns, data_df.loc['TOTAL']):
try:
print "{0:<29s}|{1:>14.2f} mio. |{2:>15.2f} mio. |{3:>12.3f}%".format(s, data_df.loc[data_df.index != "TOTAL", s].abs().sum()/1000.0, np.abs(t) / 1000.0, (data_df.loc[data_df.index != "TOTAL", s].abs().sum() - np.abs(t))/np.abs(t) * 100)
except:
pass
#data_df.drop("TOTAL", inplace=True)
Again we see that there are some discrepancies. If all were right, the sum over each feature would equal the TOTAL row. However, there are some differences, most notably in restricted stock deferred. I went back and looked at the [insider pay financial report]. I quite quickly found that the information in the data set for BHATNAGAR, SANJAY didn't match the insider pay financial report. In the report, Sanjay has \$137,864 in expenses and total payments, \$15,456,290 in exercised stock options and total stock value, and \$2,604,490 in restricted stock and restricted stock deferred. However, this is what his financials look like in the loaded data:
data_df.loc['BHATNAGAR SANJAY']
So it looks as though the information on him has somehow been shifted, resulting in wrong values for several features. To fix this we can manually rearrange the values.
data_df.loc['BHATNAGAR SANJAY', 'expenses'], data_df.loc['BHATNAGAR SANJAY', 'other'] = data_df.loc['BHATNAGAR SANJAY', 'other'], np.nan
data_df.loc['BHATNAGAR SANJAY', 'total_payments'], data_df.loc['BHATNAGAR SANJAY', 'director_fees'] = data_df.loc['BHATNAGAR SANJAY', 'director_fees'], np.nan
data_df.loc['BHATNAGAR SANJAY', 'exercised_stock_options'], data_df.loc['BHATNAGAR SANJAY', 'restricted_stock'] = data_df.loc['BHATNAGAR SANJAY', 'restricted_stock_deferred'], -data_df.loc['BHATNAGAR SANJAY', 'restricted_stock']
data_df.loc['BHATNAGAR SANJAY', 'restricted_stock_deferred'], data_df.loc['BHATNAGAR SANJAY', 'total_stock_value'] = -data_df.loc['BHATNAGAR SANJAY', 'restricted_stock'], data_df.loc['BHATNAGAR SANJAY', 'exercised_stock_options']
data_df.loc['BHATNAGAR SANJAY']
header = " Feature | SUM($) | TOTAL($) | DIFF "
print(header)
print("="*len(header))
for s, t in zip(data_df.columns, data_df.loc['TOTAL']):
try:
print "{0:<29s}|{1:>14.2f} mio. |{2:>15.2f} mio. |{3:>12.3f}%".format(s, data_df.loc[data_df.index != "TOTAL", s].abs().sum()/1000.0, np.abs(t) / 1000.0, (data_df.loc[data_df.index != "TOTAL", s].abs().sum() - np.abs(t))/np.abs(t) * 100)
except:
pass
#data_df.drop("TOTAL", inplace=True)
We got closer to a complete match between the TOTAL row and the sum of all values, but there are still discrepancies. I therefore went back, analyzed the director fees in the financial report, and matched them with the data set. I found that BELFER ROBERT also had values shifted so that they didn't match the correct features.
data_df.loc['BELFER ROBERT', 'expenses'] = data_df.loc['BELFER ROBERT', 'exercised_stock_options']
data_df.loc['BELFER ROBERT', 'exercised_stock_options'] = np.nan
data_df.loc['BELFER ROBERT', 'director_fees'] = data_df.loc['BELFER ROBERT', 'total_payments']
data_df.loc['BELFER ROBERT', 'total_payments'] = data_df.loc['BELFER ROBERT', 'expenses']
data_df.loc['BELFER ROBERT', 'deferred_income'] = data_df.loc['BELFER ROBERT', 'deferral_payments']
data_df.loc['BELFER ROBERT', 'deferral_payments'] = np.nan
data_df.loc['BELFER ROBERT', 'restricted_stock'] = data_df.loc['BELFER ROBERT', 'restricted_stock_deferred']
data_df.loc['BELFER ROBERT', 'restricted_stock_deferred'] = -data_df.loc['BELFER ROBERT', 'restricted_stock_deferred']
data_df.loc['BELFER ROBERT', 'total_stock_value'] = np.nan
header = " Feature | SUM($) | TOTAL($) | DIFF "
print(header)
print("="*len(header))
for s, t in zip(data_df.columns, data_df.loc['TOTAL']):
try:
print "{0:<29s}|{1:>14.2f} mio. |{2:>15.2f} mio. |{3:>12.3f}%".format(s, data_df.loc[data_df.index != "TOTAL", s].abs().sum()/1000.0, np.abs(t) / 1000.0, (data_df.loc[data_df.index != "TOTAL", s].abs().sum() - np.abs(t))/np.abs(t) * 100)
except:
pass
Finally, after going through the financial data a couple of times, we can see that there is a one-to-one match between the sum of the people's values and the TOTAL row.
Before I move on from this very tedious cleansing of the data, I want to point out one more thing I found in the dataset that we don't want. Looking at the names in the financial report, I noticed that the last name above the TOTAL row is THE TRAVEL AGENCY IN THE PARK. Reading the footnotes of the financial report, it turns out this travel agency was co-owned by the sister of Enron's former chairman. Since this is not a person, I will remove this row from the dataset, along with the TOTAL row.
data_df.drop("THE TRAVEL AGENCY IN THE PARK", inplace=True)
data_df.drop("TOTAL", inplace=True)
Now I will move on to slightly more visual ways of inspecting the data.
Now that we've manually cleaned out some data that was obviously wrong, it is time to look at the data and see if we can get some information from a few simple visualizations.
First I want to filter out the poi and email address information from this dataframe, which leaves me with a fully numeric dataframe that I can visualize to check for outliers.
all_data = data_df.copy()
label_df = data_df.pop("poi").to_frame("poi")
email_address_df = data_df.pop("email_address").to_frame("email_address")
plt.scatter(data_df['salary'], data_df['bonus'], s=80, alpha=0.5)
plt.xlabel("salary")
plt.ylabel("bonus")
plt.xlim(0, data_df['salary'].max() + data_df['salary'].max()*0.05)
plt.ylim(0, data_df['bonus'].max() + data_df['bonus'].max()*0.05)
plt.title('Salary-Bonus Relation')
There are some outliers, but it's difficult to say whether they are outliers because of bad data or because they were simply well-paid employees. To see the names of these well-paid employees, we can look at the top 10 earners in the data set.
display(HTML("<center>{0}</center>".format(data_df.sort_values('salary', ascending=False).salary.head(10).to_frame("Salary").to_html())))
For those of you who've seen the "Enron: The Smartest Guys in the Room" documentary, some of these names will be very familiar, most notably Jeffrey Skilling, Kenneth Lay, and Andrew Fastow. We can redraw the figure above with POIs plotted in red to see if there is any clear pattern.
from sklearn import linear_model
x = np.arange(0, 1300000)
reg = linear_model.LinearRegression()
data_features = data_df[['salary', 'bonus']].dropna()
reg.fit(data_features['salary'].values[np.newaxis].T,
data_features['bonus'].values[np.newaxis].T)
print("r-squared: {0}".format(reg.score(data_features['salary'].values[np.newaxis].T,
data_features['bonus'].values[np.newaxis].T)))
pred = reg.predict(x[np.newaxis].T)
labels = label_df.copy()
labels['poi'] = [False if l[0] == 'False' else True for l in label_df.values]
POIS = labels[labels['poi'] == True]
# Create scatter plot of salary and bonus
# and plot poi's on top to illustrate outlier importance
plt.scatter(data_df['salary'],
data_df['bonus'],
color='b',
s=180,
alpha=0.5,
label='Observations')
plt.scatter(data_df.loc[POIS.index, 'salary'],
data_df.loc[POIS.index, 'bonus'],
marker='o',
color='r',
s=100,
label='POIs')
plt.plot(x, pred, '--', linewidth=2.0, alpha=0.9, label='Regression-Line')
plt.legend(loc=1, bbox_to_anchor=(1.20, 1.01))
plt.xlim(0, data_df['salary'].max() + data_df['salary'].max()*0.05)
plt.ylim(0, data_df['bonus'].max() + data_df['bonus'].max()*0.05)
plt.title('Salary-Bonus Relation')
By fitting a line to the data, we can see that there is a correlation between bonus and salary. Given that we have an $r^2 = 0.27$, it is hard to say whether there is a weak linear correlation or whether the data is slightly non-linear.
Looking at the visualization there are still a couple of data points which one could argue are outliers, but this might be an indication that these data points are POIs. To illustrate what I mean, I've plotted the POIs on top of the data points to show that some of the outliers are in fact POIs.
Before moving on to a machine learning approach, I'll have a last look at the features, to see if any of them should be investigated.
for i, f in enumerate(data_df.columns):
print "Feature {1:02.0f}: {0:>85s}".format(f, i)
Most of the features look like normal information found in financial statements, but some of them seem a bit "too" informative. There are several features with POI in the feature name. All of these features - shared_receipt_with_poi, from_this_person_to_poi, and from_poi_to_this_person - contain information that is only known because the POIs have already been marked. These features are classic examples of data leakage, i.e. POI knowledge is intrinsic to them, and using them would lead to an overfit model. Another issue is that they were manufactured by the makers of the data set: someone marked people as POI and then used this POI information to create these leaking features. This could introduce errors if someone is falsely labeled as a POI or vice versa. Of course an MLA would have difficulty predicting POIs if a large percentage of the people marked as POI were falsely marked, but if only a small percentage are falsely labeled, the target for the MLA wouldn't be too greatly affected, as these could be treated as outliers. However, features which rely heavily on the knowledge of who is and who isn't a POI, like shared_receipt_with_poi, which more than likely would have a very high feature importance, would be distorted, leading to errors in the input features.
I will therefore remove all features that have poi information as part of the feature.
try:
data_df.drop([c for c in data_df.columns if "poi" in c], axis=1, inplace=True)
except ValueError as e:
print ""
In the end, we end up with a slightly smaller dataset. I would, however, like to add a feature to it. Looking through the financial data gave me the idea that there was a relation between relatively low salaries and high stock values, except in special cases. To analyze this further I am adding the feature total_stock_value / salary.
data_df['stock_salary_fraction'] = [data_df.loc[person, 'total_stock_value'] / data_df.loc[person, 'salary'] if data_df.loc[person, 'salary'] > 10000 else 0 for person in data_df.index]
all_data['stock_salary_fraction'] = [all_data.loc[person, 'total_stock_value'] / all_data.loc[person, 'salary'] if all_data.loc[person, 'salary'] > 10000 else 0 for person in all_data.index]
Now we can visualize the distribution of "stock_salary_fraction" for the POI and non POI:
fig, ax = plt.subplots(2, 1)
all_data.loc[all_data['poi'] == "True", "stock_salary_fraction"].plot(kind='hist', ax=ax[0], )
all_data.loc[all_data['poi'] == "False", "stock_salary_fraction"].plot(kind='hist', ax=ax[1])
display(HTML("<center>{0}</center>".format(all_data.sort_values("stock_salary_fraction", ascending=False)[["stock_salary_fraction", "poi"]].head(10).to_html())))
To see how the POIs are distributed per feature, we can look at the percentage of POIs among the non-NaN values of each feature:
for f in all_data.columns:
if f in ["poi", "restricted_stock_deferred", "director_fees", "email_address"]:
continue
print "Percent of values as POI for feature: {1:>40s} => {0:2.1f}% x {2:3.0f} = {3:2.0f}".format(np.sum([p=="True" for p in all_data[[f, 'poi']].dropna().poi]) / float(len(all_data[[f, 'poi']].dropna().poi)) * 100, f, len(all_data[[f, "poi"]].dropna().poi), np.sum([p=="True" for p in all_data[[f, 'poi']].dropna().poi]))
Now that I've removed some features, fixed some rows, and removed some other rows, I want to draw a scatter matrix to see if a visualization of the feature-feature correlations can give us some more information on the dataset.
Because there are so many features, and the feature values are of different orders of magnitude, we first need to drop features that are highly correlated with many others, and then standardize the data, i.e. transform it to zero mean and unit variance, to make visual inspection easier.
import seaborn as sns; sns.set(style="ticks", color_codes=True)
scatter_plot_df = data_df.copy()
data_df_standardized = (scatter_plot_df - scatter_plot_df.mean()) / scatter_plot_df.std()
data_df_standardized_no_outliers = data_df_standardized[np.abs(data_df_standardized-data_df_standardized.mean())<=(3*data_df_standardized.std())]
data_df_standardized_no_outliers_no_correlated_features = data_df_standardized_no_outliers.drop(["total_stock_value", "total_payments", "stock_salary_fraction", "restricted_stock_deferred", "loan_advances", "director_fees", "deferral_payments"], axis=1).dropna(how='all').dropna(how='all', axis=1)
data_df_standardized_no_outliers_no_correlated_features['poi'] = label_df
data_df_standardized_no_outliers_no_correlated_features.fillna(data_df_standardized_no_outliers_no_correlated_features.median(), inplace=True)
#g = sns.pairplot(data_df_standardized_no_outliers_no_correlated_features,
# diag_kind="kde",
# hue="poi",
# palette="husl",
# diag_kws=dict(shade=True))
g = sns.PairGrid(data_df_standardized_no_outliers_no_correlated_features, hue="poi", palette='husl')
g.map_upper(plt.scatter)
g.map_diag(plt.hist, alpha=0.5)
g.map_lower(sns.kdeplot)
for ax in g.axes.flat:
plt.setp(ax.get_xticklabels(), rotation=45)
g.add_legend()
g.set(alpha=0.5)
Even though this is a nice visualization, nothing really stands out. From the diagonal it is obvious that there are differences in the distributions of feature values for POIs and non-POIs, but the differences are subtle, and it is therefore difficult to exploit them directly.
After going through all this data with the initial assumption that "-" represented NaN, I questioned that notion: what if the dashes don't represent NaN, but rather just missing values? Looking at the insider pay pdf file, it seems more plausible and informative that these dashes simply represent values that either weren't reported or couldn't be retrieved. Furthermore, if we do accept that the dashes represent missing values, it gives us more data to work with.
Now that I've gone through the data, I would like to get into the nitty-gritty details of working on this dataset, with the explicit task in mind, that I would like to build the best predictor possible.
I'll start by showing the entire code, which was used as the hand in for the Nanodegree project, and afterwards I'll go through all the steps.
#!/usr/bin/python
import matplotlib
matplotlib.use("TkAgg")
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
plt.style.use("ggplot")
import sys
import pickle
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data
from sklearn.naive_bayes import GaussianNB
from sklearn import tree, preprocessing, cross_validation, feature_selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import f1_score, confusion_matrix, precision_score, \
recall_score, accuracy_score, cohen_kappa_score, \
make_scorer, precision_recall_fscore_support
from sklearn.cross_validation import cross_val_score, StratifiedShuffleSplit, train_test_split
from sklearn.feature_selection import VarianceThreshold, SelectKBest
### To use Cohen's Kappa Score as a metric for the cross validation function
### I need to convert it to the right format, whence I can use the make_scorer
### function of the sklearn.metrics module
kappa_scorer = make_scorer(cohen_kappa_score)
### Since F-score weights precision and recall equally, I'm interested in seeing
### how the precision-recall score would change by using the F2 and F0.5 score.
def f05(y_true, y_pred):
prec, rec, fbeta, sup = precision_recall_fscore_support(y_true, y_pred, beta=0.5)
return fbeta[0]
f05_scorer = make_scorer(f05)
### Task 1: Select features
features_list = ['poi',
'salary', ##
'to_messages',
#'deferral_payments',
#'total_payments',
'exercised_stock_options', ###
'bonus', ###
'restricted_stock',
'total_stock_value', ###
'expenses',
'from_messages',
'other',
'deferred_income',
'long_term_incentive', #
'stock_salary_fraction',
]
### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "r") as data_file:
data_dict = pickle.load(data_file)
### Task 2: Remove outliers
### From EDA I found that there is a key in the data dict named TOTAL,
### which is the sum of all the information in the data dictionary.
### Furthermore there is also a key called THE TRAVEL AGENCY IN THE PARK, which
### isn't a person.
### These are therefore removed
data_dict.pop('TOTAL')
data_dict.pop('THE TRAVEL AGENCY IN THE PARK')
data_df = pd.DataFrame(data_dict).T
data_df = data_df.convert_objects(convert_numeric=True)
### The information on BHATNAGAR SANJAY IS WRONG! MATCH INFO WITH FINANCIAL SHEET!
data_df.loc['BHATNAGAR SANJAY', 'expenses'], data_df.loc['BHATNAGAR SANJAY', 'other'] = data_df.loc['BHATNAGAR SANJAY', 'other'], np.nan
data_df.loc['BHATNAGAR SANJAY', 'total_payments'], data_df.loc['BHATNAGAR SANJAY', 'director_fees'] = data_df.loc['BHATNAGAR SANJAY', 'director_fees'], np.nan
data_df.loc['BHATNAGAR SANJAY', 'exercised_stock_options'], data_df.loc['BHATNAGAR SANJAY', 'restricted_stock'] = data_df.loc['BHATNAGAR SANJAY', 'restricted_stock_deferred'], -data_df.loc['BHATNAGAR SANJAY', 'restricted_stock']
data_df.loc['BHATNAGAR SANJAY', 'restricted_stock_deferred'], data_df.loc['BHATNAGAR SANJAY', 'total_stock_value'] = -data_df.loc['BHATNAGAR SANJAY', 'restricted_stock'], data_df.loc['BHATNAGAR SANJAY', 'exercised_stock_options']
### The information on BELFER ROBERT IS WRONG! MATCH INFO WITH FINANCIAL SHEET!
data_df.loc['BELFER ROBERT', 'expenses'] = data_df.loc['BELFER ROBERT', 'exercised_stock_options']
data_df.loc['BELFER ROBERT', 'exercised_stock_options'] = np.nan
data_df.loc['BELFER ROBERT', 'director_fees'] = data_df.loc['BELFER ROBERT', 'total_payments']
data_df.loc['BELFER ROBERT', 'total_payments'] = data_df.loc['BELFER ROBERT', 'expenses']
data_df.loc['BELFER ROBERT', 'deferred_income'] = data_df.loc['BELFER ROBERT', 'deferral_payments']
data_df.loc['BELFER ROBERT', 'deferral_payments'] = np.nan
data_df.loc['BELFER ROBERT', 'restricted_stock'] = data_df.loc['BELFER ROBERT', 'restricted_stock_deferred']
data_df.loc['BELFER ROBERT', 'restricted_stock_deferred'] = -data_df.loc['BELFER ROBERT', 'restricted_stock_deferred']
data_df.loc['BELFER ROBERT', 'total_stock_value'] = np.nan
### Create new feature
data_df['stock_salary_fraction'] = [tsv / s if s > 10000 else 0 for tsv, s in zip(data_df['total_stock_value'].values, data_df['salary'].values)]
### Replace "NaN" strings with numpy nans and transform the data back in to
### dictionary form, so that it can work with the featureFormat function
data_df = data_df.replace("NaN", np.nan)
data_dict = data_df.T.to_dict()
### Task 3: Create new feature(s)
### Generate features from the email corpus. Use the vectorize text,
### and email data frame to assign names to features.
### Store to my_dataset for easy export below.
my_dataset = data_dict
### Extract features and labels from dataset for local testing
all_keys = my_dataset.keys()
data = featureFormat(my_dataset, features_list, remove_all_zeroes=False, sort_keys = True)
labels, features = targetFeatureSplit(data)
### Task 4: Try a varity of classifiers
# svr = GaussianNB()
# svr = tree.DecisionTreeClassifier()
# svr = RandomForestClassifier(n_estimators=50, class_weight={1: 8}, max_depth=6)
svr = SVC(class_weight={1: 8})
### Create a classifier list, so that we can visualize classification decision boundaries
pipe = Pipeline([
('feature_imputing', preprocessing.Imputer(missing_values='NaN', strategy='median', axis=0)),
('feature_scaling', preprocessing.StandardScaler()),
('feature_selection', SelectKBest()),
#('feature_selection_variance_threshold', VarianceThreshold(threshold=(.8 * (1 - .8)))),
#('feature_scaling', preprocessing.MinMaxScaler()),
('feature_reduction', PCA()),
('classifier', svr)])
clf = svr
### Task 5: Tune your classifier to achieve better than .3 precision and recall
### using our testing script.
cv = StratifiedShuffleSplit(labels, 50, random_state = 42)
search_params = {}
search_params['classifier__C'] = [1, 10]
search_params['classifier__gamma'] = [0.01, 0.001]
search_params['classifier__kernel'] = ['linear', 'rbf']
search_params['feature_selection__k'] = range(4,len(features_list))
search_params['feature_reduction__n_components'] = range(2, 5)
search_params['feature_reduction__whiten'] = [True, False]
grid_params = [search_params]
#estimator = GridSearchCV(pipe, grid_params, cv=cv, scoring="f1")
#estimator = GridSearchCV(pipe, grid_params, cv=cv, scoring=f05_scorer)
estimator = GridSearchCV(pipe, grid_params, cv=cv, scoring=kappa_scorer)
estimator.fit(features, labels)
clf = estimator.best_estimator_
from tester import *
test_classifier(clf, my_dataset, features_list)
### Task 6: Dump classifier, dataset, and features_list so anyone can
### check your results. You do not need to change anything below, but make sure
### that the version of poi_id.py that you submit can be run on its own and
### generates the necessary .pkl files for validating your results.
dump_classifier_and_data(clf, my_dataset, features_list)
#!/usr/bin/python
import matplotlib
matplotlib.use("TkAgg")
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
plt.style.use("ggplot")
import sys
import pickle
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data
from sklearn.naive_bayes import GaussianNB
from sklearn import tree, preprocessing, cross_validation, feature_selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import f1_score, confusion_matrix, precision_score, \
recall_score, accuracy_score, cohen_kappa_score, \
make_scorer, precision_recall_fscore_support
from sklearn.cross_validation import cross_val_score, StratifiedShuffleSplit, train_test_split
from sklearn.feature_selection import VarianceThreshold, SelectKBest
### To use Cohen's Kappa Score as a metric for the cross validation function
### I need to convert it to the right format, whence I can use the make_scorer
### function of the sklearn.metrics module
kappa_scorer = make_scorer(cohen_kappa_score)
### Since F-score weights precision and recall equally, I'm interested in seeing
### how the precision-recall score would change by using the F2 and F0.5 score.
def f05(y_true, y_pred):
prec, rec, fbeta, sup = precision_recall_fscore_support(y_true, y_pred, beta=0.5)
return fbeta[0]
f05_scorer = make_scorer(f05)
I start by importing the modules I need. For doing machine learning in python, sklearn is the go-to library; to read more about it, see [scikit-learn]. From sklearn I import quite a lot of functions, many of which aren't used in the final setup, but since they were used at some point in my trial-and-error approach to getting this up and running, I'll go through each of them where they were initially used.
### Task 1: Select features
features_list = ['poi',
'salary',
'to_messages',
'exercised_stock_options',
'bonus',
'restricted_stock',
'total_stock_value',
'expenses',
'from_messages',
'other',
'deferred_income',
'long_term_incentive',
'stock_salary_fraction',
]
### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "r") as data_file:
data_dict = pickle.load(data_file)
Of the initial 20 features, I ended up using these 12 features. The financial data had also been converted to a pickled object, so loading the data was simple.
### Task 2: Remove outliers
### From EDA I found that there is a key in the data dict named TOTAL,
### which is the sum of all the information in the data dictionary.
### Furthermore there is also a key called THE TRAVEL AGENCY IN THE PARK, which
### isn't a person.
### These are therefore removed
data_dict.pop('TOTAL')
data_dict.pop('THE TRAVEL AGENCY IN THE PARK')
data_df = pd.DataFrame(data_dict).T
data_df = data_df.convert_objects(convert_numeric=True)
### The information on BHATNAGAR SANJAY IS WRONG! MATCH INFO WITH FINANCIAL SHEET!
data_df.loc['BHATNAGAR SANJAY', 'expenses'], data_df.loc['BHATNAGAR SANJAY', 'other'] = data_df.loc['BHATNAGAR SANJAY', 'other'], np.nan
data_df.loc['BHATNAGAR SANJAY', 'total_payments'], data_df.loc['BHATNAGAR SANJAY', 'director_fees'] = data_df.loc['BHATNAGAR SANJAY', 'director_fees'], np.nan
data_df.loc['BHATNAGAR SANJAY', 'exercised_stock_options'], data_df.loc['BHATNAGAR SANJAY', 'restricted_stock'] = data_df.loc['BHATNAGAR SANJAY', 'restricted_stock_deferred'], -data_df.loc['BHATNAGAR SANJAY', 'restricted_stock']
data_df.loc['BHATNAGAR SANJAY', 'restricted_stock_deferred'], data_df.loc['BHATNAGAR SANJAY', 'total_stock_value'] = -data_df.loc['BHATNAGAR SANJAY', 'restricted_stock'], data_df.loc['BHATNAGAR SANJAY', 'exercised_stock_options']
### The information on BELFER ROBERT IS WRONG! MATCH INFO WITH FINANCIAL SHEET!
data_df.loc['BELFER ROBERT', 'expenses'] = data_df.loc['BELFER ROBERT', 'exercised_stock_options']
data_df.loc['BELFER ROBERT', 'exercised_stock_options'] = np.nan
data_df.loc['BELFER ROBERT', 'director_fees'] = data_df.loc['BELFER ROBERT', 'total_payments']
data_df.loc['BELFER ROBERT', 'total_payments'] = data_df.loc['BELFER ROBERT', 'expenses']
data_df.loc['BELFER ROBERT', 'deferred_income'] = data_df.loc['BELFER ROBERT', 'deferral_payments']
data_df.loc['BELFER ROBERT', 'deferral_payments'] = np.nan
data_df.loc['BELFER ROBERT', 'restricted_stock'] = data_df.loc['BELFER ROBERT', 'restricted_stock_deferred']
data_df.loc['BELFER ROBERT', 'restricted_stock_deferred'] = -data_df.loc['BELFER ROBERT', 'restricted_stock_deferred']
data_df.loc['BELFER ROBERT', 'total_stock_value'] = np.nan
As I stated in the exploratory data analysis, I remove the "TOTAL" row and the "THE TRAVEL AGENCY IN THE PARK" row. Furthermore, I correct the information for the two incorrectly loaded persons, Sanjay Bhatnagar and Robert Belfer.
### Task 3: Create new feature(s)
data_df['stock_salary_fraction'] = [tsv / s if s > 10000 else 0 for tsv, s in zip(data_df['total_stock_value'].values, data_df['salary'].values)]
### TODO: GENERATE EMAIL FEATURES:
### Generate features from the email corpus. Use the vectorize text,
### and email data frame to assign names to features.
### Store to my_dataset for easy export below.
### Replace "NaN" strings with numpy nans and transform the data back in to
### dictionary form, so that it can work with the featureFormat function
data_df = data_df.replace("NaN", np.nan)
data_dict = data_df.T.to_dict()
my_dataset = data_dict
### Extract features and labels from dataset for local testing
all_keys = my_dataset.keys()
data = featureFormat(my_dataset, features_list, remove_all_zeroes=False, sort_keys = True)
labels, features = targetFeatureSplit(data)
I start by creating the feature "stock_salary_fraction". Thereafter I split the data into labels and features, as the data structure also contains the poi information. As you can see, there is a TODO left: I didn't manage to find time to generate features based on the email corpus, but when I do, I will return to the data and post an update.
### Task 4: Try a varity of classifiers
# classifier = GaussianNB()
# classifier = tree.DecisionTreeClassifier()
# classifier = RandomForestClassifier(n_estimators=50, class_weight={1: 8}, max_depth=6)
classifier = SVC(class_weight={1: 8})
### Create a classifier list, so that we can visualize classification decision boundaries
pipe = Pipeline([
('feature_imputing', preprocessing.Imputer(missing_values='NaN', strategy='median', axis=0)),
('feature_scaling', preprocessing.StandardScaler()),
('feature_selection', SelectKBest()),
#('feature_selection_variance_threshold', VarianceThreshold(threshold=(.8 * (1 - .8)))),
#('feature_scaling', preprocessing.MinMaxScaler()),
('feature_reduction', PCA()),
('classifier', classifier)])
This is where the fun begins. This very short section contains quite a lot of information. First, I define a classifier. I tested Gaussian Naive Bayes, a Decision Tree classifier, a Random Forest classifier, and (the winner) a Support Vector Machine classifier. Last, I define my pipeline.
In a moment I will describe how this Pipeline is used, and what the various elements of the pipeline are. But first, I'll give a short introduction to the classifiers used in this project.
Gaussian Naive-Bayes The Gaussian Naive-Bayes algorithm is a probabilistic classification algorithm based on Bayes' Theorem. The naive part of the algorithm comes from its assumption of strong independence between the features.
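For reference (this is the standard formulation, not something specific to this project), the classifier picks the class $y$ with the largest posterior; under the independence assumption the posterior factorizes, and each feature likelihood is modeled as a Gaussian whose per-class mean $\mu_{iy}$ and variance $\sigma_{iy}^2$ are estimated from the training data:
$$ P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y), \qquad P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{iy}^2}} \exp\!\left(-\frac{(x_i - \mu_{iy})^2}{2\sigma_{iy}^2}\right) $$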
Decision Tree The Decision Tree algorithm is precisely what it sounds like. This classification algorithm is based on the very simple idea of generating a decision tree which separates the classes through successive splits of the features. Tree-based classifiers are generally fast and quite robust, but sometimes suffer from the fact that they split on a per-feature basis, i.e. in a two-dimensional dataset the cuts correspond to vertical and horizontal lines through the plane.
Random Forest Classifier The Random Forest classifier is an ensemble-based algorithm. It works by generating a number of decision trees and training each of them separately. When predicting, the Random Forest classifier uses the most frequent prediction from its "forest" of decision trees.
Support Vector Machines The Support Vector Machine algorithm works by finding the best possible split through the feature space. In a two-class problem in two dimensions, this corresponds to finding the line through the plane which best splits it into a class-1 side and a class-2 side. Since many lines can split a two-dimensional plane into two distinct classification sides, the Support Vector Machine adds the criterion that it maximizes the margin between the line and the closest points on each side. One of the strongest points of Support Vector Machines is the use of the so-called kernel trick: in short, the kernel implicitly expands the data into more dimensions, so that a split can be made in this higher-dimensional space.
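For reference (standard SVM formulation, not specific to this project), the RBF kernel used in the grid search below and the resulting decision function are:
$$ K(x, x') = \exp\!\left(-\gamma \lVert x - x' \rVert^2\right), \qquad f(x) = \operatorname{sign}\!\left(\sum_{i} \alpha_i y_i K(x_i, x) + b\right) $$
where the sum runs over the support vectors, $\gamma$ controls the width of the kernel (it appears as classifier__gamma in the grid search), and $C$ controls how heavily margin violations are penalized.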
Results from training
| Metric \ Algorithm | GaussianNB | Decision Tree | Random Forest | Support Vector Machines |
|---|---|---|---|---|
| Precision | 0.4628 | 0.2793 | 0.2889 | 0.4324 |
| Recall | 0.3390 | 0.5940 | 0.5440 | 0.4250 |
| F1 | 0.3913 | 0.3799 | 0.3774 | 0.4286 |
| Cohen's Kappa | 0.3646 | 0.3085 | 0.3243 | 0.3569 |
| Accuracy | 0.8594 | 0.7415 | 0.7607 | 0.8489 |
One thing that pops out in my use of the SVC class is the argument class_weight={1: 8}. The reason I fix this input instead of adding it as a parameter to my grid search is that I know the dataset is imbalanced: only about 1 in 8 people is a POI. To give the POIs equal weight I would either need to downsample the bigger class (the non-POIs), but my dataset is far too small for that, or upsample the POIs, or, as a third option, weight the classes differently. I chose to weight the classes differently, as this is already implemented in sklearn.
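As a rough sketch of where the factor 8 comes from and what class_weight does (assuming the labels list produced by targetFeatureSplit earlier is still in scope): the misclassification penalty C is scaled per class, so an error on a POI costs roughly eight times as much as an error on a non-POI.

import numpy as np
from sklearn.svm import SVC

# labels is the 0/1 poi vector produced by targetFeatureSplit above
labels_arr = np.array(labels)
n_poi = labels_arr.sum()
n_non_poi = len(labels_arr) - n_poi
print(n_non_poi / float(n_poi))  # roughly 8 non-POIs per POI in this dataset

# class_weight scales the error penalty C per class, so mistakes on POIs cost ~8x more
clf_weighted = SVC(class_weight={1: 8})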
To see what the decision boundaries of the various classifiers look like, I visualized them, along with a few additional classifiers I was curious about.
from sklearn.naive_bayes import GaussianNB
from sklearn import tree, preprocessing
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
names = ["Gaussian Naive Bayes",
         "Decision Tree",
         "Random Forest",
         "Support Vector Machine",
         "K Nearest Neighbors",
         "Neural Network",
         "Ada Boost"]
classifiers = [GaussianNB(),
tree.DecisionTreeClassifier(),
RandomForestClassifier(class_weight={1:8}),
SVC(C=10, gamma=0.001, class_weight={1:8}),
KNeighborsClassifier(3),
MLPClassifier(),
AdaBoostClassifier()]
h = 0.2
from matplotlib.colors import ListedColormap
### impute, standardize, and reduce the number of features
labels, features = targetFeatureSplit(data)
features = preprocessing.Imputer(missing_values="NaN", strategy='median', axis=0).fit_transform(features)
features = preprocessing.StandardScaler().fit_transform(features)
features = SelectKBest(k=7).fit_transform(features, labels)
features = PCA(n_components=2).fit_transform(features, labels)
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=.3, random_state=42, stratify=labels)
x_min, x_max = features[:, 0].min() - .5, features[:, 0].max() + .5
y_min, y_max = features[:, 1].min() - .5, features[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
# just plot the dataset first
#cm = plt.cm.RdBu
cm_bright = ListedColormap(['#FF0000', '#0000FF'])
cm_bright = plt.cm.plasma
cm = plt.cm.plasma
fig, ax = plt.subplots(1, len(classifiers) + 1, figsize=(60, 10))
ax[0].set_title("Input data", size=55)
# Plot the training points
ax[0].scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=160, cmap=cm_bright)
# and testing points
ax[0].scatter(X_test[:, 0], X_test[:, 1], c=y_test, s=160, cmap=cm_bright, alpha=0.6)
ax[0].set_xlim(xx.min(), xx.max())
ax[0].set_ylim(yy.min(), yy.max())
ax[0].set_xticks(())
ax[0].set_yticks(())
i = 1
# iterate over classifiers
for name, clf in zip(names, classifiers):
clf.fit(X_train, y_train)
prediction = clf.predict(X_test)
score = cohen_kappa_score(prediction, y_test)
accuracy = accuracy_score(prediction, y_test)
f1 = f1_score(prediction, y_test)
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
if hasattr(clf, "decision_function"):
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
else:
Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
# Put the result into a color plot
Z = Z.reshape(xx.shape)
ax[i].contourf(xx, yy, Z, cmap=cm, alpha=.8)
# Plot also the training points
ax[i].scatter(X_train[:, 0], X_train[:, 1], s=160, c=y_train, cmap=cm_bright)
# and testing points
ax[i].scatter(X_test[:, 0], X_test[:, 1], s=160, c=y_test, cmap=cm_bright,
alpha=0.6)
ax[i].set_xlim(xx.min(), xx.max())
ax[i].set_ylim(yy.min(), yy.max())
ax[i].set_xticks(())
ax[i].set_yticks(())
ax[i].set_title(name, size=40)
ax[i].text(xx.max() - .3, yy.min() - .55, ("C.Kappa = %.2f" % score).lstrip('0'),
size=40, horizontalalignment='right')
ax[i].text(xx.max() - .3, yy.min() - 1.15, ("F1 = %.2f" % f1).lstrip('0'),
size=40, horizontalalignment='right')
ax[i].text(xx.max() - .3, yy.min() - 1.75, ("Accuracy = %.2f" % accuracy).lstrip('0'),
size=40, horizontalalignment='right')
i += 1
plt.tight_layout()
plt.show()
Now let's move on to describing the use of the Pipeline and why these feature imputing / scaling / selection / reduction steps were chosen as opposed to others.
First of all, the Pipeline in sklearn gives me a structured way of processing the data, both for training and testing. By creating this pipeline, I am certain that when I test the performance of my method, the test data will undergo exactly the same transformations as the training data did.
Now, as for the ordering of the steps, this is very important.
Feature Imputing: Feature imputing is the process of replacing missing values with substituted ones. The financial dataset contained many data points marked with dashes, which were read as NaN. Since there was so little data to begin with, this was an issue, because the classification algorithms in sklearn don't handle NaN well, if at all. The simplest way to solve this problem is to fill all the NaNs with zeros, but that would distort the distribution of the data, since not all features are distributed around 0. To the rescue comes feature imputing. The Imputer function has various options one can specify, and in the finished product I used the median imputing strategy along the zeroth axis. This just means that if a value is missing for a person, it is filled with the median of that feature. I also tried the mean, and "stupidly" the first axis as well, but the best results came from using the median.
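A minimal illustration of the median strategy on a made-up 3x2 array (not Enron data): each NaN is replaced by the median of its column, i.e. of its feature.

import numpy as np
from sklearn.preprocessing import Imputer  # SimpleImputer in newer sklearn versions

X = np.array([[1.0, 200.0],
              [np.nan, 400.0],
              [3.0, np.nan]])

imp = Imputer(missing_values='NaN', strategy='median', axis=0)
print(imp.fit_transform(X))
# [[   1.  200.]
#  [   2.  400.]
#  [   3.  300.]]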
Feature Scaling: Feature scaling is one of the most important processes in machine learning, and often in data visualization and data handling in general. It is the process of rescaling the feature values. This can be done in a number of ways, but the most frequently used scalers are min-max scaling and standard scaling. Min-max scaling simply rescales all the values of a feature to lie between a chosen minimum and maximum, typically 0 and 1. Standard scaling, probably the most common way of scaling data, standardizes the data, i.e. transforms each feature to zero mean and unit variance. Both scalers were tested, and the StandardScaler was found to give superior results.
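The two scalers differ only in the transformation they apply per feature; a quick sketch on a toy column (made-up numbers) makes the difference visible.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[100.0], [200.0], [400.0], [1000.0]])

print(MinMaxScaler().fit_transform(x).ravel())    # squeezed into [0, 1]
print(StandardScaler().fit_transform(x).ravel())  # zero mean, unit variance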
Feature Selection: Feature selection can be done manually, but it can also be done automatically. There are many ways to do feature selection, and some machine learning algorithms even expose feature importances as an attribute, so they can themselves be used as feature selection algorithms. Many of the feature selection methods implemented in sklearn, like SelectKBest and SelectPercentile, work by running a univariate test and then selecting the $k$ best performing features or the best $p$ percent of features. Another way to do feature selection is to filter out features with too low a variance; for this there is VarianceThreshold, which removes features with a variance below a certain threshold. Again, the tests were done in multiple ways, but I found that using SelectKBest between the feature scaling and the feature reduction worked best. NOTE: I can't quite wrap my head around why I get better performance when using SelectKBest before PCA and not the other way around. It might be because the principal components are worse off when PCA sees all 12 initial features instead of the reduced set I get after SelectKBest.
features_list = ["poi"] + [k for k in data_df.keys() if k != "email_address" and k != "poi"]
only_features = features_list[:]
only_features.remove("poi")
data_df = pd.DataFrame(data_dict).T.fillna(0)
data_set = data_df.T.to_dict()
data = featureFormat(data_set, features_list, remove_all_zeroes=False, sort_keys = True)
labels, features = targetFeatureSplit(data)
kbest_fit = estimator.estimator.steps[2][1].set_params(k='all').fit(features, labels)
feature_scores_list = sorted([(f, s) for f, s in zip(only_features, kbest_fit.scores_)], key=lambda x: x[1], reverse=True)
for f, s in feature_scores_list:
print("{2}{0:<30s} = {1:>5.2f}".format(f, s, " " * 35))
features_list = ["poi"] + [k for k in data_df.keys() if k != "email_address" and k != "poi"]
only_features = features_list[:]
only_features.remove("poi")
data = featureFormat(my_dataset, features_list, remove_all_zeroes=False, sort_keys = True)
labels, features = targetFeatureSplit(data)
features = preprocessing.Imputer(missing_values="NaN", strategy='median', axis=0).fit_transform(features)
kbest_fit = estimator.estimator.steps[2][1].set_params(k='all').fit(features, labels)
feature_scores_list = sorted([(f, s) for f, s in zip(only_features, kbest_fit.scores_)], key=lambda x: x[1], reverse=True)
for f, s in feature_scores_list:
print("{2}{0:<30s} = {1:>5.2f}".format(f, s, " " * 35))
Feature Reduction: Feature or dimensionality reduction is another way to reduce the complexity of the training procedure, as well as a way to decrease the inter-feature correlation and concentrate the variance in fewer components. One widely used feature reduction algorithm is Principal Component Analysis (PCA). It works by converting the set of possibly correlated features into a set of linearly uncorrelated principal components. PCA was the only dimensionality reduction algorithm tested.
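A short sketch of how one can inspect what PCA keeps: explained_variance_ratio_ shows how much of the total variance each principal component carries. The data here is made up (three correlated toy features plus one independent one), standardized first as in the pipeline.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
base = rng.normal(size=(100, 1))
X = np.hstack([base,
               2 * base + rng.normal(scale=0.1, size=(100, 1)),
               -base + rng.normal(scale=0.1, size=(100, 1)),
               rng.normal(size=(100, 1))])

pca = PCA(n_components=4)
pca.fit(StandardScaler().fit_transform(X))
print(pca.explained_variance_ratio_)  # most of the variance lands on the first component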
### Task 5: Tune your classifier to achieve better than .3 precision and recall
### using our testing script.
cv = StratifiedShuffleSplit(labels, 50, random_state = 42)
search_params = {}
search_params['classifier__C'] = [1, 10]
search_params['classifier__gamma'] = [0.01, 0.001]
search_params['classifier__kernel'] = ['linear', 'rbf']
search_params['feature_selection__k'] = range(4,len(features_list))
search_params['feature_reduction__n_components'] = range(2, 5)
search_params['feature_reduction__whiten'] = [True, False]
grid_params = [search_params]
#estimator = GridSearchCV(pipe, grid_params, cv=cv, scoring="f1")
#estimator = GridSearchCV(pipe, grid_params, cv=cv, scoring=f05_scorer)
estimator = GridSearchCV(pipe, grid_params, cv=cv, scoring=kappa_scorer)
estimator.fit(features, labels)
clf = estimator.best_estimator_
from tester import *
test_classifier(clf, my_dataset, features_list)
In this section, I define my training method and fit my pipeline to the cross-validation splits. I start by creating the variable cv, which is a cross-validation object. The StratifiedShuffleSplit function creates training and test sets split in such a way that each split contains a similar proportion of each class. Furthermore, it allows you to generate many training/test splits, in my case 50, with the same random seed, allowing for easy recomputation so that reported results can be verified. Another way to split a dataset is train_test_split, but that function only creates a single training set and test set.
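To see the stratification at work, one can count POIs in each generated split; every fold keeps roughly the same class balance as the full label vector. This is a sketch using the same (old) sklearn.cross_validation API as the rest of the code, and it assumes the labels list from targetFeatureSplit is still in scope.

import numpy as np
from sklearn.cross_validation import StratifiedShuffleSplit

labels_arr = np.array(labels)  # 0/1 poi labels from targetFeatureSplit above
cv_demo = StratifiedShuffleSplit(labels_arr, 3, test_size=0.3, random_state=42)

for train_idx, test_idx in cv_demo:
    # the POI fraction is roughly preserved in both halves of every split
    print("train POI fraction: {0:.2f}, test POI fraction: {1:.2f}".format(
        labels_arr[train_idx].mean(), labels_arr[test_idx].mean()))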
Right after the definition of my cross-validation object, I define a set of search parameters, which are used by the GridSearchCV function to try to find the optimal parameter settings for the pipeline. I plug the pipeline, grid parameters, cross-validation object, and scoring method into GridSearchCV and fit it to the features and labels. As one can see above, I tested three different scoring functions, namely F1, F0.5, and Cohen's Kappa, and found that Cohen's Kappa was slightly better than the F1 scorer.
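Once the search has been fit, a quick way to inspect which parameter combination it settled on (and its mean cross-validated score) is to read the attributes of the fitted GridSearchCV object:

# which parameter combination won, and its mean cross-validated kappa score
print(estimator.best_params_)
print(estimator.best_score_)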
When the estimator has been fit, the best estimator is selected and run through test_classifier, which reports the results.
### Task 6: Dump classifier, dataset, and features_list so anyone can
### check your results. You do not need to change anything below, but make sure
### that the version of poi_id.py that you submit can be run on its own and
### generates the necessary .pkl files for validating your results.
dump_classifier_and_data(clf, my_dataset, features_list)
To enable checking of my reported results, I dump the classifier, dataset and features list into .pkl object files.
This has been a very lengthy process, but I feel there has been a lot of progress. In the first draft, I reported a result of around 0.42 for both precision and recall. I am now reporting about the same values, but I do so confident that my new pipeline doesn't include any data-leaking features, nor does it lack validation, parameter tuning, or data cleaning, scaling, and reduction. Using a pipeline with feature imputing, scaling, selection, reduction, and classification, I managed to get an F1 score of around 0.43, with precision and recall also around 0.43. Without using any of the data-leaking poi features, I found it quite difficult to reach an F1 score this high without extensive parameter tuning, so I am somewhat satisfied with the results. The one thing that still bothers me is that I haven't done any feature creation based on the email corpus, and I have a nagging hunch that this would increase my classification score considerably.
I will return with an update when I've parsed the email corpus.