Python: Making scikit-learn and Pandas Play Nice
In the last post, about Nathan's and my attempts at the Kaggle Titanic problem, I mentioned that our next step was to try out scikit-learn, so I thought I should summarize where we've got to. We needed to write a classification algorithm to work out whether a person on board the Titanic survived and, luckily, scikit-learn has extensive documentation on each of its algorithms. Unfortunately almost all of those examples use numpy data structures, and we'd loaded the data using pandas and didn't particularly want to switch back! Luckily it was really easy to get the data into numpy format by calling 'values' on a pandas data structure, something we learnt from a reply on Stack Overflow.

For example, if we wanted to wire up an ExtraTreesClassifier which worked out survival rate based on the 'Fare' and 'Pclass' attributes, we could write the following code:

import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.cross_validation import cross_val_score

train_df = pd.read_csv("train.csv")

et = ExtraTreesClassifier(n_estimators=100, max_depth=None, min_samples_split=1, random_state=0)

columns = ["Fare", "Pclass"]

labels = train_df["Survived"].values
features = train_df[list(columns)].values

et_score = cross_val_score(et, features, labels, n_jobs=-1).mean()
print("{0} -> ET: {1}".format(columns, et_score))

To start with we read in the CSV file, which looks like this:

$ head -n5 train.csv
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S

Next we create our classifier, which "fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and use averaging to improve the predictive accuracy and control over-fitting", i.e. a better version of a random forest.

On the next line we describe the features we want the classifier to use, then we convert the labels and features into numpy format so we can pass them to the classifier. Finally we call the cross_val_score function, which splits our training data set into training and test components, trains the classifier against the former and checks its accuracy using the latter.

If we run this code we'll get roughly the following output:

$ python et.py
['Fare', 'Pclass'] -> ET: 0.687991021324

This is actually worse accuracy than we'd get with the simple rule that every female survived and every male didn't.
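As a sanity check, we can compute that gender-only baseline directly from the training data frame. A minimal sketch (at this point the 'Sex' column still holds the raw 'male'/'female' strings):

# Predict survival for every female and death for every male,
# then measure how often that rule matches the actual labels
baseline = (train_df["Sex"] == "female").astype(int)
print((baseline == train_df["Survived"]).mean())  # comes out around 0.79 on this training set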
We can introduce 'Sex' into the classifier by adding it to the list of columns:

columns = ["Fare", "Pclass", "Sex"]

If we re-run the code we'll get the following error:

$ python et.py
An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (514, 0))
...
Traceback (most recent call last):
  File "et.py", line 14, in <module>
    et_score = cross_val_score(et, features, labels, n_jobs=-1).mean()
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/cross_validation.py", line 1152, in cross_val_score
    for train, test in cv)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/externals/joblib/parallel.py", line 519, in __call__
    self.retrieve()
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/externals/joblib/parallel.py", line 450, in retrieve
    raise exception_type(report)
sklearn.externals.joblib.my_exceptions.JoblibValueError
/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/externals/joblib/my_exceptions.py:26: DeprecationWarning: BaseException.message has been deprecated as of Python 2.6
  self.message,
: JoblibValueError
___________________________________________________________________________
Multiprocessing exception:
...
ValueError: could not convert string to float: male
___________________________________________________________________________

This is a slightly verbose way of telling us that we can't pass non-numeric features to the classifier - in this case 'Sex' has the values 'female' and 'male'. We'll need to replace those values with numeric equivalents:

train_df["Sex"] = train_df["Sex"].apply(lambda sex: 0 if sex == "male" else 1)

Now if we re-run the classifier we'll get a slightly more accurate prediction:

$ python et.py
['Fare', 'Pclass', 'Sex'] -> ET: 0.813692480359

The next step is to use the classifier against the test data set, so let's load the data and run the prediction:

test_df = pd.read_csv("test.csv")

et.fit(features, labels)
et.predict(test_df[columns].values)

Now if we run that:

$ python et.py
['Fare', 'Pclass', 'Sex'] -> ET: 0.813692480359
Traceback (most recent call last):
  File "et.py", line 22, in <module>
    et.predict(test_df[columns].values)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/ensemble/forest.py", line 444, in predict
    proba = self.predict_proba(X)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/ensemble/forest.py", line 479, in predict_proba
    X = array2d(X, dtype=DTYPE)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 91, in array2d
    X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/numeric.py", line 235, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: male

This is the same problem we had earlier! We need to replace the 'male' and 'female' values in the test set too, so we'll pull out a function to do that:

def replace_non_numeric(df):
    df["Sex"] = df["Sex"].apply(lambda sex: 0 if sex == "male" else 1)
    return df
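As an aside, the same substitution can also be written with pandas' Series.map, which some may find clearer than the lambda. A sketch of an equivalent version (like the lambda, it assumes 'Sex' only ever contains 'male' or 'female'; map would turn any other value into NaN rather than 1):

def replace_non_numeric(df):
    # map looks each value up in the dictionary and substitutes the result
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    return df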
Now we'll call that function with our training and test data frames:

train_df = replace_non_numeric(pd.read_csv("train.csv"))
test_df = replace_non_numeric(pd.read_csv("test.csv"))

If we run the program again:

$ python et.py
['Fare', 'Pclass', 'Sex'] -> ET: 0.813692480359
Traceback (most recent call last):
  File "et.py", line 26, in <module>
    et.predict(test_df[columns].values)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/ensemble/forest.py", line 444, in predict
    proba = self.predict_proba(X)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/ensemble/forest.py", line 479, in predict_proba
    X = array2d(X, dtype=DTYPE)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 93, in array2d
    _assert_all_finite(X_2d)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 27, in _assert_all_finite
    raise ValueError("Array contains NaN or infinity.")
ValueError: Array contains NaN or infinity.

There are missing values in the test set, so we'll replace those with average values from our training set using an Imputer:

from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit(features)

test_df = replace_non_numeric(pd.read_csv("test.csv"))

et.fit(features, labels)
print et.predict(imp.transform(test_df[columns].values))

If we run that it completes successfully:

$ python et.py
['Fare', 'Pclass', 'Sex'] -> ET: 0.813692480359
[0 1 0 0 1 0 0 1 1 0 0 0 1 0 1 1 0 0 1 1 0 0 1 0 1 0 1 0 1 0 0 0 1 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1 1 0 1 0 1 0 0 1 0 1 1 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 1 1 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0 1 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 1 1 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 0 0 0 1 1 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0 0 1 1 1 1 0 0 1 0 0 0]

The final step is to add these values to our test data frame and then write that to a file so we can submit it to Kaggle. The predictions come back as a numpy.ndarray, which we can convert to a pandas Series quite easily:

predictions = et.predict(imp.transform(test_df[columns].values))
test_df["Survived"] = pd.Series(predictions)

We can then write the 'PassengerId' and 'Survived' columns to a file:

test_df.to_csv("foo.csv", cols=['PassengerId', 'Survived'], index=False)

The output file looks like this:

$ head -n5 foo.csv
PassengerId,Survived
892,0
893,1
894,0
895,0

The code we've written is on GitHub in case it's useful to anyone.
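As a postscript: since the documentation bills extra-trees as an improvement on random forests, it's easy to check that claim on this data set by scoring a RandomForestClassifier the same way. A minimal sketch that reuses the features, labels and columns variables from above (the hyper-parameters simply mirror the ones we passed to the ExtraTreesClassifier):

from sklearn.ensemble import RandomForestClassifier

# Same cross-validation scoring as before, with a plain random forest swapped in
rf = RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=1, random_state=0)
rf_score = cross_val_score(rf, features, labels, n_jobs=-1).mean()
print("{0} -> RF: {1}".format(columns, rf_score))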
November 14, 2013
by Mark Needham