# Predicting Human Activity from Smartphone Accelerometer and Gyroscope Data

How does my Fitbit track my steps? I always assumed it was pretty accurate, but I never actually knew how it worked.

My Fitbit uses a 3-axial accelerometer to track my motion, according to the company’s website. The tracker then uses an algorithm to determine whether I’m walking or running (as opposed to standing still or driving in a car).

Cool. But how does it actually do that? And how good is it?

Since Fitbit probably won’t release its code anytime soon, I decided to find out how well I could design an algorithm to differentiate walking from staying still (as well as from other activities). After a quick google search, I found a dataset of 3-axial accelerometer signals from an academic experiment on the UC Irvine Machine Learning Repository. The data originally come from SmartLab at the University of Genova. You can download the data here.

The data are pre-split into training and test sets, so we’ll read them in separately. To get the feature names and the activity labels, we have to read separate files too.

import pandas as pd
from scipy.spatial.distance import cosine
import numpy as np
from scipy.sparse import csr_matrix
from __future__ import division
import matplotlib.pyplot as plt
%matplotlib inline

# display pandas results to 3 decimal points, not in scientific notation
pd.set_option('display.float_format', lambda x: '%.3f' % x)

with open('/Users/nickbecker/Downloads/HAPT Data Set/features.txt') as handle:
features = map(lambda x: x.strip(), features)

activity_labels = map(lambda x: x.strip(), activity_labels)

activity_df = pd.DataFrame(activity_labels)
activity_df = pd.DataFrame(activity_df[0].str.split(' ').tolist(),
columns = ['activity_id', 'activity_label'])
activity_df

activity_idactivity_label
01WALKING
12WALKING_UPSTAIRS
23WALKING_DOWNSTAIRS
34SITTING
45STANDING
56LAYING
67STAND_TO_SIT
78SIT_TO_STAND
89SIT_TO_LIE
910LIE_TO_SIT
1011STAND_TO_LIE
1112LIE_TO_STAND

So we have 12 activities, ranging from sitting down to walking up the stairs. The SmartLab researchers created 561 features from 17 3-axial accelerometer and gyroscope signals from the smartphone. These features capture descriptive statistics and moments of the 17 signal distributions (mean, standard deviation, max, min, skewness, etc.). They also did some reasonable pre-processing to de-noise and filter the sensor data (you can read about it here – it’s pretty reasonable). Let’s look at a few features.

x_train = pd.read_table('/Users/nickbecker/Downloads/HAPT Data Set/Train/X_train.txt',
header = None, sep = " ", names = features)

00.044-0.006-0.035-0.995-0.988-0.937-0.995-0.989-0.953-0.795
10.039-0.002-0.029-0.998-0.983-0.971-0.999-0.983-0.974-0.803
20.040-0.005-0.023-0.995-0.977-0.985-0.996-0.976-0.986-0.798
30.040-0.012-0.029-0.996-0.989-0.993-0.997-0.989-0.993-0.798
40.039-0.002-0.024-0.998-0.987-0.993-0.998-0.986-0.994-0.802
y_train = pd.read_table('/Users/nickbecker/Downloads/HAPT Data Set/Train/y_train.txt',
header = None, sep = " ", names = ['activity_id'])

activity_id
05
15
25
35
45
x_test = pd.read_table('/Users/nickbecker/Downloads/HAPT Data Set/Test/X_test.txt',
header = None, sep = " ", names = features)
header = None, sep = " ", names = ['activity_id'])


Now that we’ve got the train and test data loaded into memory, we can start building a model to predict the activity from the acceleration and angular velocity features.

# Building a Human Activity Classifier

Let’s build a model and plot the cross-validation accuracy curves for the training data. We’ll split x_train and y_train into training and validation sets since we only want to predict on the test set once. We’ll use 5-fold cross validation to get a good sense of the model’s accuracy at different values of C. To get the data for the curves, we’ll use the validation_curve function from sklearn.

from sklearn.svm import LinearSVC
from sklearn.grid_search import GridSearchCV
from sklearn.learning_curve import validation_curve

C_params = np.logspace(-6, 3, 10)
clf_svc = LinearSVC(random_state = 12)

train_scores, test_scores = validation_curve(
clf_svc, x_train.values, y_train.values.flatten(),
param_name = "C", param_range = C_params,
cv = 5, scoring = "accuracy", n_jobs = -1)


We can plot the learning curves by following the learning_curve documentation. First, I’ll calculate the means and standard deviations for the plot shading.

import seaborn as sns

train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

y_min = 0.5
y_max = 1.0

sns.set(font_scale = 1.25)
sns.set_style("darkgrid")
f = plt.figure(figsize = (12, 8))
ax = plt.axes()
plt.title("SVM Training and Validation Accuracy")
plt.xlabel("C Value")
plt.ylabel("Accuracy")
plt.ylim(y_min, y_max)
plt.yticks(np.arange(y_min, y_max + .01, .05))
plt.semilogx(C_params, train_scores_mean, label = "CV Training Accuracy", color = "red")
plt.fill_between(C_params, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha = 0.2, color = "red")
plt.semilogx(C_params, test_scores_mean, label = "CV Validation Accuracy",
color = "green")
plt.fill_between(C_params, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha = 0.2, color = "green")
plt.legend(loc = "best")
plt.show()


From the graph, it looks like the best value of C is at 10-1. The validation accuracy begins slowly decreasing after that 10-1, indicating we are overfitting. The validation curve looks great, but we only optimized on C. We don’t have to use a linear kernel. We could do a grid search on different kernels and C values. With a larger search space, we might get a different set of optimal parameters.

Let’s run a parameter grid search to see what the optimal parameters are for the SVM model. We’ll switch to the general Support Vector Classifier in sklearn so we can have the option to try non-linear kernels.

from sklearn.svm import SVC

Cs = np.logspace(-6, 3, 10)
parameters = [{'kernel': ['rbf'], 'C': Cs},
{'kernel': ['linear'], 'C': Cs}]

svc = SVC(random_state = 12)

clf = GridSearchCV(estimator = svc, param_grid = parameters, cv = 5, n_jobs = -1)
clf.fit(x_train.values, y_train.values.flatten())

print clf.best_params_
print clf.best_score_

{'kernel': 'rbf', 'C': 1000.0}
0.925968842539


So the best cross-validated parameters are a rbf kernel with C = 1000. Let’s use the best model to predict on the test data.

clf.score(x_test, y_test)

0.94339025932953824


94% accuracy. That seems pretty high. But how do we know how good that actually is?

# Evaluating the Model

How much better is the model than the no-information rate model? The no-information rate is the accuracy we’d get if we guessed the most common class for every observation. It’s the best we could do with no features.

y_test.activity_id.value_counts().values[0] / y_test.activity_id.value_counts().values.sum()

0.17583807716635041


Our model’s accuracy of 94% looks even better compared to the baseline of 18%. That’s over five times more accurate than the no-information accuracy. But where does our model fail? Let’s look at the crosstab.

crosstab = pd.crosstab(y_test.values.flatten(), clf.predict(x_test),
rownames=['True'], colnames=['Predicted'],
margins=True)
crosstab

Predicted123456789101112All
True
148844000000000496
2204491000100000471
3516399000000000420
4020453520010000508
5000195360100000556
600000544000010545
7010300161101023
8000000010000010
90000001020011032
10000000000190625
110001011011134049
12000000000931527
All5134724044765885452012322950213162

We do really well for activity_ids 1-3, 5, 6, and 8, but much worse for 4, 7, and 9-12. Why is that?

One possible answer is that we don’t have enough data. The All column on the right side of the crosstab indicates that we have way fewer observations for activities 7-12. Getting more data may improve the model, depending on whether it’s a bias or variance issue.

Additionally, it’s clear the model seems to be systematically mistaking some activities for others (activities 4 and 5, 9 and 11, and 10 and 12 are confused for each other more than others). Maybe there’s a reason for this. Let’s put the labels on the activity_ids and see if we notice any patterns.

crosstab_clean = crosstab.iloc[:-1, :-1]
crosstab_clean.columns = activity_df.activity_label.values
crosstab_clean.index = activity_df.activity_label.values
crosstab_clean

WALKINGWALKING
_UPSTAIRS
WALKING
_DOWNSTAIRS
SITTINGSTANDINGLAYINGSTAND_
TO_SIT
SIT_
TO_STAND
SIT_
TO_LIE
LIE_
TO_SIT
STAND_
TO_LIE
LIE_
TO_STAND
WALKING48844000000000
WALKING_
UPSTAIRS
204491000100000
WALKING_
DOWNSTAIRS
516399000000000
SITTING020453520010000
STANDING000195360100000
LAYING00000544000010
STAND_
TO_SIT
0103001611010
SIT_
TO_STAND
0000000100000
SIT_
TO_LIE
00000010200110
LIE_
TO_SIT
0000000001906
STAND_
TO_LIE
00010110111340
LIE_
TO_STAND
0000000009315

That makes it way more clear. The model struggles to classify those activity_ids because they are similar actions. 4 and 5 are both stationary (sitting and standing), 9 and 11 both involving lying down (sit_to_lie and stand_to_lie), and 10 and 12 both involve standing up from a resting position (lie_to_sit and lie_to_stand). It makes sense that accelerometer and gyroscope data on these actions would be similar.

So with 94% accuracy in this activity classifier scenario, can we start a \$3 Billion dollar fitness tracker company? Maybe.

For the simple Fitbit bracelet, we don’t really need to distinguish between the 12 different activities themselves–only whether or not the person is walking. If we could predict whether someone is walking or not from the smartphone with near perfect accuracy, we’d be in business.

## Predicting Walking vs. Not Walking

So how do we classify walking? First we need to convert the labels to a binary to indicate whether they represent walking or staying in place. From the activity labels, we know that 1-3 are walking and everything other activity involves staying in place.

y_train['walking_flag'] = y_train.activity_id.apply(lambda x: 1 if x <= 3 else 0)
y_test['walking_flag'] = y_test.activity_id.apply(lambda x: 1 if x <= 3 else 0)


Now, we’ll train a new SVM model and find the optimal parameters with cross-validation.

from sklearn.svm import SVC

Cs = np.logspace(-6, 3, 10)
parameters = [{'kernel': ['rbf'], 'C': Cs},
{'kernel': ['linear'], 'C': Cs}]

svc = SVC(random_state = 12)

clf_binary = GridSearchCV(estimator = svc, param_grid = parameters, cv = 5, n_jobs = -1)
clf_binary.fit(x_train.values, y_train.walking_flag.values.flatten())

print clf_binary.best_params_
print clf_binary.best_score_

{'kernel': 'linear', 'C': 0.10000000000000001}
0.999098751127

clf_binary.score(x_test, y_test.walking_flag)

0.99905123339658441


99.9% accuracy! Let’s look at the crosstab.

crosstab_binary = pd.crosstab(y_test.walking_flag.values.flatten(), clf_binary.predict(x_test),
rownames=['True'], colnames=['Predicted'],
margins=True)
crosstab_binary

Predicted01All
True
0177411775
1213851387
All177613863162

That’s beautiful! We can almost perfectly tell when someone is walking from smartphone accelerometer and gyroscope data. Now we just need a second model to estimate the distance traveled and we’ll be ready to compete with Fitbit.

For those interested, the Jupyter Notebook with all the code can be found in the Github repository for this post.