Get Started with Data Science by Analyzing Traffic Data from California Highways

GET STARTED WITH DATA
SCIENCE
BY ANALYZING TRAFFIC DATA FROM
CALIFORNIA HIGHWAYS
Cirrus Shakeri, Ph.D.
Inventurist LLC
With contribution and support from:
•  Aerospike Team: Brian Bulkowski, Monica Pal, Dash Desai, and Ondrej Jaura
•  GraphLab Team: Danny Bickson and Eduardo Rosini

The Short Story
•  What we wanted to achieve
•  To understand how data science works and what results can be achieved
•  How repeatable and reusable the data science techniques can be
•  Broaden the use of Aerospike in Realtime Analytics
•  The starting point: data
•  Hard to find the right dataset
•  Datasets from California highways
•  Application example: realtime detection of traffic incidents (a hypothesis)
•  The non-glamorous part of data science
•  Understanding the data and the domain in reasonable depth
•  Data cleansing and preparation
•  The glamorous part
•  The algorithms: can we treat machine learning as a black box?
•  The reality of Data Science hits: We need a data scientist!
•  Glimpses of hope
•  Some promising results, but the hard work still remains; please join the effort

THE LONGER STORY
Our lessons learned and how you can apply these
techniques in your applications

Search for the right dataset
•  Criteria
•  Open, realtime, interesting, not-a-toy dataset, and not-creepy!
•  California highways sensor data
•  16,490 stations, 41,470 sensors across California highways
Courtesy of California Department of Transportation
Start from Data

Checklist for understanding the data
•  Download a sample and study it
•  what are the attributes, what is the range of values, …
•  How is data structured?
•  tables, documents, graph, etc.
•  How do we access it?
•  API, file downloads, …
•  How is the quality of data?
•  missing values, incorrect values, …
•  How is data generated?
•  How fast? Do we have access to the historical data? How much of it?
•  Are there legal and ethical issues in analyzing the data?
•  privacy, security, …
Understand the Dataset

Data format and structure
•  File: “d04_text_station_raw_2014_10_29.txt”
•  CSV format
•  8,403,282 lines
•  565.5 MB

What exactly is in the data
•  Station ID
•  Average speed for lane 1: 44 miles per hour
•  Vehicle flow for lane 1: 17 vehicles passed in 30 sec on lane 1
•  Average occupancy for lane 1: lane 1 was occupied 18% of the time in 30 sec
•  Corresponding values follow for lane 2, lane 3, lane 4, and lane 5
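
To make the record layout concrete, here is a minimal Python sketch that splits one raw 30-second record into per-lane readings. The column order (timestamp, station ID, then flow, occupancy, and speed repeated per lane), the sample line, and the function name parse_raw_record are assumptions for illustration; check them against the actual file before relying on them.

```python
from datetime import datetime

def parse_raw_record(line):
    """Split one raw record into a timestamp, a station ID, and per-lane readings.

    ASSUMED layout: timestamp, station ID, then (flow, occupancy, speed)
    repeated for each lane. Empty fields are returned as None.
    """
    fields = line.strip().split(",")
    timestamp = datetime.strptime(fields[0], "%m/%d/%Y %H:%M:%S")
    station_id = int(fields[1])
    values = fields[2:]
    lanes = []
    # Walk the remaining fields three at a time, ignoring an incomplete tail.
    for i in range(0, len(values) - len(values) % 3, 3):
        flow, occupancy, speed = (float(v) if v else None for v in values[i:i + 3])
        lanes.append({"flow": flow, "occupancy": occupancy, "speed": speed})
    return timestamp, station_id, lanes

# Made-up sample line: lane 1 saw 17 vehicles, 18% occupancy, 44 mph.
sample = "10/29/2014 00:44:30,400001,17,0.18,44,12,0.10,52"
timestamp, station_id, lanes = parse_raw_record(sample)
print(station_id, lanes[0])
```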

Deeper study of the data
•  GraphLab + iPython Notebook
•  Both open source and free
www.dato.com
ipython.org/notebook.html
Pick Tools

Load Sample Data
graphlab.SFrame.read_csv(url_d04_text_station_raw_2014_10_29, ...)
...
PROGRESS: Unable to parse line "10/29/2014 00:44:36,400728,2,...
PROGRESS: Read 778502 lines. Lines per second: 174789
PROGRESS: 577297 lines failed to parse correctly
PROGRESS: Finished parsing file /Users/cirrus/Documents/Aerospike/...
PROGRESS: Parsing completed. Parsed 7825984 lines in 19.8226 secs.
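
The log above shows that some lines fail to parse. As a rough sketch of the same idea with Python's standard csv module, the loader below keeps well-formed rows and counts the rest. What makes a line "bad" in the real file is not stated in this deck; the rule used here (wrong column count or an unparseable timestamp) and the sample rows are assumptions.

```python
import csv
import io
from datetime import datetime

def load_rows(text_stream, expected_columns):
    """Return (good_rows, bad_line_count), skipping rows that look malformed."""
    good_rows, bad = [], 0
    for row in csv.reader(text_stream):
        try:
            if len(row) != expected_columns:
                raise ValueError("unexpected column count")
            # Raises ValueError when the first field is not a timestamp.
            datetime.strptime(row[0], "%m/%d/%Y %H:%M:%S")
            good_rows.append(row)
        except ValueError:
            bad += 1
    return good_rows, bad

# Made-up sample: two good rows and two malformed ones.
sample = io.StringIO(
    "10/29/2014 00:00:30,400001,17,0.18,44\n"
    "10/29/2014 00:01:00,400001,15\n"
    "not-a-timestamp,400001,16,0.17,45\n"
    "10/29/2014 00:01:30,400001,14,0.15,47\n"
)
good_rows, bad_count = load_rows(sample, expected_columns=5)
print(len(good_rows), "rows kept,", bad_count, "rows skipped")
```

Counting the skipped lines, rather than silently dropping them, keeps the data quality question visible from the first load.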

Study Range of Values in the Dataset
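
A quick way to study the range of values for one attribute is to compute its minimum, maximum, mean, and number of missing entries. This is a plain-Python sketch with made-up speed values, not the GraphLab summary used in the talk; the helper name summarize is hypothetical.

```python
def summarize(values):
    """Return min, max, mean, and the count of missing (None) entries."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    if not present:
        return {"min": None, "max": None, "mean": None, "missing": missing}
    return {
        "min": min(present),
        "max": max(present),
        "mean": sum(present) / len(present),
        "missing": missing,
    }

speeds = [44.0, 61.5, None, 58.0, 0.0, None, 67.5]  # made-up values in mph
speed_summary = summarize(speeds)
print(speed_summary)
```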

Lesson Learned: Need a fast database
•  Exploring sample files worked as a start
•  Next we needed to slice and dice large amounts of the traffic data in various ways
•  How does speed at a specific station change over the course of a year at a specific time of day?
•  Are there repeatable seasonal or weekly patterns?
•  What about traffic behavior along a specific highway, say US-101 in the Bay Area?
•  ...
•  This requires loading a large amount of historical data in a database
and running queries
•  Exploring data should be interactive so the database must be fast
•  Current plan
•  Load a good portion of the highway traffic data into Aerospike
•  Provide developer community free API access
•  Experiment with the data and collaborate on building applications
•  The motivation is to open up the use of the Aerospike technology to different
applications especially in realtime analytics
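
The questions above boil down to filter-and-group queries. The sketch below runs one of them in memory as a stand-in for a database query; it does not use the Aerospike API, and the records, station IDs, and function name speed_by_weekday are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Made-up records: (timestamp, station ID, average speed in mph).
records = [
    (datetime(2014, 10, 29, 8, 0), 400001, 35.0),
    (datetime(2014, 10, 29, 14, 0), 400001, 64.0),
    (datetime(2014, 10, 30, 8, 0), 400001, 31.0),
    (datetime(2014, 10, 30, 8, 0), 400002, 58.0),
    (datetime(2014, 11, 1, 8, 0), 400001, 66.0),
]

def speed_by_weekday(records, station_id, hour):
    """Average speed per weekday for one station at one hour of the day."""
    buckets = defaultdict(list)
    for timestamp, station, speed in records:
        if station == station_id and timestamp.hour == hour:
            buckets[DAY_NAMES[timestamp.weekday()]].append(speed)
    return {day: sum(v) / len(v) for day, v in buckets.items()}

morning = speed_by_weekday(records, station_id=400001, hour=8)
print(morning)
```

With a year of data for thousands of stations, each such question scans a large number of records, which is why interactive exploration needs a fast database rather than files.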

Good Data vs. Bad Data
Check the Data Quality

Imputed Data: 5-min aggregate

Two more datasets
•  Station locations
•  Incident data
Add More Datasets

Detecting Incidents in Realtime
•  Can we detect accidents based on sensor data?
•  Is there a value in such a prediction?
•  Can we detect accidents faster than the way they are reported
now?
•  Is it doable? Has it been done before?
•  Can we ‘predict’ accidents?
•  A better understanding of the domain is needed
Generate Application Ideas

Literature Survey
•  Types of incidents: Collision, Debris, Break-down
Study the Domain
Reference: “Real-Time Detection of Road Traffic Incidents” by Pero Škorput, et al.


Can incidents be detected in near realtime?
(under 1 minute)
“The survey responses point to a general consensus that the unacceptably high rates of
false alarms generated by available incident detection algorithms is the major
deterrent.”

Algorithmic Approaches
•  Comparative algorithms
•  Current traffic data is compared with the ‘normal’ threshold
•  Statistical algorithms
•  Current traffic data is compared with forecasted data
•  Traffic theory based algorithms
•  Rapid change in speed while flow or occupancy do not change
considerably (based on ‘catastrophe theory’)
•  Advanced algorithms
•  AI and neural networks
Reference: “Overview to Some Incident Detection Algorithms: A Comparative Evaluation with Istanbul Freeway Data” by Onur Deniz, et al.
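
To illustrate two of the families above, here is a heavily simplified Python sketch: a comparative check against a 'normal' threshold, and a traffic-theory style check that fires when speed drops sharply while flow and occupancy stay roughly the same. These are not the published algorithms; the function names, thresholds, and readings are all assumptions chosen for illustration.

```python
def comparative_alarm(occupancy, normal_threshold):
    """Comparative check: flag a reading above the 'normal' occupancy level."""
    return occupancy > normal_threshold

def relative_change(old, new):
    """Absolute change as a fraction of the old value (0.0 if old is zero)."""
    return abs(new - old) / old if old else 0.0

def speed_drop_alarm(previous, current, speed_drop=0.3, stable=0.1):
    """Flag a sharp speed drop while flow and occupancy barely change.

    speed_drop and stable are illustrative fractions, not calibrated values.
    """
    if previous["speed"] <= 0:
        return False
    speed_fell = (previous["speed"] - current["speed"]) / previous["speed"] >= speed_drop
    flow_stable = relative_change(previous["flow"], current["flow"]) <= stable
    occupancy_stable = relative_change(previous["occupancy"], current["occupancy"]) <= stable
    return speed_fell and flow_stable and occupancy_stable

# Made-up consecutive readings at one station.
before = {"speed": 62.0, "flow": 17, "occupancy": 0.18}
sudden = {"speed": 38.0, "flow": 16, "occupancy": 0.19}
congested = {"speed": 38.0, "flow": 9, "occupancy": 0.40}

print(speed_drop_alarm(before, sudden))     # speed fell, flow and occupancy stable
print(speed_drop_alarm(before, congested))  # flow and occupancy changed as well
print(comparative_alarm(0.45, normal_threshold=0.30))
```

Fixed thresholds like these are exactly where false alarms come from, which is the deterrent the survey quote points to.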

Study Patterns and Correlations
Explore the Data

Exploring the Data – Speed

Exploring the Data – Occupancy

Exploring the Data – Flow

FORMULATE THE DATA
SCIENCE APPROACH

Detecting Incidents as a Classification Problem
•  Features
•  Started simple: flow, occupancy, and speed
•  Next: prepare training data
•  Any individual record is classified as incident or no-incident
Prototype the Solution

Classifiers in GraphLab
•  Classification models supported in GraphLab
•  support vector machines (SVM)
•  classify instances based on a linear function of the features
•  logistic regression
•  estimates the probability of instances belonging to a class as a
logistic function of a linear combination of features
•  boosted trees
•  combines the results of a set of base classification decision
trees
•  neural networks
•  middle layers of the network compute composite, intermediary
features
•  Also, a smart interface that selects the right model
based on the data
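
As a small illustration of the logistic regression description above, this Python sketch computes a probability as a logistic function of a linear combination of features. The weights, bias, and feature values are made up; a real model learns the weights from training data.

```python
import math

def incident_probability(features, weights, bias):
    """Logistic function applied to a linear combination of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Features in the order [flow, speed, occupancy]; all numbers are illustrative.
features = [17, 44, 0.18]
weights = [0.0, -0.1, 5.0]
p = incident_probability(features, weights, bias=3.0)
print(round(p, 4))
```

A record is then classified as an incident when the probability crosses a chosen cutoff, commonly 0.5.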

PREPARE TRAINING DATA
TRAIN THE MODEL
EVALUATE THE MODEL

Why do we need training data?
•  Supervised learning: a classifier is trained via a dataset
that is labeled with positive and negative instances
[Figure: upstream occupancy/speed plotted against downstream occupancy/speed; points are labeled incident or no-incident, and an unlabeled point is marked “?”]

Training Data – the end result

Train the Classifier
•  This is the ‘smart interface’ where GraphLab decides which classifier to use
•  Note that the features are simply flow, speed and
occupancy
•  Is this model oversimplified?
import graphlab

# Let GraphLab choose the classifier; the features are flow, speed, and occupancy.
model = graphlab.classifier.create(
    final_training_data_d0_2014_10_29,
    target='incident_happened',
    features=['total_flow', 'avg_speed', 'avg_occupancy'])

First results were embarrassing!
•  Basically no incident could be detected in the test data
•  The reality of data science hit!
•  Leveraged the GraphLab online community
•  Worked with GraphLab Data Scientist: Danny Bickson
•  Made some changes
•  Added more features
•  Station ID, number of lanes, direction of travel, absolute postmile, and the hour/minute/day that the incident occurred
•  Balanced the negative and positive examples
•  Tagged station data in a 3-mile vicinity of an incident as positive instances
•  Created classifiers for specific highways (e.g., US-101)
•  …

Training Data Preparation Steps
•  Load CSV files for 5-min aggregate sensor data
•  From Oct 29 – Nov 4, 2014
•  In District 4 (Bay Area)
•  Prepare sensor dataset
•  Filter for highway US-101
•  1,861,626 records
•  Convert data types
•  String to datetime
•  Split timestamp and add as separate features
•  Add day of week as extra feature
•  Drop records with unknown values
•  1,447,195 records
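
The sensor preparation steps above can be sketched in plain Python as follows. The timestamp format is assumed to match the raw-file log line shown earlier (month/day/year hour:minute:second); the field names and sample rows are made up, and the actual work was done with GraphLab rather than this code.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def add_time_features(record):
    """Parse the timestamp string and add its parts as separate features."""
    ts = datetime.strptime(record["timestamp"], "%m/%d/%Y %H:%M:%S")
    enriched = dict(record)
    enriched.update(hour=ts.hour, minute=ts.minute, day=ts.day,
                    day_of_week=DAY_NAMES[ts.weekday()])
    return enriched

def has_unknown(record):
    """True when any field is missing (None or an empty string)."""
    return any(v is None or v == "" for v in record.values())

# Made-up 5-min aggregate rows; the second has an unknown speed.
rows = [
    {"timestamp": "10/29/2014 08:05:00", "station": 400001, "avg_speed": 61.0},
    {"timestamp": "10/29/2014 08:10:00", "station": 400001, "avg_speed": None},
]
cleaned = [add_time_features(r) for r in rows if not has_unknown(r)]
print(cleaned)
```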

•  Load CSV file for sensor locations
•  920 sensor locations for US101S in District 4
•  Convert highway data to match sensor data
•  SR101S → highway = 101, direction of travel = ‘S’
•  Join with sensor data
•  Load CSV file for incidents from Oct 29 – Nov 4
•  Convert data types, drop records with unknown values, …
•  Find sections of highway 101 in District 4
•  Filter for incident data for 101S in District 4
•  176 incidents
•  Create a nearest neighbor model for sensor locations
•  To query the nearest sensors to an incident
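
Converting a label such as SR101S into a highway number and a direction, then joining locations with sensor readings, might look like the sketch below. The regular expression, field names, and sample data are assumptions; only the SR101S example comes from the slides.

```python
import re

# Letters, then the highway number, then a one-letter direction.
ROUTE_PATTERN = re.compile(r"^[A-Z]+(\d+)([NSEW])$")

def split_route(route):
    """Split a label such as 'SR101S' into (highway number, direction)."""
    match = ROUTE_PATTERN.match(route)
    if match is None:
        raise ValueError("unrecognized route label: " + route)
    return int(match.group(1)), match.group(2)

# Made-up station locations keyed by station ID.
locations = {400001: {"route": "SR101S", "abs_pm": 401.2}}
readings = [
    {"station": 400001, "avg_speed": 61.0},
    {"station": 499999, "avg_speed": 55.0},  # no known location
]

joined = []
for reading in readings:
    location = locations.get(reading["station"])
    if location is None:
        continue  # inner join: drop readings without a location
    highway, direction = split_route(location["route"])
    joined.append(dict(reading, highway=highway, direction=direction,
                       abs_pm=location["abs_pm"]))
print(joined)
```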

•  For each incident on highway 101S in the Bay Area between Oct 29 and Nov 4:
•  Find the sensor stations within a 3-mile radius
•  For each sensor within the 3-mile radius
•  Extract the sensor data for any timestamp that falls within the ‘duration of the incident’
•  Add sensor data to the ‘positive examples’ in the training data
•  28,460 records
•  Note: upstream/downstream stations not taken into account
•  The ‘negative examples’ data (no incidents) is disproportionately larger (10 times)
•  Extract a 10% sample of the negative examples data
•  Add sampled sensor data for negative examples to the training data
SVM – Results
Data:
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
| 0 | 0 | 3746 |
| 0 | 1 | 12330 |
| 1 | 0 | 3522 |
| 1 | 1 | 3711 |
+--------------+-----------------+-------+
[4 rows x 3 columns]
, 'accuracy': 0.31991934445922177}

Boosted Trees – Results
Data:
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
| 0 | 0 | 15082 |
| 0 | 1 | 994 |
| 1 | 0 | 5574 |
| 1 | 1 | 1659 |
+--------------+-----------------+-------+
[4 rows x 3 columns]
, 'accuracy': 0.7182204298768716}
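
Accuracy alone hides how the two models differ. The sketch below recomputes accuracy from the two confusion matrices above and adds precision and recall for the incident class, assuming label 1 means an incident record (the target column is named incident_happened).

```python
def metrics(tn, fp, fn, tp):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Counts copied from the SVM and boosted trees tables.
svm = metrics(tn=3746, fp=12330, fn=3522, tp=3711)
boosted = metrics(tn=15082, fp=994, fn=5574, tp=1659)

for name, result in [("SVM", svm), ("Boosted trees", boosted)]:
    print(name, {k: round(v, 3) for k, v in result.items()})
```

Under that assumption, boosted trees reach higher accuracy and precision but flag fewer than a quarter of the incident records, while the SVM flags about half of them at the cost of many false alarms.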

Ideas for further improving the results
•  Work with ‘raw’ station data instead of 5-min aggregates
•  Further explore the data for all of 2014 and all California highways
•  This is not easy when you are dealing with files; loading the data into Aerospike comes to the rescue
•  Add more features
•  difference between values in neighboring stations
•  difference between consecutive times at the same location
•  ...
•  Try different models for rush-hour and off-peak hours
•  Take into account seasonal effects
•  Validate the actual time of incident in training data
•  Include upstream and downstream data as separate features
•  Investigate causality
•  Is an incident causing slow-down or is it the other way around?
•  Stretch goal: Can we ‘predict’ incidents before they happen?
•  Leading indicators for accidents
•  Improve safety of self-driving cars
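
Two of the proposed features, the difference between neighboring stations and the difference between consecutive times at the same location, could be derived as in this sketch. The grid layout (rows are time steps, columns are stations ordered along the highway), the feature names, and the speeds are assumptions for illustration.

```python
def add_difference_features(grid):
    """grid[t][s] is the speed at time step t and station index s.

    Returns one feature dict per (time step, station) cell.
    """
    rows = []
    for t, speeds in enumerate(grid):
        for s, speed in enumerate(speeds):
            rows.append({
                "time_step": t,
                "station_index": s,
                "speed": speed,
                # Change since the previous time step at the same station.
                "diff_prev_time": speed - grid[t - 1][s] if t > 0 else None,
                # Difference to the next station along the highway.
                "diff_next_station": speeds[s + 1] - speed if s + 1 < len(speeds) else None,
            })
    return rows

# Made-up speeds (mph): the middle station slows sharply at the second step.
grid = [
    [62.0, 60.0, 61.0],
    [61.0, 35.0, 60.0],
]
feature_rows = add_difference_features(grid)
print(feature_rows[4])
```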

Data Science – it’s iterative
[Figure: an iterative cycle linking Data, Domain, Problem, Data Science, Prototype, and Product]
Last Word

Take Away 1: Realtime Analytics
•  Realtime (fast) vs. offline analytics (slow)
•  Getting closer to how the human brain makes decisions
The next big thing:
•  The line between capturing and analyzing data is disappearing
•  And technologies such as Aerospike and GraphLab are helping!

Take Away 2: Where does all this lead?
1.  More intelligence for us
•  ‘Augmenting Human Intellect’
•  Doug Engelbart
2.  More automation for machines
•  ‘The Second Machine Age’
•  Erik Brynjolfsson, Andrew McAfee

THANK YOU
Cirrus Shakeri, Ph.D.
Co-Founder, Inventurist
+1 650-380-9794
twitter.com/cirrus_shakeri
www.linkedin.com/in/cshakeri/
cirrus.shakeri@inventurist.com

Visualize an incident at a time and location
[Figure: time (00:00:00 to 23:55:00) plotted against location in miles (Abs PM, 0.0 to 10.0), marking the incident start time and location, the station locations, the distance between stations, and the start and end timestamps of each 5-min aggregate]
