Democratizing Data
at Airbnb
CHRIS WILLIAMS / JOHN BODLEY / MAY 11, 2017
Airbnb connects people to
unique travel experiences
The problem
tribal knowledge |ˈtrībəl ˈnäləj |
noun
Tribal knowledge is any unwritten information that is not commonly
known by others within a company
Relying on tribal knowledge stifles productivity
As Airbnb grows so do the challenges around the volume,
complexity, and obscurity of data
In a large and complex organization, with a sea of data
resources, users struggle to find the right data
Data is often siloed, inaccessible, or lacks context
I’m a recovering Data Scientist who wants to democratize
data, automate common workflows, surface relevant
information, and provide context
200k tables in our Hive data warehouse
> 10,000 Superset charts and dashboards
> 6,000 experiments and metrics
> 6,000 Tableau workbooks and charts
> 1,500 Knowledge posts
Data resources
Beyond the data warehouse
With many more data sources
and data types to love
and most importantly…
> 3,500 Airbnb employees
Portland
San Francisco
Los Angeles
Toronto
New York
Miami
Sao Paulo
Dublin
London
Paris
Barcelona
Berlin
Milan
Copenhagen
New Delhi
Seoul
Beijing
Tokyo
Sydney
Singapore
Washington, DC
> 20 offices around the world
The mandate
To democratize data and empower Airbnb employees to be data-informed by aiding with data exploration, discovery, and trust
The concept
Search…
It should be fairly evident what we feed into the search indices
But are we missing something?
The relevancy of relationships
Nodes and relationships have equal standing
[Figure: a node with "created" and "consumed" relationships]
The graph
[Figure: a graph whose nodes are linked by "created", "consumed", and "associated" relationships]
The construction
6 databases
4 APIs
1 Airflow DAG
We leverage all these data resources to build a graph in Hive comprising nodes and relationships
The workflow runs every day, though the graph is left to soak to prevent flickering
Addressing graph flickering
The issue is that certain types of relationships are sporadic in nature, causing the graph to flicker
Persistent vs. transient relationships
Persistent relationships represent a snapshot in time
[Figure: a "created" relationship]
Transient relationships represent events which are somewhat sporadic in nature
[Figure: a "consumed" relationship across the weekdays M, Tu, W, Th, F]
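One way to picture the "soak" is a trailing window: a transient relationship stays in the graph as long as it was observed at least once in the last few days, so a quiet day does not make it disappear. The sketch below is only an illustration of that idea, not the Dataportal implementation; the `Edge` shape, the `soaked_edges` helper, the `soak_days` parameter, and the sample data are all assumptions.

```python
from datetime import date, timedelta
from typing import Dict, Iterable, Set, Tuple

# (source id, relationship type, target id); hypothetical shape for illustration.
Edge = Tuple[str, str, str]


def soaked_edges(
    observations: Dict[date, Iterable[Edge]],
    as_of: date,
    soak_days: int = 7,
) -> Set[Edge]:
    """Return every edge observed within the trailing `soak_days` window.

    Assumption: a transient edge (e.g. "consumed") is kept while it has been
    seen at least once in the window ending on `as_of`, inclusive.
    """
    window_start = as_of - timedelta(days=soak_days - 1)
    kept: Set[Edge] = set()
    for day, edges in observations.items():
        if window_start <= day <= as_of:
            kept.update(edges)
    return kept


# A user consumes a table on Monday and Thursday only.
monday = date(2017, 5, 8)
view = ("jane_doe", "consumed", "dim_users")
daily = {
    monday: [view],
    monday + timedelta(days=3): [view],
}

# With a one-day window Friday's graph drops the edge; with a 7-day window it stays.
friday = monday + timedelta(days=4)
print(view in soaked_edges(daily, as_of=friday))               # True
print(view in soaked_edges(daily, as_of=friday, soak_days=1))  # False
```

Persistent relationships such as "created" would not need this treatment, since each daily snapshot already contains them.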
The winding data path
Airflow: data transfer
Python: graph datastore
neo4j-driver: Python Neo4j driver
Neo4j: graph database
GraphAware: Neo4j/Elasticsearch plugin
Elasticsearch: search engine
Flask: Python web framework
Hive: data warehouse
Why we chose Neo4j for our database
The four main reasons
Logical: Given that our data is represented as a graph, it is logical to use a graph database to store it
Nimble: Performance wins over relational databases when dealing with connected data
Popular: It is the world’s leading graph database, and the community edition is free
Integrative: It integrates well with Python and Elasticsearch
The Neo4j and Elasticsearch symbiotic relationship
Courtesy of two GraphAware plugins
Neo4j plugin
Provides bi-directional integration which transparently and asynchronously replicates data from Neo4j to Elasticsearch
Elasticsearch plugin
Enables Elasticsearch to consult with the Neo4j database during a search query to enrich the
search rankings by leveraging the graph topology
Node label hierarchy
:Entity
  :Org
    :Group
    :User
  :Tableau
    :Workbook
    :Chart
  :Hive
    :Schema
    :Table
(:Entity:Org:User {id: 'jane_doe'})
(:Entity:Hive:Table {id: 'dim_users'})
(:Entity:Tableau:Chart {id: '12345'})
MATCH (n:Entity:Org:User {id: '<id>'})
USING INDEX n:User(id)
RETURN n
From local to global uniqueness
A mechanism to reference nodes in an abstract manner
GraphAware UUID plugin
Transparently assigns a globally unique UUID property to newly created elements (nodes and
relationships) which cannot be changed or deleted
Globally unique
Enables us to uniquely identify a single node via the Entity label and UUID property, which allows for parameterized queries and, in turn, faster query execution
MATCH (n:Entity {uuid: '<uuid>'})
USING INDEX n:Entity(uuid)
RETURN n
/api/graph/nodes/org/user/<id>
/api/graph/nodes/<uuid>
/api/graph/relationships/<uuid>/created/<uuid>
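To make the two node routes concrete, here is a minimal sketch of how a path could be turned into a parameterized Cypher query. It is not the Dataportal code: the regexes, the `resolve` function, the `$id`/`$uuid` parameter style, and the rule that URL segments map to labels by capitalization are all assumptions made for illustration.

```python
import re
from typing import Dict, Optional, Tuple

# Route shapes copied from the slide; the regexes and function name are hypothetical.
UUID_ROUTE = re.compile(r"^/api/graph/nodes/(?P<uuid>[^/]+)$")
LABEL_ROUTE = re.compile(
    r"^/api/graph/nodes/(?P<group>[a-z]+)/(?P<kind>[a-z]+)/(?P<id>[^/]+)$"
)


def resolve(path: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Map a node route to a parameterized Cypher query, or None if unmatched."""
    match = UUID_ROUTE.match(path)
    if match:
        # Global lookup: one label, one property, one index.
        query = "MATCH (n:Entity {uuid: $uuid}) USING INDEX n:Entity(uuid) RETURN n"
        return query, {"uuid": match.group("uuid")}
    match = LABEL_ROUTE.match(path)
    if match:
        # Assumption: URL segments map to labels by capitalizing them (org -> Org).
        group = match.group("group").capitalize()
        kind = match.group("kind").capitalize()
        query = (
            f"MATCH (n:Entity:{group}:{kind} {{id: $id}}) "
            f"USING INDEX n:{kind}(id) RETURN n"
        )
        return query, {"id": match.group("id")}
    return None


print(resolve("/api/graph/nodes/org/user/jane_doe"))
```

Note that only the ID and UUID travel as query parameters. Labels cannot be parameterized in Cypher, which is why the label route has to build its query text per label pair while the UUID route can reuse a single query string.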
The frontend
web app
Designing the interface and user experience of a data tool should not be an afterthought
User personas
Daphne Data: Technical data power user; the epitome of a tribal knowledge holder
Manager Mel: Less data literate; needs to keep tabs on her team’s resources
Nathan New: New employee, new team, or new to data; has no idea what’s going on
Designing for data exploration, discovery, and trust
[Diagram: Search; Resource details & metadata; User data; Group data; Company data]
Google-esque search filters
Resource details & metadata
Context, context, & context
Surface relationships,
everything’s a link to promote
exploration
Metadata & consumption
Description, external link, social
Column details & value distributions
Table lineage
Enrich metadata on the fly
User details & metadata
What they make, what they consume
Former employees also hold tribal knowledge
Group overview
Thumbnails for maximum context
Basic organization functionality
Pinterest-like curation & suggested content
We gather over 15,000 thumbnails from Tableau, Superset, and the Knowledge Repo
Pinning flow from resource page
Edit mode / draggable grid
Employees can feel disconnected
from Company-level metrics
The technology stack
Application +
dependencies
DOM Testing
eslint
enzyme
mocha
chai
Application
state
Styling
khan/aphrodite
The challenges
Proxy nodes: Abstracting complexity where necessary while accurately modeling the data ecosystem
Graph merging: Non-trivial Git-like merging of graph updates
Data-dense design: Balancing simplicity and functionality is hard; most internal design resources are not made for data-rich apps
Complex dependencies: An umbrella data tool is vulnerable to changes in upstream resource dependencies
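The "Git-like" part of graph merging can be pictured as a diff: compare the freshly built graph with what is already stored, then apply only the additions, removals, and changes. The sketch below covers nodes only and is an assumption about the general approach, not the team's algorithm; `NodeKey`, `GraphDiff`, `diff_nodes`, and the `owner` property are hypothetical.

```python
from typing import Dict, NamedTuple, Set, Tuple

# (labels, id) identifies a node; hypothetical key shape for illustration.
NodeKey = Tuple[Tuple[str, ...], str]


class GraphDiff(NamedTuple):
    added: Set[NodeKey]
    removed: Set[NodeKey]
    changed: Set[NodeKey]


def diff_nodes(current: Dict[NodeKey, dict], incoming: Dict[NodeKey, dict]) -> GraphDiff:
    """Compare stored nodes with a new build, keyed by (labels, id)."""
    current_keys, incoming_keys = set(current), set(incoming)
    changed = {
        key
        for key in current_keys & incoming_keys
        if current[key] != incoming[key]  # same node, different properties
    }
    return GraphDiff(
        added=incoming_keys - current_keys,
        removed=current_keys - incoming_keys,
        changed=changed,
    )


stored = {
    (("Hive", "Table"), "dim_users"): {"owner": "jane_doe"},
    (("Tableau", "Chart"), "12345"): {},
}
rebuilt = {
    (("Hive", "Table"), "dim_users"): {"owner": "john_doe"},
    (("Org", "User"), "jane_doe"): {},
}
diff = diff_nodes(stored, rebuilt)
print(diff.added, diff.removed, diff.changed)
```

Applying a diff rather than reloading the whole graph keeps existing nodes in place, which matters when other properties on them (such as an assigned UUID) must survive updates.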
The future
Gamification: Provide content producers with a sense of value
Alerts & recommendations: Move from active exploration to delivering relevant updates and content suggestions
Certified content: Use certification to build trust and enable users to filter through a sea of stale content
Network analysis: Determine obsolete nodes, critical paths, lines of communication, etc.
The team
The Dataportal team
Analytics & Experimentation Products
John Bodley, Software Engineer
Eli Brumbaugh, Experience Designer
Jeff Feng, Product Manager
Michelle Thomas, Software Engineer
Chris Williams, Data Visualization
Thank you
Appendix
Dealing with mutual relationships
Naturally bidirectional relationships
[Figure: two nodes joined by an "associated" relationship]
Modeling both directions creates an unnecessary relationship
The most efficient solution is to use a single relationship in the many-to-one direction
CREATE TABLE nodes (
  labels ARRAY<STRING>,
  id STRING,
  properties STRING
)
{
  labels: ['Org', 'User'],
  id: 'jane_doe'
}
{
  labels: ['Hive', 'Table'],
  id: 'dim_users'
}
{
  labels: ['Tableau', 'Chart'],
  id: '12345'
}
CREATE TABLE relationships (
  source STRUCT<labels:ARRAY<STRING>,id:STRING>,
  target STRUCT<labels:ARRAY<STRING>,id:STRING>,
  type STRING,
  properties STRING
)
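The two table definitions can be mirrored in Python to show what a single row might look like before it is written to Hive. This is a sketch only: `NodeRef`, `node_row`, and `relationship_row` are hypothetical names, and encoding the `properties` column as JSON is an assumption, since the slides only say it is a STRING.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class NodeRef:
    """Mirrors the STRUCT<labels:ARRAY<STRING>,id:STRING> used for source and target."""
    labels: Tuple[str, ...]
    id: str


def node_row(ref: NodeRef, properties: Dict[str, Any]) -> Dict[str, Any]:
    """Build one row for the nodes table."""
    return {
        "labels": list(ref.labels),
        "id": ref.id,
        # Assumption: the STRING column carries a JSON-encoded property map.
        "properties": json.dumps(properties, sort_keys=True),
    }


def relationship_row(
    source: NodeRef, rel_type: str, target: NodeRef, properties: Dict[str, Any]
) -> Dict[str, Any]:
    """Build one row for the relationships table."""
    return {
        "source": {"labels": list(source.labels), "id": source.id},
        "target": {"labels": list(target.labels), "id": target.id},
        "type": rel_type,
        "properties": json.dumps(properties, sort_keys=True),
    }


jane = NodeRef(("Org", "User"), "jane_doe")
chart = NodeRef(("Tableau", "Chart"), "12345")
row = relationship_row(jane, "created", chart, {})
print(row["source"], row["type"], row["target"])
```

Because a relationship refers to its endpoints by the (labels, id) pair rather than by a database-assigned key, the graph can be assembled entirely in Hive before anything is loaded into Neo4j.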
Efficient data retrieval
Restrictions and workarounds with Neo4j indexes
Problem
Indexes provide for efficient data retrieval, similar to an RDBMS primary key; however, they are only defined for a single label as opposed to our tuple of hierarchical labels
Solution
Create an index for every label keyed by the ID and UUID properties, which, in addition to index hints, provides optimal node retrieval
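As a rough illustration of "an index for every label", the statements could be generated rather than written by hand. The label list below is taken from the node label hierarchy; the `index_statements` helper is hypothetical, and the `CREATE INDEX ON :Label(property)` form is the Neo4j 3.x syntax, which later Neo4j versions replaced.

```python
from typing import Iterable, List

# Labels from the node label hierarchy in this deck.
LABELS = [
    "Entity", "Org", "Group", "User",
    "Tableau", "Workbook", "Chart",
    "Hive", "Schema", "Table",
]


def index_statements(
    labels: Iterable[str], properties: Iterable[str] = ("id", "uuid")
) -> List[str]:
    """Emit one CREATE INDEX statement per (label, property) pair (Neo4j 3.x syntax)."""
    props = tuple(properties)
    return [
        f"CREATE INDEX ON :{label}({prop})"
        for label in labels
        for prop in props
    ]


statements = index_statements(LABELS)
for statement in statements:
    print(statement)
```

With these in place, a query can name whichever single-label index it wants through `USING INDEX`, as in the `n:User(id)` and `n:Entity(uuid)` hints shown earlier.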
