Scaling Data Science at Stitch Fix
Stefan Krawczyk
@stefkrawczyk
linkedin.com/in/skrawczyk
October 2017
Try out Stitch Fix → goo.gl/Q3tCQ3
How many Data Scientists do you have?
At Stitch Fix we have ~80
Two Data Scientist facts:
1. Ability to spin up their own resources*.
2. End to end, they’re responsible.
But what do they do?
What is Stitch Fix?
Personal Styling Service
5000+ Job Definitions
Lots of Compute & Data Movement!
So how did we get to our scale?
Non-Technical Answer:
Helping Data Scientists get out of each other’s way!
Technical Answer:
Reducing Contention
Contention is Correlated with
Unhappy Data Scientists & Burning Infrastructure
Contention on:
● Access to Data
● Access to Compute Resources
○ Ad-hoc
○ Production
Focus of this talk: Access to Data, and Ad-hoc compute access.
Fellow Collaborators
Horizontal team focused on Data Scientist Enablement:
jeff, akshay, jacob, tarek, kurt, derek, patrick, thomas, steven, liz, alex, paul, jana, neelesh, chris, nik, sky, juliet, wei, adam
Data Access:
Unhappy DS & Burning Infrastructure
● Can’t write fast enough
● Can’t read fast enough
● These two interact
● Not enough space
● Limited by tools
So how does Stitch Fix mitigate these problems?
Data Access:
S3 & Hive Metastore
What is S3?
● Amazon’s Simple Storage Service.
● Infinite* storage.
● Looks like a file system*:
○ URIs: my.bucket/path/to/files/file.txt
● Can read, write, delete, BUT NOT append (or overwrite).
● Lots of companies rely on it -- famously Dropbox.
* For all intents and purposes
S3 @ Stitch Fix
Writing Data: Hard to Saturate
Reading Data: Hard to Saturate
Writing & Reading Interference: Haven’t Experienced
Space: “Infinite”
Tooling: Lots of Options
● Data Scientists’ main datastore since very early on.
● S3 essentially removes any real worries with respect to data contention!
S3 is not a complete solution!
What is the Hive Metastore?
● Hadoop service that stores:
○ Schema
○ Partition information, e.g. date
○ Data location for a partition
Hive Metastore entry for sold_items (illustrative sketch after the table):
Partition    Location
20161001     s3://bucket/sold_items/20161001
...          ...
20161031     s3://bucket/sold_items/20161031
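To make the mapping concrete, here is a minimal, purely illustrative Python sketch of the kind of bookkeeping the metastore does for sold_items; the structure and column names are assumptions, not the actual metastore API.

# Illustrative only: a table's schema plus, per partition, the S3 location of its data.
sold_items_metadata = {
    "schema": [("date", "string"), ("item_id", "bigint"), ("price", "decimal(10,2)")],
    "partitions": {
        "20161001": "s3://bucket/sold_items/20161001",
        "20161031": "s3://bucket/sold_items/20161031",
    },
}

def partition_location(metadata, partition):
    # The one question reader tools keep asking: where does this partition live?
    return metadata["partitions"][partition]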
Hive Metastore @ Stitch Fix
Brought in to:
● Bring centralized order to data being stored on S3
● Provide metadata to build more tooling on top of
● Enable use of existing open source solutions
S3 + Hive Metastore
● Our central source of truth!
● Never have to worry about space.
● Trading for immediate speed, you have consistent read & write performance.
○ “Contention Free”
● Decoupled data storage layer from data manipulation.
○ Very amenable to supporting a lot of different data sets and tools.
Our Current Picture
Caveat: Eventual Consistency
● Replacing data in a partition
Replacing a file on S3 (diagram: file A being replaced by file B)
● S3 is eventually consistent*
● These bugs are hard to track down
● Need everyone to be able to trust the data.
* for existing files
Avoiding Eventual Consistency
● Recall: Hive Metastore controls partition source of truth
● Principles:
○ Never delete
○ Always write to a new place each time a partition changes
● What do we mean by “new place”?
○ Use an inner directory → called Batch ID
Batch ID Pattern
sold_items
Date        Location
20161001    s3://bucket/sold_items/20161001/20161002002334/
...         ...
20161031    s3://bucket/sold_items/20161031/20161101002256/
            → s3://bucket/sold_items/20161031/20161102234252/ (after an overwrite)
● Overwriting a partition is just a matter of updating the location (see the sketch after the benefits list below)
● To the user this is a hidden inner directory
Batch ID Pattern Benefits
● Avoids eventual consistency issue
● Jobs finish on the data they started on
● Full partition history:
○ Can rollback
■ Data Scientists are less afraid of mistakes
○ Can create audit trails more easily
■ What data changed and when
○ Can anchor downstream consumers to a particular batch ID
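A minimal sketch of the Batch ID pattern in plain Python, reusing the illustrative metadata mapping from earlier; the helper names are made up and not Stitch Fix’s internal API.

import time

def new_batch_location(table_root, partition):
    batch_id = time.strftime("%Y%m%d%H%M%S")        # e.g. 20161102234252
    return f"{table_root}/{partition}/{batch_id}/"

def overwrite_partition(metadata, table_root, partition, write_files):
    location = new_batch_location(table_root, partition)
    write_files(location)                            # write new files; old batch directories are never deleted
    metadata["partitions"][partition] = location     # repoint the partition; full history stays on S3

Because nothing is deleted or rewritten in place, readers see either the old batch directory or the new one, never a half-updated file, which is what sidesteps S3’s eventual consistency for overwrites.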
Data Access:
Tooling Integration
1. Enforcing Batch IDs
2. File Formats
3. Schemas for all Tools
4. Schema Evolution
5. Redshift
6. Spark
1. Enforcing Batch IDs
● Problem:
○ How do you enforce remembering to add a Batch ID into your S3 path?
● Solution:
○ By building APIs
■ For all tooling!
1. Enforcing Batch IDs via an API
1. Enforcing Batch IDs: APIs for DS

Python:
store_dataframe(df, dest_db, dest_table, partitions=['2016'])
df = load_dataframe(src_db, src_table, partitions=['2016'])

R:
sf_writer(data = result,
          namespace = dest_db,
          resource = dest_table,
          partitions = c(as.integer(opt$ETL_DATE)))

df <- sf_reader(namespace = src_db,
                resource = src_table,
                partitions = c(as.integer(opt$ETL_DATE)))
Tool       Reading From S3+HM       Writing to S3+HM
Python     Internal API             Internal API
R          Internal API             Internal API
Spark      Standard API             Internal API
PySpark    Standard API             Internal API
Presto     Standard API             N/A
Redshift   Load via Internal API    N/A
(A rough sketch of what the internal read path does follows below.)
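As an illustration (an assumption about the shape of the internal read APIs, using boto3 for S3 access): resolve the partition’s current batch-ID location via the metastore, then read every file under that prefix.

import gzip
import boto3

def read_partition_rows(metadata, partition):
    location = metadata["partitions"][partition]      # current batch-ID directory on S3
    bucket, _, prefix = location[len("s3://"):].partition("/")
    s3 = boto3.client("s3")
    # Pagination omitted for brevity; list_objects_v2 returns up to 1000 keys per call.
    for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
        raw = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
        for line in gzip.decompress(raw).decode("utf-8").splitlines():
            yield line.split("\x00")                  # the file format choice is covered in the next section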
2. File Format
● Problem:
○ What format do you use to work with all the tools?
● Possible solutions:
○ Parquet
○ Some simple format {JSON, Delimited File} + gzip
○ Avro, Thrift, Protobuffers
● Philosophy: minimize for operational burden:
○ Choose `\0`, i.e. null delimited, gzipped files
■ Easy to write an API for this, for all tools (see the sketch after this list).
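A small sketch of how little code that format needs from Python; the file name and rows are made up.

import gzip

rows = [("20161031", "12345", "19.99"), ("20161031", "67890", "24.99")]

# Write: one record per line, fields separated by the NUL character, gzipped.
with gzip.open("part-00000.gz", "wt", encoding="utf-8") as f:
    for row in rows:
        f.write("\x00".join(row) + "\n")

# Read it back the same way.
with gzip.open("part-00000.gz", "rt", encoding="utf-8") as f:
    parsed = [line.rstrip("\n").split("\x00") for line in f]

Every tool in the stack can be taught this with a few lines of glue, which is the operational-burden trade-off being made here.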
3. Schemas for all Tools
● Problem:
○ Can’t necessarily have a single schema for all tools
■ E.g.
● Different type definitions.
● Solution:
○ Define parallel schemas that have specific types redefined in the Hive Metastore
■ E.g.
● Can redefine decimal type to be double for Presto*.
● This parallel schema would be named prod_presto.
○ Still points to same underlying data (illustrated in the sketch below).
* It didn’t use to have functioning decimal support
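To make “parallel schemas” concrete, an illustrative sketch (table names, columns, and types are assumed): two registrations that differ only in the types, both pointing at the same S3 location.

prod_sold_items = {
    "table": "prod.sold_items",
    "location": "s3://bucket/sold_items/",
    "columns": [("date", "string"), ("price", "decimal(10,2)")],
}

prod_presto_sold_items = {
    "table": "prod_presto.sold_items",
    "location": "s3://bucket/sold_items/",                   # same underlying data
    "columns": [("date", "string"), ("price", "double")],    # decimal redefined as double for Presto
}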
4. Schema Evolution
● Problem:
○ How do you handle schema evolution with 80+ Data Scientists?
■ E.g.
● Add a new column
● Delete an old column
● Solution:
○ Append columns to end of schemas (see the sketch after this list for why this works).
○ Rename columns as deprecated -- breaks code, but not data.
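One small, assumed example of why append-only evolution plays well with positional, delimited files: rows written before a column existed simply come up short and can be padded on read.

schema = ["date", "item_id", "price", "discount"]     # "discount" appended later

def parse_row(line):
    fields = line.rstrip("\n").split("\x00")
    fields += [None] * (len(schema) - len(fields))     # old rows lack the new trailing column
    return dict(zip(schema, fields))

parse_row("20161031\x0012345\x0019.99")                # old row -> discount is None
parse_row("20161101\x0067890\x0024.99\x000.10")        # new row with the appended column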
5. Redshift
● Wait, what? Redshift?
○ Predates use of Spark & Presto
○ Redshift was brought in to help joining data
■ Previously DS had to load data & perform joins in R/Python
○ Data Scientists loved Redshift too much:
■ It became a huge source of contention
■ Have been migrating “production” off of it
5. Redshift
● Need:
○ Still want to use Redshift for ad-hoc analysis
● Problem:
○ How do we keep data on S3+HM in sync with Redshift?
● Solution:
○ API that abstracts syncing data with Redshift
■ Keeps schemas in sync
■ Uses standard data warehouse staged table insertion pattern (sketched below)
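A hedged sketch of the staged table insertion pattern; a DB-API style cursor is assumed, and the table names, IAM role, and COPY options are invented, not the internal API.

def sync_partition_to_redshift(cursor, table, s3_location, date):
    staging = f"{table}_staging"
    cursor.execute(f"CREATE TEMP TABLE {staging} (LIKE {table});")
    # Bulk load the partition's files straight from S3 into the staging table
    # (real COPY options, e.g. the field delimiter, are simplified away here).
    cursor.execute(
        f"COPY {staging} FROM '{s3_location}' "
        f"IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' GZIP;"
    )
    # Replace that date's rows in the target table within one transaction.
    cursor.execute("BEGIN;")
    cursor.execute(f"DELETE FROM {table} WHERE date = %s;", (date,))
    cursor.execute(f"INSERT INTO {table} SELECT * FROM {staging};")
    cursor.execute("COMMIT;")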
6. Spark
● What does our integration with Spark look like?
○ Running on Amazon EMR using Netflix's Genie
■ Prod & Dev clusters
○ S3 still source of truth
■ Have custom write API (see the sketch after this list):
● Enforces Batch IDs
● Scala based library making use of EMRFS
● Also exposed in Python for PySpark use
○ Heavy users of Spark SQL
○ It’s the main production workhorse
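A rough PySpark sketch of the write path such an API wraps; every name here is illustrative (the real library is Scala based and internal).

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def write_partition(df, table, date):
    batch_id = time.strftime("%Y%m%d%H%M%S")
    location = f"s3://bucket/{table}/{date}/{batch_id}/"      # placeholder bucket/layout
    df.write.option("compression", "gzip").csv(location)      # actual format/options differ
    # Register the partition (if new) and point it at the fresh batch-ID directory.
    spark.sql(f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION (date='{date}')")
    spark.sql(f"ALTER TABLE {table} PARTITION (date='{date}') SET LOCATION '{location}'")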
Ad-hoc Compute Access: Using Docker
Data Scientist’s Ad-hoc workflow
● The faster this iteration cycle, the faster Data Scientists can work
● Scaling this part of the workflow is the goal
Ad hoc Infra: Options

Workstation              Env. Mgmt.                Contention Points
Laptop                   Low                       Memory & CPU
Shared Instances         Medium                    Isolation
Individual Instances     High (Low with Docker)    Time & Money
Why Docker?
● Control of environment
○ Data Scientists don’t need to worry about env.
● Isolation
○ Can host many Docker containers on a single machine.
● Better host management
○ Allows central control of machine types.
Ad-Hoc Docker Image
● Has:
○ Our internal API libraries
○ Jupyter Hub Notebooks:
■ PySpark, IPython, R
○ RStudio
○ Scientific C libs pre-installed
● Mounts EFS (NFS)
● Terminal Access (for git, file system, logs, etc.):
○ SSH access to container
○ Browser based terminal
Self Service Ad-hoc Infra: Interactive Notebooks
Jupyter Hub on Interactive Notebooks
RStudio on Interactive Notebooks
Browser Based Terminal on Interactive Notebooks
Interactive Notebooks Deployment
● Amazon ECS for cluster management.
● EC2 Instances:
○ Custom AMI based on the ECS-optimized image.
● Runs in a single Auto Scaling Group.
● S3-backed, self-hosted Artifactory as the Docker registry.
● Docker + Amazon ECS unlocks access to lots of CPU & Memory for DS (see the launch sketch below)!
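As a rough illustration of why this is self-service (the boto3 call is real; the cluster and task-definition names are invented): giving a Data Scientist one more ad-hoc container is a single ECS API call.

import boto3

ecs = boto3.client("ecs")

response = ecs.run_task(
    cluster="adhoc-notebooks",           # hypothetical ECS cluster
    taskDefinition="ds-notebook:42",     # hypothetical task definition wrapping the ad-hoc Docker image
    count=1,
)
print(response["tasks"][0]["taskArn"])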
Docker Problems So Far
● Docker tightly integrates with the Linux Kernel.
○ Hypothesis:
■ Anything that makes uninterruptible calls to the kernel can:
● Break the ECS agent because the container doesn’t respond.
● Break isolation between containers.
■ E.g. Mounting NFS (EFS)
● Docker Hub:
○ Weren’t happy with performance
○ Switched to Artifactory
In Summary - Scaling DS @ Stitch Fix
● S3 + Hive Metastore is Stitch Fix’s very scalable data warehouse.
● Internally built APIs make S3 + Hive Metastore easier to use for Data Scientists.
● Docker is used to provide a consistent environment for Data Scientists to use.
● Docker + ECS enables a self-service ad-hoc platform for Data Scientists.
Fin; Thanks! Questions?
@stefkrawczyk
Try out Stitch Fix → stitchfix.com/referral/8406746
