Big Data for the rest of us
Lawrence Spracklen
SupportLogic
lawrence@supportlogic.io
www.linkedin.com/in/spracklen
SupportLogic
• Extract Signals from enterprise CRM systems
• Applied machine learning
• Complete vertical solution
• Go-live in days!
• We are hiring!
@Scale 2018
• Sound like your Big Data problems?
• This is Extreme data!
• Do these solutions help or hinder Big Data for the rest of us?
“Exabytes of data…..”
“1500 manual labelers…..”
“Sub second global propagation of likes…..”
End-2-End Planning
• Numerous steps/obstacles to successfully leveraging ML
• Data Acquisition
• Data Cleansing
• Feature Engineering
• Model Selection and Training
• Model Optimization
• Model Deployment
• Model Feedback and Retraining
• Important to consider all steps before deciding on an approach
• Upstream decisions can severely limit downstream options (see the sketch after this list)
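As a hedged illustration of how these steps chain together (the dataset, imputer, and model below are arbitrary choices, not from the talk), scikit-learn's Pipeline makes each stage explicit, so an upstream change is immediately visible downstream:

    # Minimal end-to-end sketch: cleanse -> engineer features -> train -> evaluate
    from sklearn.datasets import load_breast_cancer
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    pipeline = Pipeline([
        ("cleanse", SimpleImputer(strategy="median")),   # data cleansing
        ("features", StandardScaler()),                  # feature engineering
        ("model", LogisticRegression(max_iter=1000)),    # selection and training
    ])
    pipeline.fit(X_train, y_train)
    print(pipeline.score(X_test, y_test))                # pre-deployment check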
ML Landscape
• How do I build a successful production-grade solution from all these
disparate components that don’t play well together?
Data Set Availability
• Is the necessary data available?
• Are there HIPAA, PII, GDPR concerns?
• Is it spread across multiple systems?
• Can the systems communicate?
• Data fusion
• Move the compute to the data…
• Legacy infrastructure decisions can dictate optimal approach
Feature Engineering
• Essential for model performance, efficacy, robustness and simplicity
• Feature extraction
• Feature selection
• Feature construction
• Feature elimination
• Dimensionality reduction
• Traditionally a laborious manual process
• Automation techniques becoming available
• e.g. TransmogrifAI, Featuretools
• Leverage feature stores! (see the sketch after this list)
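The slide doesn't prescribe tooling, but as one sketch, the manual versions of feature selection and dimensionality reduction are a few lines in scikit-learn (the values of k and n_components below are arbitrary):

    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_breast_cancer(return_X_y=True)          # 569 samples, 30 features

    # Feature selection: keep the 10 features most predictive of the target
    selected = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

    # Dimensionality reduction: project onto 5 principal components
    reduced = PCA(n_components=5).fit_transform(selected)
    print(reduced.shape)                                # (569, 5)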
Model Training
• Big differences in the range of algorithms offered by different
frameworks
• Don’t just jump to the most complex!
• Easy to automate selection process
• Just click ‘go’
• Automate hyperparameter optimization
• Beyond the nested for-loop! (see the sketch after this list)
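One way past the nested for-loop, sketched with scikit-learn's RandomizedSearchCV (search ranges are illustrative):

    from scipy.stats import randint
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    X, y = load_breast_cancer(return_X_y=True)

    # Sample the hyperparameter space instead of exhaustively nesting loops
    search = RandomizedSearchCV(
        RandomForestClassifier(),
        param_distributions={
            "n_estimators": randint(50, 500),
            "max_depth": randint(2, 20),
        },
        n_iter=20,    # 20 sampled configurations
        cv=5,         # 5-fold cross-validation for each
        n_jobs=-1,    # use every core
    )
    search.fit(X, y)
    print(search.best_params_, search.best_score_)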
Model Ops
• What happens after the models are created?
• How does the business benefit from the insights?
• Operationalization is frequently the weak link
• Operationalizing PowerPoint?
• Hand-rolled scoring flows?
Barriers to Model Ops
• Scoring is often performed on a different data platform than training
• Framework-specific persistence formats (see the sketch after this list)
• Complex data preprocessing requirements
• Data cleansing and feature engineering
• Batch training versus real-time/stream scoring
• How frequently are models updated?
• How is performance monitored?
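To make the persistence barrier concrete, a minimal sketch with scikit-learn and joblib (the file name is hypothetical); the pickle only loads where a compatible scikit-learn is installed, and the scoring side must replay the exact same preprocessing:

    import joblib
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression

    X, y = load_breast_cancer(return_X_y=True)
    model = LogisticRegression(max_iter=1000).fit(X, y)

    # Framework-specific persistence: tied to this library (and often version)
    joblib.dump(model, "model.joblib")

    # ...later, on the scoring platform: any drift in the upstream
    # preprocessing silently breaks these predictions
    restored = joblib.load("model.joblib")
    print(restored.predict(X[:5]))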
Typical Deployments
PMML & PFA
• PMML has long been available as a framework-agnostic model representation (export sketch after this list)
• Frequently requires helper scripts
• PFA is the potential successor….
• Addresses lots of PMML’s shortcomings
• Scoring engines accepting R or Python scripts
• Easy to use AWS Lambda!
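As one concrete export route (an assumption, not named in the talk), the sklearn2pmml package turns a fitted scikit-learn pipeline into PMML; note that it shells out to a Java converter under the hood:

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn2pmml import sklearn2pmml
    from sklearn2pmml.pipeline import PMMLPipeline

    X, y = load_breast_cancer(return_X_y=True)

    # Wrap the estimator so preprocessing can travel with the model
    pipeline = PMMLPipeline([("classifier", LogisticRegression(max_iter=1000))])
    pipeline.fit(X, y)

    # Emit a framework-agnostic PMML file (requires a Java runtime)
    sklearn2pmml(pipeline, "model.pmml")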
Interpreting Models
• A prediction without an explanation is of limited value
• Why is this outcome being predicted?
• What action should be taken as a result?
• Avoid ML models that are “black boxes”
• Tools for providing prediction explanations are emerging
• E.g. LIME
Example LIME output
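A minimal sketch that produces this kind of output with the LIME package (the dataset and model are illustrative):

    from lime.lime_tabular import LimeTabularExplainer
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    data = load_breast_cancer()
    model = RandomForestClassifier().fit(data.data, data.target)

    explainer = LimeTabularExplainer(
        data.data,
        feature_names=data.feature_names,
        class_names=data.target_names,
        mode="classification",
    )

    # Explain one prediction: which features pushed it toward which class?
    explanation = explainer.explain_instance(
        data.data[0], model.predict_proba, num_features=5
    )
    print(explanation.as_list())    # [(feature condition, weight), ...]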
Prototype in Python
• Explore the space!
• Work through the end-2-end solution
• Don’t prematurely optimize
• Great Python tooling
• e.g. Jupyter Notebooks, Cloudera Data Science Workbench
• Don’t let the data leak to laptops!
Python is slow
• Python is simple, flexible, and offers a massive range of functionality
• Pure Python typically hundreds of times slower than C
• Many Python implementations leverage C under the hood (timing sketch below)
• Even naive Scala or Java implementations are slow
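A quick timing sketch of the gap (exact speedups vary by machine and workload):

    import timeit
    import numpy as np

    values = list(range(1_000_000))
    array = np.arange(1_000_000)

    # Pure Python: an interpreted loop over a million elements
    python_time = timeit.timeit(lambda: sum(v * v for v in values), number=10)

    # NumPy: the same reduction runs in compiled C under the hood
    numpy_time = timeit.timeit(lambda: np.dot(array, array), number=10)

    print(f"speedup: {python_time / numpy_time:.0f}x")   # commonly 100x or more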
1000X faster….
Everything Python
• Python wrappers are available for most packages
• Even in Spark, momentum is moving to Python
• Wrappers for C++ libraries like Shogun
Spark
• Optimizing for speed, data size or both?
• Increasingly rich set of ML algorithms
• Still missing common algorithms
• E.g. Multiclass GBTs
• Not all OSS implementations are good
• Hard to correctly resource Spark jobs (see the sketch after this list)
• Autotuning systems available
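A hedged PySpark sketch of both pain points; the resource settings and input path are placeholders, and (as of Spark 2.x) GBTClassifier only handles binary labels:

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import GBTClassifier

    # Resourcing is guesswork without measurement; these values are illustrative
    spark = (
        SparkSession.builder
        .appName("gbt-example")
        .config("spark.executor.memory", "8g")
        .config("spark.executor.cores", "4")
        .getOrCreate()
    )

    # Hypothetical input: a 'features' vector column and a binary 'label' column
    train = spark.read.parquet("features.parquet")

    # Binary only; multiclass GBTs are one of the missing algorithms noted above
    gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=50)
    model = gbt.fit(train)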
System Sizing
• Why go multi-node?
• CPU or Memory constraints
• Aggregate data size is very different from the size of the individual data sets
• A data lake can contain petabytes, but each dataset may be only tens of GB….
• Is the raw data bigger or smaller than final data being consumed by the model?
• Spark for ETL
• Is the algorithm itself parallel?
Single Node ML
• Single-node memory, even on x86 systems, can now measure in the tens of terabytes
• Likely to expand further with NVDIMMs
• A 40-vCPU, ~1TB x86 instance is only ~$4/hour on Google Cloud
• Many high-performance single-node ML libraries exist! (see the sketch below)
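For example (one library among many, not an endorsement from the talk), XGBoost saturates a single multi-core box with one parameter:

    import xgboost as xgb
    from sklearn.datasets import load_breast_cancer
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # n_jobs=-1 fans training out across every core on the node
    model = xgb.XGBClassifier(n_estimators=200, n_jobs=-1)
    model.fit(X_train, y_train)
    print(accuracy_score(y_test, model.predict(X_test)))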
Hive & Postgres
• On Hadoop, many data scientists are constrained to Hive or
Impala for security reasons
• Can be very limiting for ‘real’ data science
• Hivemall for analytics
• Is a traditional DB a better choice?
• Better performance in many instances
• Apache MADlib for analytics (see the sketch after this list)
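A sketch of in-database analytics, adapted from the MADlib documentation's logistic-regression example; the connection string, table, and columns are hypothetical:

    import psycopg2

    # MADlib runs the regression inside Postgres, next to the data
    conn = psycopg2.connect("dbname=analytics")
    with conn.cursor() as cur:
        cur.execute("""
            SELECT madlib.logregr_train(
                'patients',                           -- source table
                'patients_logregr',                   -- output model table
                'second_attack',                      -- dependent variable
                'ARRAY[1, treatment, trait_anxiety]'  -- independent variables
            );
        """)
    conn.commit()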
Conclusions
• No one-size-fits-all solution!
• Much more to a successful ML project than a cool model
• Not all frameworks play together
• Decisions can limit downstream options
• Need to think about the problem end-2-end
• From data acquisition to model deployment
