Visual Diagnostics
at Scale
EuroSciPy 2019
Dr. Rebecca Bilbro
Chief Data Scientist, ICX Media
Co-creator, Scikit-Yellowbrick
Author, Applied Text Analysis with Python
@rebeccabilbro
A tale of
three datasets
Census Dataset
500K instances
50 features
(age, occupation,
education, sex, ethnicity,
marital status)
Sarcasm Dataset
50K instances
5K features
(“love”, 🙄, “totally”, “best”,
“surprise”, “Sherlock”,
capitalization, timestamp)
Sensor Dataset
5M instances
15 features
(Ammonia, Acetaldehyde,
Acetone, Ethylene, Ethanol,
Toluene ppmv)
Scaling pain
points are
dataset-
specific
● Many features
● Many instances
● Feature variance
● Heteroskedasticity
● Covariance
● Noise
Logistic Regression Fit Times (seconds)
500 - 5M instances / 5 - 50 features
10 seconds
Multilayer Perceptron Fit Times (seconds)
500 - 5M instances / 5 - 50 features
5 min, 48
seconds
Support Vector Machine Fit Times (seconds)
500 - 500K instances / 5 - 50 features
5 hours, 24
seconds
Support Vector Machine Fit Times (seconds)
500 - 500K instances / 5 - 50 features
5 hours, 24
seconds
😵
How to
optimize?
● Be patient
● Be wrong
● Be rich
● Steer
Adventures in
Model Visualization
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from yellowbrick.features import ParallelCoordinates

data = load_iris()
_, ax = plt.subplots()
oz = ParallelCoordinates(ax=ax, fast=True)
oz.fit_transform(data.data, data.target)
oz.finalize()
Each point drawn individually
as connected line segment
With standardization
Points grouped by class, each class
drawn as single segment
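The speedup in fast mode comes from collapsing many artists into one: instead of one matplotlib line per instance, all instances of a class can be joined into a single NaN-separated polyline and drawn as one artist. A minimal sketch of that idea in plain matplotlib (not Yellowbrick's actual implementation; `fast_parallel_coords` is a hypothetical helper for illustration):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripting
import matplotlib.pyplot as plt

def fast_parallel_coords(ax, X, y):
    """Draw parallel coordinates with one artist per class.

    Each instance's row becomes a polyline over the feature axes;
    rows of the same class are concatenated with NaN breaks so
    matplotlib renders them as a single Line2D.
    """
    n_features = X.shape[1]
    xs = np.arange(n_features)
    for label in np.unique(y):
        rows = X[y == label]
        # Append a NaN column so consecutive instances don't connect
        segs = np.hstack([rows, np.full((len(rows), 1), np.nan)])
        xrep = np.tile(np.append(xs, np.nan), len(rows))
        ax.plot(xrep, segs.ravel(), alpha=0.3, label=str(label))
    ax.legend()
    return ax

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = rng.integers(0, 3, size=300)
fig, ax = plt.subplots()
fast_parallel_coords(ax, X, y)
# Only 3 line artists, regardless of the number of instances
print(len(ax.lines))
```

Rendering cost now scales with the number of classes rather than the number of instances, which is what makes the right-hand plot tractable at scale.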
Bumps
class Estimator(object):
    def fit(self, X, y=None):
        """
        Fits estimator to data.
        """
        # set state of self
        return self

    def predict(self, X):
        """
        Predict response of X
        """
        # compute predictions pred
        return pred


class Transformer(Estimator):
    def transform(self, X):
        """
        Transforms the input data.
        """
        # transform X to X_prime
        return X_prime


class Pipeline(Transformer):
    @property
    def named_steps(self):
        """
        Returns a sequence of estimators
        """
        return self.steps

    @property
    def _final_estimator(self):
        """
        Terminating estimator
        """
        return self.steps[-1]
The scikit-learn API
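The interfaces sketched above are concrete in every scikit-learn object; a short example exercising all three at once through a `Pipeline`:

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# A Pipeline is itself an Estimator: fit() returns self,
# and predict() is delegated to the final step.
pipe = Pipeline([
    ("scale", StandardScaler()),                 # Transformer: fit + transform
    ("clf", LogisticRegression(max_iter=1000)),  # Estimator: fit + predict
])
pipe.fit(X, y)

print(sorted(pipe.named_steps))  # ['clf', 'scale']
print(pipe.predict(X[:3]))
```

Because everything honors the same contract, a visualizer can wrap any estimator (or whole pipeline) and hook its drawing into `fit`, `transform`, and `score`.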
class Visualizer(Estimator):
    def draw(self):
        """
        Draw called from scikit-learn methods.
        """
        return self.ax

    def finalize(self):
        self.set_title()
        self.legend()

    def poof(self):
        self.finalize()
        plt.show()


import matplotlib.pyplot as plt
from yellowbrick.base import Visualizer

class MyVisualizer(Visualizer):
    def __init__(self, ax=None, **kwargs):
        super(MyVisualizer, self).__init__(ax, **kwargs)

    def fit(self, X, y=None):
        self.draw(X)
        return self

    def draw(self, X):
        if self.ax is None:
            self.ax = plt.gca()
        self.ax.plot(X)

    def finalize(self):
        self.set_title("My Visualizer")
The Yellowbrick API
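The lifecycle above (fit triggers draw, finalize decorates, poof renders) can be exercised end-to-end; here is a self-contained stand-in written in plain matplotlib, so it runs without Yellowbrick installed — the real `yellowbrick.base.Visualizer` wires up the same hooks plus axes management:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripting
import matplotlib.pyplot as plt

class MyVisualizer:
    """Minimal stand-in for the Yellowbrick lifecycle:
    fit() triggers draw(), finalize() decorates, poof() renders."""

    def __init__(self, ax=None, **kwargs):
        self.ax = ax

    def fit(self, X, y=None):
        self.draw(X)
        return self  # scikit-learn convention: fit returns self

    def draw(self, X):
        if self.ax is None:
            self.ax = plt.gca()
        self.ax.plot(X)
        return self.ax

    def finalize(self):
        self.ax.set_title("My Visualizer")

    def poof(self):
        self.finalize()
        plt.show()

viz = MyVisualizer()
viz.fit([1, 4, 2, 8])
viz.finalize()
print(viz.ax.get_title())  # My Visualizer
```

The split between `draw` (called during fitting, possibly many times) and `finalize` (titles, legends, called once before rendering) is what lets visualizers stay cheap inside training loops.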
Oneliners
from sklearn.linear_model import Lasso
from yellowbrick.regressor import ResidualsPlot
# Option 1: scikit-learn style
viz = ResidualsPlot(Lasso())
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.poof()
from sklearn.linear_model import Lasso
from yellowbrick.regressor import residuals_plot
# Option 2: Quick Method
viz = residuals_plot(
    Lasso(), X_train, y_train, X_test, y_test
)
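Under the hood, both options compute the same quantity; a sketch of what a residuals plot shows, using only scikit-learn and matplotlib (this is an illustration of the concept, not Yellowbrick's implementation):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripting
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = Lasso().fit(X_train, y_train)

# Residual = observed - predicted; a good fit scatters residuals
# randomly around zero with no visible structure.
resid = y_test - model.predict(X_test)

fig, ax = plt.subplots()
ax.scatter(model.predict(X_test), resid, alpha=0.6)
ax.axhline(0, color="k", lw=1)
ax.set_xlabel("Predicted value")
ax.set_ylabel("Residual")
```

Patterns in this scatter (a funnel shape, curvature) are exactly the heteroskedasticity and bias signals listed among the scaling pain points earlier.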
Visual Comparison Tests
=========================================== test session starts ============================================
platform darwin -- Python 3.7.1, pytest-5.0.0, py-1.8.0, pluggy-0.12.0
rootdir: /Users/rbilbro/pyjects/yb, inifile: setup.cfg
plugins: flakes-4.0.0, cov-2.7.1
collected 932 items
tests/__init__.py s... [ 0%]
tests/base.py s [ 0%]
tests/conftest.py s [ 0%]
tests/fixtures.py s [ 0%]
tests/images.py s [ 0%]
tests/rand.py s [ 0%]
tests/test_base.py s............ [ 2%]
...........................................................................................................
...........................................................................................................
...........................................................................................................
...........................................................................................................
tests/test_utils/test_target.py s............ [ 68%]
tests/test_utils/test_timer.py s..... [ 68%]
tests/test_utils/test_types.py s.................................................................... [ 70%]
....x................................x.............................................................. [ 72%]
.... [ 73%]
tests/test_utils/test_wrapper.py s....
===================== 854 passed, 72 skipped, 6 xfailed, 33 warnings in 225.96 seconds =====================
Roadmap
Brushing and Filtering
OK for only 5 features; not good for 23 features
Machine-learning oriented aggregation
YB (current) Seaborn
Parallelization with joblib
Elbow Curve Validation Curve
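Each point on an elbow or validation curve is an independent model fit, which is why these curves parallelize cleanly with joblib. A hedged sketch of the pattern (`inertia_for_k` is a hypothetical helper, not Yellowbrick's internal code):

```python
from joblib import Parallel, delayed
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

def inertia_for_k(k):
    """One independent unit of work: fit KMeans for a single k."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_

ks = range(2, 8)

# Sequential baseline
serial = [inertia_for_k(k) for k in ks]

# Same fits, dispatched across worker processes
parallel = Parallel(n_jobs=2)(delayed(inertia_for_k)(k) for k in ks)

# Parallelism changes the wall-clock time, not the results
for s, p in zip(serial, parallel):
    assert abs(s - p) < 1e-6
```

Since the fits share no state, the speedup is close to linear in the number of workers until memory bandwidth becomes the bottleneck.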
from yellowbrick.features import Rank2D
from yellowbrick.pipeline import VisualPipeline
from yellowbrick.model_selection import CVScores
from yellowbrick.regressor import PredictionError
viz_pipe = VisualPipeline([
    ('rank2d', Rank2D(features=features, algorithm='covariance')),
    ('prederr', PredictionError(model)),
    ('cvscores', CVScores(model, cv=cv, scoring='r2'))
])
Visual
Pipelines
Models are aggregations,
so are visualizations...
…so use visualizations
to steer model selection!
scikit-yb.org
