How to Build Modern Data Architectures Both On Premises and in the Cloud
Jacque Istok
@jstok
Pivotal Confidential–Internal Use Only
© Copyright 2017 Dell Inc.
The New Normal
DATA DEVICES
Law Enforcement
Media
Banks
Delivery Services
Marketers
Government
Private Investigators/Lawyers
Individuals
Employers
Data Users/Buyers
Analytic Services
Advertising
Catalog Co-ops
List Brokers
Websites
Information Brokers
Credit Bureaus
Media Archives
Data Aggregators
FINANCIAL
GOVERNMENT
PHONE/TV
INTERNET
MEDICAL
RETAIL
© 2017 Pivotal Software, Inc. All rights reserved.
Great organizations leverage software,
analytics, and insights to take better actions
and fundamentally change and pioneer entirely
new operational business models
Open Source Innovation
Parallel Processing
Cloud Native
Continuous Delivery
Loosely-coupled Microservices
Data Science and Machine Learning
Our View on Modern Analytics
Pipeline of a Modern Data Driven App
Data Ecosystem
Business Levers
Apps
MLlib
PL/X
Model Building
Model Tuning
Continuous Model Improvement
Data Feeds
Ingest
Filter
Enrich
Route
Needs of a Modern Data Architecture
Apps / Microservices
•  MySQL
•  Redis
•  PostgreSQL
•  Cassandra
•  MongoDB
Messaging / Integration
•  Kafka
•  Spring Cloud Data Flow
Stream / Event Processing
•  Spark Streaming
•  Storm
•  Samza
Data Science / ML Libraries
•  R libraries
•  Python libraries
•  Spark MLlib
•  SAS
Data Lake / Deep Storage
•  HDFS
•  AWS S3
•  Azure ADLS
•  Compatible Hardware Implementations
Distributed MPP Analytics
•  Amazon EMR
•  Hive
•  Impala
•  Apache HAWQ
•  Redshift
What Does It Take To Build Modern Analytics?
Users
User Centered Design
“A design approach that supports the entire development
process with user-centered activities, in order to create a
product that is easy to use and of added value to the
intended users.”
www.usabilitynet.org
Is It Useful?
usage = value
rarely used = waste
Users
Different Users Want Different Things
IT
●  Tasked with legacy system integration
●  Controls security access to comply with policy and laws
●  Operationalization
●  Enterprise Architecture
Developers
●  Build applications to interoperate
●  Develop reports and dashboards
●  Extract and Transform data
Business Analysts
●  Subject Matter Experts
●  Primary consumer of analytical models
●  SQL or BI expert
Data Scientists
●  Mathematically astute
●  Intellectual curiosity, analytical exploration
●  Domain Knowledge
●  Communication in the form of visualization
●  SQL and analytical libraries expert
Analytical Applications
Analytical Applications
A Healthy Mix of Old and New
SQL Custom Apps BI/Reporting Machine Learning AI
Native Interfaces
Native Interfaces
ANSI SQL
●  The industry standard: clear, less error-prone, and direct
●  Interoperability and consistency
●  It’s everywhere
Native Interfaces
Proprietary SQL
●  Industry non-standard
●  PostgreSQL PL/pgSQL
●  Teradata SQL
●  Oracle PL/SQL
Linear Systems
•  Sparse and Dense Solvers
•  Linear Algebra
Matrix Factorization
•  Singular Value Decomposition (SVD)
•  Low Rank
Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Multinomial Logistic Regression
•  Ordinal Regression
•  Cox Proportional Hazards Regression
•  Elastic Net Regularization
•  Robust Variance (Huber-White), Clustered Variance, Marginal Effects
Other Machine Learning Algorithms
•  Principal Component Analysis (PCA)
•  Association Rules (Apriori)
•  Topic Modeling (Parallel LDA)
•  Decision Trees
•  Random Forest
•  Conditional Random Field (CRF)
•  Clustering (K-means)
•  Cross Validation
•  Naïve Bayes
•  Support Vector Machines (SVM)
•  Prediction Metrics
•  K-Nearest Neighbors
Descriptive Statistics
Sketch-Based Estimators
•  CountMin (Cormode-Muthukrishnan)
•  FM (Flajolet-Martin)
•  MFV (Most Frequent Values)
Correlation and Covariance
Summary
Utility Modules
Array and Matrix Operations
Sparse Vectors
Random Sampling
Probability Functions
Data Preparation
PMML Export
Conjugate Gradient
Stemming
Sessionization
Pivot
Path Functions
Encoding Categorical Variables
Inferential Statistics
Hypothesis Tests
Time Series
•  ARIMA
May 2017
Graph
•  PageRank
•  Single Source Shortest Path
Native Interfaces
Machine Learning, Statistical, Graph, Path Analytics
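
To make the sketch-based estimators listed above more concrete, here is a minimal Count-Min sketch in plain Python. It is an illustration of the general technique only, not the in-database implementation; the class name, the table dimensions, and the SHA-256 row hashing are all assumptions chosen for readability.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter backed by a fixed-size table (illustrative)."""

    def __init__(self, width=256, depth=4):
        # width and depth are arbitrary example values, not tuned settings.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash per row by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the row minimum is the
        # tightest estimate; it never undercounts the true frequency.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


sketch = CountMinSketch()
for word in ["kafka", "spark", "kafka", "hdfs", "kafka"]:
    sketch.add(word)

print(sketch.estimate("kafka"))
```

The estimate is an upper bound on the true count: memory stays fixed regardless of how many distinct items arrive, at the cost of possible overcounting when items collide.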
Designed for very large graphs (billions of vertices/edges)
No need to move data and transform for external graph engine
Familiar SQL interface
Algorithms:
•  All pairs shortest path*
•  Breadth first traversal*
•  Connected components*
•  Multiple graph measures*
•  PageRank
•  Single source shortest path
Native Interfaces
Graph Analytics
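
As a rough illustration of one algorithm from the list above, the following sketch computes PageRank by power iteration over a small in-memory edge list. It shows the idea only, not the parallel SQL implementation the slide refers to; the function name, the damping factor of 0.85, the iteration count, and the sample edges are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # Start from a uniform distribution.
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives the "teleport" share first.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A node with no outgoing edges spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical four-edge graph used only for this example.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
ranks = pagerank(edges)
for node, score in sorted(ranks.items()):
    print(node, round(score, 3))
```

In this toy graph, node c collects links from both a and b, so it ends up ranked above b, which is linked only from a.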
Native Interfaces
Programmatic
•  Current Computing Interfaces
•  User Defined Types
•  User Defined Functions
•  User Defined Aggregates
•  Foundational work for containerized
Python and R compute environments
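
The idea of user-defined functions and user-defined aggregates can be sketched with Python's built-in sqlite3 module, which lets Python callables be registered and then called from SQL. SQLite stands in here purely for convenience; this is not how functions are registered in Greenplum, and the table, function, and aggregate names are hypothetical.

```python
import sqlite3


def celsius_to_fahrenheit(celsius):
    """Scalar user-defined function: evaluated once per row."""
    return celsius * 9.0 / 5.0 + 32.0


class ValueRange:
    """User-defined aggregate: returns max - min over each group."""

    def __init__(self):
        self.low = None
        self.high = None

    def step(self, value):
        # Called once per input row; NULLs are skipped.
        if value is None:
            return
        self.low = value if self.low is None else min(self.low, value)
        self.high = value if self.high is None else max(self.high, value)

    def finalize(self):
        # Called once per group to produce the aggregate result.
        if self.low is None:
            return None
        return self.high - self.low


conn = sqlite3.connect(":memory:")
conn.create_function("c_to_f", 1, celsius_to_fahrenheit)
conn.create_aggregate("value_range", 1, ValueRange)

conn.execute("CREATE TABLE readings (sensor TEXT, celsius REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("s1", 20.0), ("s1", 25.0), ("s2", 10.0), ("s2", 40.0)],
)

rows = conn.execute(
    "SELECT sensor, value_range(c_to_f(celsius)) FROM readings "
    "GROUP BY sensor ORDER BY sensor"
).fetchall()
print(rows)
```

The point of the pattern is that custom logic runs where the data lives and composes with ordinary SQL (here a scalar function nested inside an aggregate under GROUP BY).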
GPText: ANSI SQL + Text
•  Leveraging Apache Solr and GPDB
•  5 years commercial production experience
•  Apache MADlib integration for machine learning on text data
•  PL/Python and PL/Java integration for Natural Language Processing

Use Cases
•  Communications compliance and monitoring
•  Customer Sentiment analysis
•  Document Search and Query
•  Social Media Processing, etc.

Native Interfaces
Text Analytics
Round earth calculations
Current Key Features:
•  Points, Lines, Polygons, Perimeter, Area, Intersection, Contains, Distance, Long/Lat
Spatial Indexes & Bounding Boxes
Raster Support
Native Interfaces
GeoSpatial Analytics
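
A "round earth" distance can be illustrated with the haversine formula, which treats the Earth as a sphere. This is a simplified sketch, not the geodesic computation a spatial extension performs; the function name and the mean radius of 6371 km are assumptions.

```python
import math

# Mean Earth radius in kilometres; an assumption of the spherical model.
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)

    # Haversine of the central angle between the two points.
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# A quarter of the way around the equator.
print(haversine_km(0.0, 0.0, 0.0, 90.0))
```

Flat-plane distance on raw longitude/latitude values drifts badly away from the equator, which is why spherical or ellipsoidal calculations matter for real coordinates.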
Multi Structured Data
Structured Data
Multi Structured Data
...
Unstructured / Semi-structured
Sources & Pipelines
Analyze, interact, and engage with diverse data sources, localities and temperatures
Real Separation of Compute and Data Source
Hadoop Data Lakes
Public Cloud Data Lakes
Hybrid
Local
Massively Parallel
Analytics Environment
Spring Cloud Data Flow is a Microservices
toolkit for building data integration and
real-time data processing pipelines.
The Data Flow server provides interfaces to
compose and deploy pipelines onto
modern runtimes such as Cloud Foundry,
Kubernetes, Apache Mesos or Apache
YARN.
Spring Cloud Data Flow (SCDF)
Ingest - Route - Filter - Enrich
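
The ingest, filter, enrich, and route stages can be sketched as chained Python generators. This is a conceptual illustration only and does not use the Spring Cloud Data Flow API; the stage functions, the record format, the site lookup table, and the alert threshold are all hypothetical.

```python
def ingest(lines):
    """Parse raw 'device,reading' lines into event dictionaries."""
    for line in lines:
        device, reading = line.strip().split(",")
        yield {"device": device, "reading": float(reading)}


def keep_valid(events):
    """Filter: drop events with a negative (invalid) reading."""
    for event in events:
        if event["reading"] >= 0:
            yield event


def enrich(events, sites):
    """Enrich: attach the site name looked up from reference data."""
    for event in events:
        yield {**event, "site": sites.get(event["device"], "unknown")}


def route(events, threshold):
    """Route: send high readings to 'alerts' and the rest to 'archive'."""
    routed = {"alerts": [], "archive": []}
    for event in events:
        key = "alerts" if event["reading"] > threshold else "archive"
        routed[key].append(event)
    return routed


# Hypothetical sample feed and reference data.
raw = ["d1,71.5", "d2,-1", "d3,98.2"]
sites = {"d1": "plant-a", "d3": "plant-b"}

routed = route(enrich(keep_valid(ingest(raw)), sites), threshold=90.0)
print(routed)
```

Each stage consumes and produces a stream of events, so stages can be reordered or swapped independently, which is the property a pipeline toolkit builds on when it deploys each stage as its own service.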
Apache Kafka and SCDF
Data Feeds
Integrated Data Ingest layer
SCDF
(Cloud ETL 2.0)
Flexible Deployment
Run Your Analytics Anywhere
On-Premises Private Cloud Public Cloud
•  Infrastructure Agnostic: A portable, 100% software solution
•  Same platform, no switching/migration cost
ANALYTICAL APPLICATIONS
NATIVE INTERFACES
MULTI-STRUCTURED DATA
SOURCES & PIPELINES
Structured Data
JDBC, ODBC
SQL
ANSI SQL
USERS
FLEXIBLE DEPLOYMENT
Local Storage
Other RDBMSes
Spark
GemFire
Cloud Object Storage
HDFS
JSON, Apache Avro, Apache Parquet, XML, & More
Teradata SQL
Other DB SQL
Apache MADlib
ML/Statistics/Graph
Python, R, Java, Perl, C
Programmatic
Apache Solr
Text
PostGIS
GeoSpatial
Custom Apps
BI / Reporting
Machine Learning
AI
IT
Dev
Business Analysts
Data Scientists
On-Premises
Public Clouds
Private Clouds
Fully Managed Clouds
MODERN CLOUD ANALYTICS PLATFORM
Kafka
ETL
Spring Cloud Data Flow
Massively Parallel (MPP)
PostgreSQL Kernel
Petabyte Scale Loading
Query Optimizer (GPORCA)
Workload Manager
Polymorphic Storage
Command Center
SQL Compatibility (Hyper-Q)
Modern Cloud Analytics Platform
FRAUD MANAGEMENT RISK MANAGEMENT
CYBERSECURITY MANUFACTURING
PREDICTIVE MAINTENANCE
ELECTRICITY GRID
Pivotal Greenplum: Not just a Database
An Analytics Solution for every challenge
Pivotal Greenplum: Learn More
Find out more about Pivotal Greenplum at
https://pivotal.io/pivotal-greenplum
OR learn more about the open source at
http://greenplum.org/
OR give it a try yourself at
Amazon AWS or Microsoft Azure or via Download
Thank you!
Jacque Istok
@jstok
