A Better Architecture for Data:
Adaptable, Scalable, and Smart
Paul Boal &
Adam Doyle
June 8, 2018, St. Louis
Agenda
1. Modern Data Architecture Myths
2. Characteristics of Modern Data Architecture
a. Governed, Secure
b. Adaptable, Customer Centric, Collaborative
c. Flexible, Elastic, Simple, Resilient
d. Smart, Automated
3. Reference Data Architecture
4. How do I get there?
5. Recap
2
Myths
3
MYTH #1: "Just install Hadoop"
A modern data architecture is not a single technology or single vendor solution.
Modern data architectures combine a portfolio of technologies to create an ecosystem with certain characteristics.
4
MYTH #2: "Modern must mean NoSQL"
NoSQL technologies provide an efficient way to manage and access data under certain circumstances, but traditional relational databases and SQL continue to provide the most powerful way to organize and query well-known data.
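As a small illustration of the point about well-known data, here is a minimal sketch using Python's built-in sqlite3 module. The `customers` and `orders` tables, their columns, and the function name are hypothetical and are not taken from the talk; the sketch only shows how a declared schema lets one short query organize and summarize the data.

```python
import sqlite3


def total_sales_by_customer(customers, orders):
    """Load rows into an in-memory relational schema and summarize with SQL.

    Table and column names are illustrative assumptions.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE orders ("
        " id INTEGER PRIMARY KEY,"
        " customer_id INTEGER NOT NULL REFERENCES customers(id),"
        " amount REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO customers (id, name) VALUES (?, ?)", customers)
    conn.executemany(
        "INSERT INTO orders (id, customer_id, amount) VALUES (?, ?, ?)", orders
    )
    # One declarative query joins and aggregates the well-known structure.
    rows = conn.execute(
        """
        SELECT c.name, SUM(o.amount) AS total
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.name
        ORDER BY total DESC
        """
    ).fetchall()
    conn.close()
    return rows
```

Calling `total_sales_by_customer` with a list of `(id, name)` tuples and a list of `(id, customer_id, amount)` tuples returns one `(name, total)` row per customer, largest total first.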
5
MYTH #3: "Big data is magical pixie dust"
We talk a lot about the accelerating growth of data, the decreasing cost of storage and compute power, and the power of data science. It's convenient to believe that throwing all of this into a pot and simmering will produce results while we wait. The truth is that applying data, technology, and analytics still requires planning, analysis, and careful execution.
6
MYTH #4: "More data is always better"
Not all data is created equal. Sometimes you might have unreliable or invalid data that will obfuscate results if used inappropriately.
Using extraneous data can make analysis more complicated by adding time to filter the data set and select features. Sometimes more just means more work.
7
MYTH #5: "I have to replace everything I have right now"
One of the characteristics of a modern data architecture is flexibility, meaning that your modernization should be developed incrementally, implementing new capabilities in a way that integrates with and slowly supplants existing limited technologies.
8
Characteristics
9
Governed, Secure
10
Governed:
The architecture and its components have to evolve and adapt in ways that are intentional and informed by enterprise strategy.
Make collaboration the default.
Communicate and then communicate some more.
Treat every component as if another team may want to use it, too.
Secure:
Accessing information should be easy and should effortlessly ensure that users are knowingly using the right information for the right purpose.
Security as an enabler of usage, not a denier of access.
Track and log access for audit purposes and for learning.
11
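To make "track and log access for audit purposes and for learning" concrete, here is a minimal sketch of a data catalog wrapper that records every read attempt, allowed or denied, together with the stated purpose. The class name `AuditedCatalog`, the grant structure, and the field names are assumptions made for illustration; they do not describe Apache Atlas, Ranger, or any other tool mentioned in the talk.

```python
import logging
from datetime import datetime, timezone

audit_log = logging.getLogger("data_access_audit")


class AuditedCatalog:
    """Hypothetical catalog that checks grants and records every access."""

    def __init__(self, datasets, grants):
        self._datasets = datasets  # dataset name -> data
        self._grants = grants      # user -> set of dataset names
        self.audit_trail = []      # in-memory trail, for the sketch only

    def read(self, user, dataset, purpose):
        allowed = dataset in self._grants.get(user, set())
        entry = {
            "at": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "dataset": dataset,
            "purpose": purpose,
            "allowed": allowed,
        }
        # Record the attempt before deciding, so denials are auditable too.
        self.audit_trail.append(entry)
        audit_log.info("access %s", entry)
        if not allowed:
            raise PermissionError(f"{user} has no grant for {dataset}")
        return self._datasets[dataset]
```

Because denied attempts are logged alongside successful ones, the trail can serve both purposes named on the slide: audit, and learning which data people are trying to reach.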
ING
Apache Atlas
Open Metadata and
Governance - APIs,
notification systems,
integration of metadata,
security, and governance
related tools
12
Governed, Secure
https://www.slideshare.net/Hadoop_Summit/open-metadata-and-governance-with-apache-atlas?qid=6ea30d4f-15af-46ad-b580-349f78bb7752&v=&b=&from_search=9
Frameworks and Tools
Open Source Core
Apache Atlas - Open Metadata Management
Apache NiFi - Data Provenance
Apache Sentry/Ranger - Fine-grained Access
Control
13
Governed, Secure
Vendor Participants
Adaptable, Customer Centric, Collaborative
It is not the strongest of the species that
survives, nor the most intelligent. It is the
one that is most adaptable to change.
~Charles Darwin
14
Adaptable:
The more you deliver, the more you will learn about what is really needed, so be prepared to change and build solutions that can change easily.
Agile data modeling.
Agile analytics.
Customer Centric:
Focus on delivering solutions that make sense to the people who will use them rather than following standards and rules above all else.
The DBMS is not your user.
Ralph Kimball and Edgar Codd are not your users.
The Architecture Review Board is not your user.
Collaborative:
Solutions that are interactively designed and built by a team with diverse capabilities and backgrounds can produce a result better than what any one individual would have done.
Collaboration is more than requirements gathering.
Collaboration is something that has to happen every day.
Communicate, communicate, communicate. And then communicate.
15
Agile Data
16
Adaptable, Customer Centric, Collaborative
http://agiledata.org/
Tools and Techniques
Model Storming
Rapid experimentation
Data science environments
Wherescape, Snowflake, ThoughtSpot
17
Adaptable, Customer Centric, Collaborative
Simple, Elastic, Resilient, Flexible
Notice that the stiffest tree is
most easily cracked while the
bamboo or willow survives by
bending with the wind.
-Bruce Lee
18
Simple:
Individual components should only be as complex as necessary.
Reduce interdependencies.
Use shared components.
Elastic:
The system can easily handle an increase in data volume, users, or complexity.
Distributed computing.
Cloud.
DevOps.
Resilient:
Errors in data or processing don't cause large parts of the system to fail.
Isolate components.
Tolerate, isolate, and report bad data.
Flexible:
Change to the system is easy to accommodate and doesn't break other components.
Microservices.
Versioned interfaces.
Backward compatibility.
19
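One way to read "tolerate, isolate, and report bad data" is a pipeline step that sets failing records aside instead of aborting the whole run. The sketch below assumes records are dictionaries; the function names `process_records` and `parse_sale` and the field names are hypothetical.

```python
def process_records(records, transform):
    """Apply transform to each record; quarantine failures instead of failing.

    Returns (good, quarantined). Each quarantined entry keeps the original
    record, its position, and the error, so it can be reported and replayed.
    """
    good, quarantined = [], []
    for index, record in enumerate(records):
        try:
            good.append(transform(record))
        except (KeyError, TypeError, ValueError) as err:
            quarantined.append(
                {"index": index, "record": record, "error": repr(err)}
            )
    return good, quarantined


def parse_sale(record):
    """Illustrative transform: requires a 'sku' and a numeric 'amount'."""
    return {"sku": record["sku"], "amount": float(record["amount"])}
```

Only the expected data errors are caught here; anything else still surfaces as a failure, which keeps genuine bugs from being silently quarantined.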
EarEcstasy
20
Data staging and
Data Lake only
contain needed data.
Each data pipeline is
only as complex as it
needs to be to deliver
on a narrow scope.
Data is only
integrated as
needed, keeping
processes simple.
Simple, Elastic, Resilient, Flexible
https://www.slideshare.net/AmazonWebServices/aws-summit-singapore-get-to-know-your-customers-modern-data-architecture-93784711
Tools and Technologies
21
Cloud-based Infrastructure
Cloud-native Services
DevOps
Containers
Open Source
Simple, Elastic, Resilient, Flexible
Automated, Smart
22
I'm afraid I can't make
that into a star schema,
Dave.
We are going through the process where
software will automate software, automation
will automate automation.
-Mark Cuban
Automated:
Automate tasks needed to optimize the function of the system, to detect significant changes, and to alert users when attention is needed.
Metadata injection.
Schema change detection.
Anomaly detection.
Alerting.
Smart:
Schema detection. Self-tuning databases. Jeopardy champion.
Data shaping, data quality recommendations.
Natural Language Processing.
Machine Learning.
Recommender systems.
Deep Learning.
23
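Schema change detection, one of the automation tasks listed on this slide, can be sketched as a comparison between the schema a pipeline expects and the schema inferred from incoming rows. The helper names `infer_schema` and `diff_schema` and the type-by-name convention are assumptions for illustration; a production system would more likely read schemas from a metadata catalog.

```python
def infer_schema(rows):
    """Infer {column: type name} from dict rows, using the first value seen."""
    schema = {}
    for row in rows:
        for name, value in row.items():
            schema.setdefault(name, type(value).__name__)
    return schema


def diff_schema(expected, observed):
    """Report columns that were added, removed, or changed type."""
    expected_cols, observed_cols = set(expected), set(observed)
    return {
        "added": sorted(observed_cols - expected_cols),
        "removed": sorted(expected_cols - observed_cols),
        "retyped": sorted(
            name
            for name in expected_cols & observed_cols
            if expected[name] != observed[name]
        ),
    }
```

A non-empty result from `diff_schema` is the kind of signal that could drive the alerting the slide mentions, so that a person is only involved when something actually changed.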
EXAMPLE
83% reduction in workload matching complex, low quality data with contextual analysis
24
Automated, Smart
TOOLS
Integrated Machine Learning
Integrated Search
Intelligent Data Classification
Natural Language Processing
25
Automated, Smart
Reference Architecture
26
Modern Data Architecture
27
Everything should be made as
simple as possible, but not simpler.
- A. Einstein
Next steps
29
How do I get there from here?
30
Start with something you understand well from a business perspective.
Select specific, valuable, measurable business cases.
Add simple machine learning use cases.
Identify use cases to move from a batch processing system to a streaming solution.
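For the last step, a good candidate for moving from batch to streaming is often a simple aggregate that can be maintained incrementally. The sketch below contrasts the two styles for a running total per key; the event shape `(key, amount)` and the names `batch_totals` and `StreamingTotals` are hypothetical, and no particular streaming platform is implied.

```python
from collections import defaultdict


def batch_totals(events):
    """Batch style: recompute totals from the full set of events."""
    totals = defaultdict(float)
    for key, amount in events:
        totals[key] += amount
    return dict(totals)


class StreamingTotals:
    """Streaming style: update state as each event arrives."""

    def __init__(self):
        self._totals = defaultdict(float)

    def update(self, key, amount):
        # The current answer is available after every event,
        # without waiting for the next batch window.
        self._totals[key] += amount
        return self._totals[key]

    def snapshot(self):
        return dict(self._totals)
```

Both paths produce the same totals for the same events, which makes this kind of use case easy to validate: run the streaming version alongside the existing batch job and compare results before switching over.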
Recap
31
The Myths are Just Myths
32
● You don't "just need Hadoop" -
You may not even need Hadoop at all!
● NoSQL has a place, but that isn't the entire solution either.
● There's no magical pixie dust here.
This transformation will take real work.
● More data is not necessarily better -
no matter how much we data hoarders want it to be.
● By definition, you have to incrementally create your modern data
architecture, because it also has to continue to evolve.
Governed, Secure
33
Maintain data and the data architecture in
a way that makes governance and security
a natural and easy part of doing work.
Adaptable, Customer Centric, Collaborative
34
Apply data toward real
challenges and opportunities that
focus on customers and be willing
and able to pivot as needed.
Simple, Elastic, Resilient, Flexible
35
Build your data architecture, your teams,
and your processes in a way that creates a
high capacity for change.
Automated, Smart
36
Create systems that can do more of
the work of ingestion, storage, and
integration without your intervention.
Thank You!
37


Editor's Notes

  • #3 Intro and Myths - Paul; Characteristics A, B - Paul; Characteristics C, D - Adam; Reference Architecture - Adam; How do I Get There - Adam or Paul or Back-and-Forth; Recap - Paul
  • #11 These characteristics describe the processes by which your data is maintained. Maybe here we want to tell stories about companies that didn’t secure their data (Target, Equifax, Schnucks)
  • #12 These characteristics describe the processes by which your data is maintained. Maybe here we want to tell stories about companies that didn’t secure their data (Target, Equifax, Schnucks)
  • #13 These characteristics describe the processes by which your data is maintained.
  • #14 These characteristics describe the processes by which your data is maintained.
  • #15 These characteristics describe the way in which you use your data. Built for purpose
  • #16 These characteristics describe the way in which you use your data. Built for purpose
  • #17 These characteristics describe the way in which you use your data.
  • #18 These characteristics describe the way in which you use your data.
  • #20 These characteristics describe the architecture and its capacity to change.
  • #21 These characteristics describe the architecture and its capacity to change.
  • #22 These characteristics describe the architecture and its capacity to change.
  • #23 These characteristics describe the way in which your data is integrated. Informatica ClAIre
  • #24 These characteristics describe the way in which your data is integrated. Informatica ClAIre
  • #25 These characteristics describe the way in which your data is integrated.
  • #26 These characteristics describe the way in which your data is integrated.
  • #28 These characteristics describe the architecture and its capacity to change.
  • #29 Processing data - Mastering, Integration, De-identification, Data Warehouse/Data Mart for reporting with rigor; Provisioning - Pie in the Sky - I’d like some “Net Sales”