Get a Farm-to-Table View of Your Data
Tracking data quality and lineage on-premises
and in the cloud, on and off the cluster
Dr. Tendü Yoğurtçu, Chief Technology Officer
Today’s Speaker
Dr. Tendü Yoğurtçu
Chief Technology Officer, Syncsort
@TenduYogurtcu
www.linkedin.com/in/tenduyogurtcu
Syncsort Confidential and Proprietary - do not copy or distribute
Farm to Table
Technology Trends Advancing Data
• Advanced Business & Operational Analytics
• Cloud
• Data Science & Artificial Intelligence
• IoT & Streaming Data
• Data Governance
Data Governance
GOALS
• Regulatory compliance
• Understand data context, meaning
• Accuracy, completeness, consistency, relevancy,
timeliness, validity of data
CHALLENGES
• Multi-platform, data volume and complexity
• Diversity and consistency of sources
• Compliance demands: broader & deeper
→ Business imperative across platforms and deployment
models, on-premises and in the cloud
Data Governance
QUALITY
• Discover data sources and the relationships between data
• Apply business rules to measure data quality continuously
SECURITY
• Protect the confidentiality, integrity and availability
of data
LINEAGE
• Get insights into where data came from, what changes
were made and where it lands
→ Requires a multi-faceted approach
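As a rough illustration of the QUALITY point above, applying business rules to measure data quality can be sketched as scoring each rule by the share of records that pass it. The rule names, field names and sample records below are hypothetical and chosen only for illustration; this is not how any Syncsort or Trillium product is implemented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class Rule:
    """A named business rule: check() returns True when a record passes."""
    name: str
    check: Callable[[Record], bool]


def measure_quality(records: List[Record], rules: List[Rule]) -> Dict[str, float]:
    """Return, for each rule, the fraction of records that pass it."""
    if not records:
        # With nothing to check, report every rule as fully satisfied.
        return {rule.name: 1.0 for rule in rules}
    return {
        rule.name: sum(1 for r in records if rule.check(r)) / len(records)
        for rule in rules
    }


# Hypothetical rules and data, for illustration only.
RULES = [
    Rule("customer_id_present", lambda r: bool(r.get("customer_id"))),
    Rule(
        "amount_non_negative",
        lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
    ),
]

SAMPLE = [
    {"customer_id": "C1", "amount": 10.0},
    {"customer_id": "", "amount": 5.0},
    {"customer_id": "C3", "amount": -1.0},
    {"customer_id": "C4", "amount": 7.5},
]

scores = measure_quality(SAMPLE, RULES)
for name, score in scores.items():
    print(f"{name}: {score:.0%} of records pass")
```

Running the same rules on every new batch, and tracking the scores over time, is one simple way to make the measurement continuous.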
End to End Data Lineage in Cloudera Navigator
Diagram flow: Data Sources → Data Hub → Analytics, Visualization
• Syncsort accesses data from sources outside cluster.
• Syncsort onboards data, modifies on-the-fly to match Hadoop storage model.
• Syncsort changes, enhances, joins data in cluster with MapReduce or Spark.
• Syncsort passes source-to-cluster data lineage info to Navigator.
• Data changes made by MapReduce, Spark, HiveQL.
• Navigator gathers any other changes made to data on cluster.
• Analytics and visualizations get complete data.
• Data analyst gets end-to-end data lineage info from Navigator.
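To make the idea of end-to-end lineage concrete, the sketch below records each change as an edge (input dataset, operation, output dataset) and walks those edges backwards to find where a dataset originally came from. It is a plain in-memory illustration under stated assumptions: the dataset and operation names are hypothetical, and it does not use or represent the Cloudera Navigator API.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# One lineage edge: (input dataset, operation, output dataset).
Edge = Tuple[str, str, str]


def build_upstream_index(edges: List[Edge]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each dataset to the (input, operation) pairs that produced it."""
    upstream: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for src, op, dst in edges:
        upstream[dst].append((src, op))
    return upstream


def trace_origins(dataset: str, edges: List[Edge]) -> Set[str]:
    """Walk lineage edges backwards and return the datasets with no upstream input."""
    upstream = build_upstream_index(edges)
    origins: Set[str] = set()
    seen: Set[str] = set()
    stack = [dataset]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        parents = upstream.get(node, [])
        if not parents:
            origins.add(node)  # nothing produced this dataset: it is a source
        for src, _op in parents:
            stack.append(src)
    return origins


# Hypothetical lineage: two off-cluster sources onboarded, joined, then aggregated.
EDGES: List[Edge] = [
    ("mainframe_accounts", "onboard", "hub.accounts_raw"),
    ("rdbms_customers", "onboard", "hub.customers_raw"),
    ("hub.accounts_raw", "join", "hub.accounts_enriched"),
    ("hub.customers_raw", "join", "hub.accounts_enriched"),
    ("hub.accounts_enriched", "hiveql_aggregate", "hub.report"),
]

origins = trace_origins("hub.report", EDGES)
print(sorted(origins))
```

The point of the sketch is that the trace only reaches the off-cluster sources if the source-to-cluster edges are recorded as well as the on-cluster ones.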
Syncsort DMX-h + Cloudera Navigator for End-to-End Lineage
End-to-End Data Lineage in Apache Atlas
Diagram flow: Data Sources → Data Hub → Analytics, Visualization
• Syncsort accesses data from sources outside cluster.
• Syncsort onboards data, modifies on-the-fly to match Hadoop storage model.
• Syncsort changes, enhances, joins data in cluster with MapReduce or Spark.
• Syncsort passes source-to-cluster data lineage info to Atlas.
• Data changes made by MapReduce, Spark, HiveQL.
• Any other changes made to data on cluster are published to Atlas.
• Analytics and visualizations get complete data.
• Data analyst gets end-to-end data lineage info from Atlas.
Data Lineage + Data Quality = Foundations of Data Governance
Diagram labels: Data Sources, Data Hub, Analytics and Visualization, Data Lineage
• Discovery and Profiling
• Multi-field fuzzy matching, de-duplication, cleansing, enrichment, standardization, business rule enforcement.
• Analytics and visualizations on clean, complete data you can trust.
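As a small illustration of what multi-field fuzzy matching and de-duplication mean, the sketch below averages a string-similarity score across several fields and drops any record that closely matches one already kept. It uses Python's standard difflib as a stand-in similarity measure; the field names, sample records and the 0.85 threshold are assumptions for illustration, and this is not the matching logic of Trillium Quality for Big Data.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Sequence

Record = Dict[str, str]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def record_score(r1: Record, r2: Record, fields: Sequence[str]) -> float:
    """Average the per-field similarity across all compared fields."""
    return sum(similarity(r1[f], r2[f]) for f in fields) / len(fields)


def deduplicate(
    records: List[Record], fields: Sequence[str], threshold: float = 0.85
) -> List[Record]:
    """Keep the first record of each group of near-duplicates."""
    kept: List[Record] = []
    for rec in records:
        if not any(record_score(rec, k, fields) >= threshold for k in kept):
            kept.append(rec)
    return kept


# Hypothetical customer records; the second is a misspelled copy of the first.
CUSTOMERS = [
    {"name": "John Smith", "city": "New York"},
    {"name": "Jon Smith", "city": "New York"},
    {"name": "Maria Garcia", "city": "Boston"},
]

unique = deduplicate(CUSTOMERS, fields=("name", "city"))
print([c["name"] for c in unique])
```

Comparing every record against every kept record is quadratic, so at the data volumes discussed here a real system would need to group candidate records first and distribute the comparisons across the cluster.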
Anti-Money Laundering Solution on Hadoop at Large Global Bank
Challenge: Meet AML transaction monitoring
and FCA compliance demands
– Data too large and too widely scattered to analyze
– Disparate data sources: mainframe, RDBMS,
cloud, etc.
Requirements:
– Consolidated, clean, verified data for all analytics
and reporting.
– MUST have complete, detailed data lineage from
origin to end point
– MUST be secure: Kerberos and LDAP integration
required
– Need unmodified copy of mainframe data stored
on Hadoop for backup, archive
Anti-Money Laundering Solution on Hadoop at Large Global Bank
Solution:
• Syncsort DMX-h to create “Golden Record” on
Hadoop for compliance archiving
• Trillium Quality for Big Data for cluster-native
data verification, enrichment, and demanding
multi-field entity resolution on Spark framework
• Full end-to-end lineage to Cloudera Navigator,
from all sources, through transformations, to
data landing, including HiveQL changes
Benefits:
• New financial crimes data hub produces
high-performance results at massive scale
• Bank meets stringent Anti-Money Laundering
compliance requirements
Learn How Syncsort Solutions Can Help You
Data Infrastructure Optimization
• Mainframe Optimization
• Application Modernization
• EDW Optimization
• Cross-Platform Capacity Management
Data Availability
• High Availability & Disaster Recovery
• Mission-Critical Migration
• Cross-Platform Data Sharing
• IBM i Data Security & Audit
Data Integration
• Mainframe Access & Integration for Machine Data
• Mainframe Access & Integration for App Data
• High-performance ETL
• Change Data Capture
Data Quality
• Data Governance
• Customer 360
• Big Data Quality & Integration
• Data Enrichment & Validation
www.syncsort.com
THANK YOU

Editor's Notes

  • #3 Tendü Yoğurtçu, Ph.D., is Syncsort’s Chief Technology Officer (CTO). She has 20+ years of software industry experience, including extensive Big Data and Hadoop industry knowledge. As CTO, Tendü directs the company’s technology strategy and innovation, leading all product research and development programs. Prior to her CTO role, Tendü served as Syncsort’s General Manager of Big Data, leading the global software business for Data Integration, Hadoop and Cloud, including sales, marketing, engineering and support. Tendü has held several engineering management roles in which she directed the development of ETL, Sort, and Application Modernization products for Syncsort’s Data Integration business. She was also an Adjunct Faculty Member in the Computer Science Department at Stevens Institute of Technology.
  • #4 So, think about this scenario: you’re a chef at a high-end restaurant that has made a name for itself by providing meals of the highest quality made with the freshest organic food. Not only do you need to run a tight ship inside your kitchen (selecting the best ingredients, ensuring high standards for how your food is stored, prepared and cooked), but you also need to know everything about that food BEFORE it was delivered to your kitchen door. Where and how was your food produced? Were your chickens free-range? Are all the organic certifications valid and up to date? How was the food handled, transported and stored along the way? Was it kept at the proper temperatures to maintain freshness and safety? Did it come into contact with any other foods it shouldn’t have? What about ingredients, like your rice, pasta and spices, that are imported from other countries? How do you trace them all the way back to where they were harvested or produced? And how long will it take? An end-to-end view of where your ingredients came from, and what happened to them along the way, is critically important to maintaining the restaurant’s reputation for high-quality food. However, this is obviously a complicated and difficult task. Although these are different industries, there are many similarities between the chef’s mission to track her food supply from the farm to the table and our need to have an end-to-end view of our data. We need to know all the various sources of the data we are working with, how it was moved, transformed and combined along the way, and then what’s happening to it once we are working with it in our Enterprise Data Hub or Data Lake. And, like the restaurant scenario, this task is critical to complying with regulations and ensuring high quality for our end consumers. And achieving this end-to-end view is also very challenging.
  • #5 Some of the top technology trends are advancing what we do with data but also make it harder to get this view. Cloud: rapid growth in cloud data volumes; IT is a consumer, provider and broker of cloud services; hybrid is becoming the standard. This all drives the need for Data Governance, but also makes it harder to achieve.
  • #6 Data Science & AI: investment is growing. Early adopters: Financial Services, Banking, Retail, Telco.
  • #7 IoT and Streaming Data: IoT adoption doubled between 2013 and 2017. Links to analytics, artificial intelligence (AI) and other critical digital initiatives.
  • #8 And this all drives the need for Data Governance – but also makes it harder to achieve
  • #11 A better way is needed – so that, just like the chef, we can have a complete view of our data, from the origin to the data hub – and know what has happened to it at every step of the way
  • #13 A better way is needed – so that, just like the chef, we can have a complete view of our data, from the origin to the data hub – and know what has happened to it at every step of the way
  • #14 As mentioned earlier, lineage is very important, but it’s not the only factor to consider for governance. Data Quality plays a critical role in enterprise data governance as well. To meet the needs of our customers with Hadoop and Spark, we recently released Trillium Quality for Big Data, which runs natively in the cluster to ensure that the data lake doesn’t turn into a data swamp, while harnessing the processing power of the cluster to scale for massive data volumes.
  • #15 Let’s take a look at how this works in the real world…
  • #16 Let’s take a look at how this works in the real world…