Behavior-Driven Development (BDD) Testing with Apache Spark
Aaron Colcord
Director of Engineering, Data and Analytics
Zachary Nanfelt
Software Engineer, Data and Analytics
Who is FIS Global?
• We’re FIS Digital Finance, Mobile Data and Analytics
• One of the largest global FinTech companies
• Customers are banks and credit unions
• Ecosystem of products and services built around core banking
Data Wrangling
ETL is still a thing
The way we do it varies quite a bit
When we do it also varies
The why we do it, that’s easy
Did we cover all scenarios?
Questions on what we are doing?
Can you prove that the data transformed correctly?
Is unit testing understandable?
Were the acceptance criteria met?
Can we get through all this testing complexity in a reasonable timeframe?
What is BDD?
• An extension of Test Driven Development
• A deliberately shared, ubiquitous language
• Automated acceptance tests written as Examples that anyone can read
• Living documentation – documentation that others can read and that stays up to date with the code
• It is more agile to produce good documentation as an output of development than to require it as an input
• Enables more team members to participate in the development process
Core Problem of Data Transformation
• It is really hard to prove that data was transformed correctly in a normal, batch-oriented pipeline
• The traditional way has been to push data through the system and then query it out
• Apache Spark can speed up not only the transformation itself, but also how quickly you can validate it
– We can switch from batch-oriented to streaming
Spark is our favorite hammer
Super Widget Scenario
• Our app servers log everything in Epoch Time (Unix)
from mobile app clients all over the world
• Users seem incapable of computing this mentally
and want it to appear in their own timezone
• Crazy, but some of these guys are remote
and in different timezones
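A minimal sketch of the transformation this scenario calls for, assuming the user's offset from GMT is stored in whole hours (the deck does not state the unit). EpochLocalizer and toLocalTime are hypothetical names chosen for illustration, not the presenters' code; in a Spark job the same logic could be wrapped in a UDF or expressed as a column expression.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

/** Hypothetical helper: turns a logged epoch timestamp into the user's wall-clock time. */
final class EpochLocalizer {
    private EpochLocalizer() {}

    /**
     * @param epochSeconds seconds since 1970-01-01T00:00:00Z, as logged by the app server
     * @param offsetHours  the user's offset from GMT in whole hours (an assumed unit)
     * @return the same instant expressed as local date and time at that offset
     */
    static LocalDateTime toLocalTime(long epochSeconds, int offsetHours) {
        return Instant.ofEpochSecond(epochSeconds)
                .atOffset(ZoneOffset.ofHours(offsetHours))
                .toLocalDateTime();
    }
}
```

For example, EpochLocalizer.toLocalTime(1500000000L, -5) yields 2017-07-13T21:40, which is 02:40 GMT on 14 July shifted back five hours.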
Boilerplate Code
Extraction Code
SQL Validation Test Code
SELECT TIMEZONE_OFFSET,
       TIMESTAMP_GMT,
       TIMESTAMP_LTZ
FROM APPSTORE_REVIEWS
WHERE TIMEZONE_OFFSET <> 0
LIMIT 10;
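The query above pulls a handful of rows for a person to eyeball. As a sketch of the same check in automated form, a row is consistent when TIMESTAMP_LTZ equals TIMESTAMP_GMT shifted by the offset. This assumes TIMEZONE_OFFSET is in whole hours and that both timestamp columns have been read as date-times without a zone; neither detail is stated in the deck. TimezoneRowCheck is a hypothetical name.

```java
import java.time.Duration;
import java.time.LocalDateTime;

/** Hypothetical row-level check for the three columns selected by the query. */
final class TimezoneRowCheck {
    private TimezoneRowCheck() {}

    /**
     * Returns true when the local timestamp is exactly the GMT timestamp
     * shifted by the row's offset (assumed to be whole hours).
     */
    static boolean isConsistent(int timezoneOffsetHours,
                                LocalDateTime timestampGmt,
                                LocalDateTime timestampLtz) {
        Duration actualShift = Duration.between(timestampGmt, timestampLtz);
        return actualShift.equals(Duration.ofHours(timezoneOffsetHours));
    }
}
```

A check like this can run over every row instead of a ten-row sample, which is the gap the manual query leaves open.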
Cucumber Test
Cucumber Test, Version 2
Why this is so great
• Collaboration and Participation
• Thinking naturally begets better scenarios
• We are able to unify
– Use Cases
• We are using ETL...
– All Projects
Given a BDD Presentation
When it is late in the day
And FIS is giving the talk
Then get Excited!
• We define Features and Scenarios
– Expressive Scenarios
• Given/When/Then
– Gherkin doesn’t care which keyword you use; they just help with readability.
Wait, there’s more!
Step Definitions
• Step Definitions tell how to do it; the Feature file says what to do
• This is the boundary between the programmer’s domain and the business domain.
• It’s not all snake oil, really...
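To make the what/how split concrete, here is a toy runner in plain Java that binds scenario lines to code with regular expressions. It is a stand-in written for illustration, not Cucumber's API: ToyStepRunner, TimezoneScenario, the step wording, and the whole-hour offset are all assumptions. With real Cucumber, the binding would typically be done through the step-definition annotations that the cucumber-java library provides.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy stand-in for a BDD runner (hypothetical, NOT Cucumber's API). */
final class ToyStepRunner {
    private final Map<Pattern, Consumer<Matcher>> steps = new LinkedHashMap<>();

    /** The "how": register code for every step whose text matches the regex. */
    void define(String regex, Consumer<Matcher> body) {
        steps.put(Pattern.compile(regex), body);
    }

    /** The "what": run one scenario line; the leading keyword is only for readers. */
    void run(String line) {
        String text = line.trim().replaceFirst("^(Given|When|Then|And|But)\\s+", "");
        for (Map.Entry<Pattern, Consumer<Matcher>> step : steps.entrySet()) {
            Matcher m = step.getKey().matcher(text);
            if (m.matches()) {
                step.getValue().accept(m);
                return;
            }
        }
        throw new IllegalStateException("Undefined step: " + line);
    }
}

/** Hypothetical scenario state plus the step definitions that act on it. */
final class TimezoneScenario {
    private long epochSeconds;
    private int offsetHours;   // assumed unit: whole hours from GMT
    private String localTime;

    /** Runs a four-line scenario and returns the localized timestamp it produced. */
    static String runExample() {
        TimezoneScenario world = new TimezoneScenario();
        ToyStepRunner runner = new ToyStepRunner();

        runner.define("an event logged at epoch (\\d+)",
                m -> world.epochSeconds = Long.parseLong(m.group(1)));
        runner.define("the user is (-?\\d+) hours from GMT",
                m -> world.offsetHours = Integer.parseInt(m.group(1)));
        runner.define("the timestamp is localized",
                m -> world.localTime = Instant.ofEpochSecond(world.epochSeconds)
                        .atOffset(ZoneOffset.ofHours(world.offsetHours))
                        .toLocalDateTime()
                        .toString());
        runner.define("the local time is \"([^\"]+)\"", m -> {
            if (!m.group(1).equals(world.localTime)) {
                throw new AssertionError("expected " + m.group(1) + " but got " + world.localTime);
            }
        });

        String[] scenario = {
            "Given an event logged at epoch 1500000000",
            "And the user is -5 hours from GMT",
            "When the timestamp is localized",
            "Then the local time is \"2017-07-13T21:40\""
        };
        for (String line : scenario) {
            runner.run(line);
        }
        return world.localTime;
    }
}
```

The scenario array is the part a business reader can review; everything inside the define calls is the programmer's side of the boundary. Because the runner strips the keyword before matching, swapping Given for And changes nothing except readability.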
Cucumber Step Definition Code
Cucumber Code
For these two guys, ETL wasn’t hell,
it was target practice.
2 Kinds of People in this world… About
Enterprise Stuff
• You will notice we are sticking to Eclipse/IntelliJ
• Enterprises usually need to prove Separation of Duties and Audit Trails
• Most data processing tasks should have an established process to ensure quality and correctness.
– Every business has its own custom approach to business rules
– Consolidating these transformations ensures quality and allows QA checking
• Notebooks:
– It’s really hard to enforce that consistency and correctness in notebooks, except by compiling libraries.
– Unifying business logic and common transformations removes the prep work.
Code
https://dbc-39f78c99-dfb2.cloud.databricks.com/#notebook/28139
Tips and Tricks (Pretty Report)
• plugin = {"pretty", "html:target/cucumber"} isn’t very pretty
• Use cucumber-reports
Tips and Tricks (miscellaneous)
• Java cuke > Scala cuke
– IntelliJ integration, speed, etc...
• .config("spark.driver.host", "127.0.0.1")
– Saves test execution time on each run
• Think hard about whether things should get tested at unit or component level
– Spark takes longer to compute the path through more complex DAGs (e.g. longer tests), but those tests can provide more value
Tips and Tricks (Code Coverage for Scala Cuke)
Resources, Resources, Resources
• cucumber.io
• cucumber-reporting
• Pragmatic Books
– Cucumber Book
– Cucumber Recipes
– Cucumber for Java
• Specification by Example
• Databricks Blog
Thank you
