Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcord and Zachary Nanfelt

Aaron Colcord, Director of Engineering, Data and Analytics
Zachary Nanfelt, Software Engineer, Data and Analytics

Who is FIS Global?
• We’re FIS Digital Finance, Mobile Data and Analytics
• One of the largest global FinTech companies
• Customers are banks and credit unions
• Ecosystem of products and services built around core banking

Data Wrangling
1. ETL is still a thing
2. The way we do it varies quite a bit
3. When we do it also varies
4. The why we do it, that’s easy

Did we cover all scenarios?
Questions on what we are doing?
Can you prove that the data transformed correctly?
Is unit testing understandable?
Were the acceptance criteria met?
Can we handle all this testing complexity in a reasonable timeframe?

What is BDD?
• An extension of Test Driven Development
• A deliberately shared, ubiquitous language
• Automated acceptance tests written as Examples that anyone can read
• Living documentation: documentation others can read stays up to date with the code
• It is more agile to produce good documentation as an output of development than to require it as an input
• Enables more team members to participate in the development process

Core Problem of Data Transformation
• It is really hard to prove data transformed correctly in a normal, batch-oriented pipeline
• The traditional way has been to push data through the system and then query it out
• Apache Spark can accelerate not only the speed at which you transform, but also the speed at which you can validate transformations
– We can switch from batch-oriented to streaming

Spark is our favorite hammer
Beautiful Baby

Super Widget Scenario
• Our app servers log everything in Epoch Time (Unix) from mobile app clients all over the world
• Users seem incapable of computing this mentally and want it to appear in their own timezone
• Crazy, but some of these guys are remote and in different timezones
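The transformation this scenario asks for can be sketched in plain Java. This is only an illustration, not the presenters' code: the class name `EpochToLocal`, the method `toLocal`, and the choice of epoch seconds with a whole-hour offset from GMT are all assumptions, since the slides do not state the units.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

// Hypothetical helper: the names and units are assumptions, not taken from the talk.
public class EpochToLocal {

    // Converts a Unix epoch timestamp (assumed to be in seconds) to the
    // wall-clock time at a fixed offset from GMT (assumed to be whole hours).
    static LocalDateTime toLocal(long epochSeconds, int offsetHours) {
        ZoneOffset offset = ZoneOffset.ofHours(offsetHours);
        return LocalDateTime.ofInstant(Instant.ofEpochSecond(epochSeconds), offset);
    }

    public static void main(String[] args) {
        // Epoch 0 seen from a client five hours behind GMT.
        System.out.println(toLocal(0L, -5)); // prints 1969-12-31T19:00
    }
}
```

A fixed offset keeps the sketch small; a production pipeline would more likely carry a named zone so that daylight saving changes are handled.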

SQL Validation Test Code
SELECT TIMEZONE_OFFSET,
       TIMESTAMP_GMT,
       TIMESTAMP_LTZ
FROM APPSTORE_REVIEWS
WHERE TIMEZONE_OFFSET <> 0
LIMIT 10;

Why this is so great
• Collaboration and Participation
• Thinking naturally begets better scenarios
• We are able to unify
  – Use Cases
    • We are using ETL...
  – All Projects

Expressive Scenarios
• We define Features and Scenarios
• Given/When/Then
  – Gherkin doesn’t care how you use them; they just help with readability.

Given a BDD Presentation
When it is late in the day
And FIS is giving the talk
Then get Excited!

Wait, there’s more!
Step Definitions
• Step Definitions say how to do it; the Feature file says what to do
• This is the boundary between the programmer’s domain and the business domain.
• It’s not all snake oil, really...
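To make the "what" versus "how" split concrete, here is a minimal plain-Java sketch of step methods for a hypothetical timestamp scenario. It is not the code shown in the talk: the class `TimestampSteps`, the step wording, and the whole-hour offset are assumptions. In a real Cucumber project the methods would carry `@Given`/`@When`/`@Then` annotations (for example from `io.cucumber.java.en` in recent cucumber-java releases); they appear here as comments so the sketch compiles without that dependency.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

// Hypothetical step class; annotations are shown as comments only.
public class TimestampSteps {
    private long epochSeconds;
    private int offsetHours;
    private LocalDateTime localTimestamp;

    // @Given("an event logged at epoch {long} with timezone offset {int}")
    public void anEventLoggedAt(long epochSeconds, int offsetHours) {
        this.epochSeconds = epochSeconds;
        this.offsetHours = offsetHours;
    }

    // @When("the local timestamp is computed")
    public void theLocalTimestampIsComputed() {
        // The "how": the transformation under test lives behind this step.
        localTimestamp = LocalDateTime.ofInstant(
                Instant.ofEpochSecond(epochSeconds), ZoneOffset.ofHours(offsetHours));
    }

    // @Then("the local timestamp is {string}")
    public void theLocalTimestampIs(String expected) {
        if (!localTimestamp.toString().equals(expected)) {
            throw new AssertionError(
                    "expected " + expected + " but was " + localTimestamp);
        }
    }

    public static void main(String[] args) {
        // Runs the steps in the order a scenario would call them.
        TimestampSteps steps = new TimestampSteps();
        steps.anEventLoggedAt(0L, -5);
        steps.theLocalTimestampIsComputed();
        steps.theLocalTimestampIs("1969-12-31T19:00");
        System.out.println("scenario passed");
    }
}
```

The feature file would only contain the three sentences in the comments; everything about epochs, offsets, and Java types stays on the programmer's side of the boundary.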

Cucumber Step Definition Code
[Two slides of code screenshots; the code is not preserved in this text]

For these two guys, ETL wasn’t hell, it was target practice.
2 Kinds of People in this world… About

Enterprise Stuff
• You will notice we are sticking to Eclipse/IntelliJ
• Enterprises usually need to prove Separation of Duties and Audit Trails
• Most data processing tasks should have an established process to ensure quality and correctness.
– Every business has its own custom approach to business rules
– Consolidating these transformations ensures quality and allows QA checking
• Notebooks:
– It’s really hard to enforce that consistency and correctness in notebooks, except by compiling libraries.
– Unifying business logic and common transformations removes the prep work.

Code
https://dbc-39f78c99-dfb2.cloud.databricks.com/#notebook/28139

Tips and Tricks (Pretty Report)
• plugin = {"pretty", "html:target/cucumber"} isn’t very pretty
• Use cucumber-reports

Tips and Tricks (miscellaneous)
• Java cuke > Scala cuke
– IntelliJ integration, speed, etc...
• .config("spark.driver.host", "127.0.0.1")
– Saves overall test execution time for each run
• Think hard about whether things should get tested at unit or component level
– Spark takes longer to compute the path on more complex DAGs (e.g. longer tests), but those tests can provide more value

Tips and Tricks (Code Coverage for Scala Cuke)

Resources, Resources, Resources
• cucumber.io
• cucumber-reporting
• Pragmatic Books
– Cucumber Book
– Cucumber Recipes
– Cucumber for Java
• Specification by Example
• Databricks Blog
