TESTING
BIG DATA
SOLUTIONS
FAST AND
FURIOUSLY
ABOUT ME
Dmitriy Sobko
Lead QA
Zoral
dmitriy.sobko@gmail.com
AGENDA
• Big Data
• BI / ETL
• DWH
• Cloud
• Testing concepts
• Framework example
First, we had data. Now
we have big data.
The more data there is,
the more you know about
things and the sharper
your decisions become
WHAT IS BIG DATA
BUSINESS INTELLIGENCE (BI)
• Know your data to make better
decisions
• Set of practices, architectures
and technologies for
gathering, processing and
analyzing the data
BI. CLOSER VIEW
• Daily transactions and correspondences are
recorded
• Records are collected in databases
• Data are processed and transformed into
usable information
• Information is analyzed to generate insight
ETL
• Extracts data from multiple,
disparate source systems,
such as records databases
• Transforms this data into usable
information for decision makers
• Loads the data into data
warehouses, from which end-
users can readily extract usable
data for query and analysis
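The three ETL steps can be sketched in a few lines of Kotlin. This is a minimal illustration, not the pipeline from the talk: the row types, the column names, and the in-memory target list are all assumptions made for the example.

```kotlin
// Hypothetical row types; the field names are illustrative only.
data class RawEvent(val user: String, val views: String)
data class ViewsFact(val user: String, val views: Int)

// Extract: parse CSV lines (header skipped) into raw rows.
fun extract(csv: String): List<RawEvent> =
    csv.lines()
        .drop(1)
        .filter { it.isNotBlank() }
        .map { line ->
            val (user, views) = line.split(",").map { it.trim() }
            RawEvent(user, views)
        }

// Transform: cast types and drop rows whose view count is not a number.
fun transform(rows: List<RawEvent>): List<ViewsFact> =
    rows.mapNotNull { row -> row.views.toIntOrNull()?.let { ViewsFact(row.user, it) } }

// Load: append the transformed rows to an in-memory "target table".
fun load(target: MutableList<ViewsFact>, rows: List<ViewsFact>) {
    target.addAll(rows)
}

fun main() {
    val csv = "user,views\nalice,10\nbob,n/a\ncarol,7"
    val target = mutableListOf<ViewsFact>()
    load(target, transform(extract(csv)))
    println(target)
}
```

Keeping each step a separate function means each one can be tested on its own with small, hand-written inputs.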
INPUT CSV
STAGING TABLE
TARGET TABLE
REPORT
Amount of Spotify’s Delivered Events over time
https://labs.spotify.com/2016/02/25/spotifys-event-delivery-the-road-to-the-cloud-part-i/
MOVING TO
CLOUD
https://www.alooma.com/blog/best-practices-for-migrating-data-from-on-prem-to-cloud
Worldwide Cloud IT Infrastructure Market Forecast
TEST TYPES
Accuracy Testing
Completeness Testing
Data Validation Testing
Metadata Testing
Performance Testing
ACCURACY TESTING
It checks whether the data is accurately transformed
and loaded from the source to the data warehouse
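One way to express an accuracy check is to re-apply the transformation rule to the source rows and compare the result with what was loaded. The sketch below assumes a hypothetical rule (cents converted to currency units) and hypothetical row types; neither comes from the talk.

```kotlin
// Hypothetical source and target row shapes.
data class SourceRow(val id: Int, val amountCents: Long)
data class TargetRow(val id: Int, val amount: Double)

// Assumed transformation rule under test: cents -> currency units.
fun expectedTarget(row: SourceRow) = TargetRow(row.id, row.amountCents / 100.0)

// Returns the ids whose loaded row is absent or differs from the re-computed expectation.
fun inaccurateIds(source: List<SourceRow>, target: List<TargetRow>): List<Int> {
    val loaded = target.associateBy { it.id }
    return source.filter { loaded[it.id] != expectedTarget(it) }.map { it.id }
}

fun main() {
    val source = listOf(SourceRow(1, 1250), SourceRow(2, 300))
    val target = listOf(TargetRow(1, 12.5), TargetRow(2, 30.0))
    println(inaccurateIds(source, target)) // id 2 was loaded with the wrong amount
}
```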
COMPLETENESS TESTING
This verifies whether all the data from the source are
loaded into the data warehouse
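A completeness check usually comes down to comparing row counts and finding source keys that never arrived. A minimal sketch, assuming rows can be identified by a string key (the report type and function name are hypothetical):

```kotlin
// Hypothetical report type for the comparison.
data class CompletenessReport(
    val sourceCount: Int,
    val targetCount: Int,
    val missingKeys: Set<String>
)

// Compares row counts and lists the source keys that are absent from the target.
fun checkCompleteness(sourceKeys: List<String>, targetKeys: List<String>): CompletenessReport =
    CompletenessReport(
        sourceCount = sourceKeys.size,
        targetCount = targetKeys.size,
        missingKeys = sourceKeys.toSet() - targetKeys.toSet()
    )

fun main() {
    val report = checkCompleteness(listOf("a", "b", "c"), listOf("a", "c"))
    println(report)
}
```

Against a real warehouse the two key lists would come from queries on the source and target tables; comparing keys in addition to counts catches the case where one row is lost and another is duplicated.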
DATA VALIDATION TESTING
This assesses whether the values of the data post-
transformation are the same as their expected values
with respect to the source values
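At the level of a single field, this can be sketched as pairs of source value and loaded value, checked against the mapping the transformation is supposed to apply. The status-code mapping below is an invented example, not a rule from the talk.

```kotlin
// Assumed transformation: raw status codes -> warehouse labels.
fun toStatusLabel(code: String): String = when (code.trim().uppercase()) {
    "A" -> "ACTIVE"
    "I" -> "INACTIVE"
    else -> "UNKNOWN"
}

// Takes (sourceValue, loadedValue) pairs and describes every mismatch.
fun mismatches(pairs: List<Pair<String, String>>): List<String> =
    pairs
        .filter { (src, loaded) -> toStatusLabel(src) != loaded }
        .map { (src, loaded) ->
            "source='$src' expected='${toStatusLabel(src)}' actual='$loaded'"
        }

fun main() {
    val pairs = listOf("A" to "ACTIVE", " i " to "INACTIVE", "X" to "ACTIVE")
    mismatches(pairs).forEach(::println)
}
```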
METADATA TESTING
This checks whether data retains its integrity up to the
metadata level — that is, its length, indexes,
constraints, and type
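A metadata check can be written as a comparison of the expected column definitions with the ones found in the target table. The sketch below models a column with a hypothetical `Column` type; in practice the actual definitions would be read from the warehouse's schema catalog.

```kotlin
// Hypothetical column description: name, type, optional length, nullability.
data class Column(val name: String, val type: String, val length: Int?, val nullable: Boolean)

// Lists every expected column that is missing or defined differently in the actual schema.
fun schemaDiff(expected: List<Column>, actual: List<Column>): List<String> {
    val actualByName = actual.associateBy { it.name }
    val problems = mutableListOf<String>()
    for (col in expected) {
        val found = actualByName[col.name]
        when {
            found == null -> problems += "missing column ${col.name}"
            found != col -> problems += "column ${col.name}: expected $col, got $found"
        }
    }
    return problems
}

fun main() {
    val expected = listOf(
        Column("name", "STRING", 255, false),
        Column("views", "INTEGER", null, false)
    )
    val actual = listOf(Column("name", "STRING", 100, false))
    schemaDiff(expected, actual).forEach(::println)
}
```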
PERFORMANCE TESTING
• How long it takes to process streaming data and batch
data
• How long reports, data marts, and data feeds take to calculate
• SLA compliance
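The simplest form of an SLA check is to time a job and compare the elapsed time with a threshold. A minimal sketch, where the `withinSla` helper and the threshold value are assumptions for illustration:

```kotlin
import kotlin.system.measureTimeMillis

// Runs the job once and reports whether it finished within the SLA.
fun withinSla(slaMillis: Long, job: () -> Unit): Boolean {
    val elapsed = measureTimeMillis(job)
    return elapsed <= slaMillis
}

fun main() {
    // Stand-in for a real batch job or report calculation.
    val ok = withinSla(slaMillis = 5_000) { (1..1_000_000).sum() }
    println("Within SLA: $ok")
}
```

A single measurement is noisy, so a real performance test would typically repeat the run and look at the distribution of timings rather than one value.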
TEST APPROACHES
• Test on real data
• Test code with mocks/stubs
TEST ON REAL DATA
TEST ON MOCKS/STUBS
MIXTURE OF
BOTH
APPROACHES
UNIT TESTS
"WordCount" should "work" in {
  JobTest[com.spotify.scio.examples.WordCount.type]
    .args("--input=in.txt", "--output=out.txt")
    // Replace the real input file with in-memory test data
    .input(TextIO("in.txt"), inData)
    // Assert on what the job would write to the output file
    .output(TextIO("out.txt")) { coll =>
      coll should containInAnyOrder(expected)
    }
    .run()
}
Checks that the job correctly processes the input data file
INTEGRATION TESTS
val stream = testStreamOf[GameActionInfo]
  .advanceWatermarkTo(bTime)
  // Add some elements ahead of the watermark
  .addElements(
    event(blue1, 3, Duration.standardSeconds(3)),
    event(blue2, 2, Duration.standardMinutes(1)),
    event(red1, 3, Duration.standardSeconds(22))
  )
  // The watermark advances slightly, but not past the end of the window
  .advanceWatermarkTo(bTime.plus(Duration.standardMinutes(3)))
Checks that the pipeline correctly reads data from a stream
ACCEPTANCE TESTS
• Make each test self-sufficient and
independent
• Rely on the data contract, not
the implementation
• Assert data as fully as possible
TESTS SHOULD BE
•Stable
•Resistant to constant
code changes
•Fast
•Extensible
•Easy to maintain
TECHNOLOGY
STACK
KOTLIN
Kotlin is a general purpose, open
source, statically typed “pragmatic”
programming language for the JVM
that combines object-oriented and
functional programming features.
It is focused on interoperability, safety,
clarity, and tooling support.
SPRING
Spring Boot makes it easy to create
stand-alone, production-grade Spring
based applications that you can “just
run”.
The same goes for testing frameworks:
you can get started with minimum
fuss and very little
pre-configuration.
CUCUMBER
Cucumber is a software tool to run
automated tests written in a behavior-
driven development (BDD) style.
Central to the Cucumber BDD
approach is its plain language parser
called Gherkin. It allows expected
software behaviors to be specified in
a logical language that customers can
understand.
GRADLE
Gradle is an open-source build
automation tool focused on flexibility
and performance.
Gradle build scripts are written using
a Groovy or Kotlin DSL.
COURGETTE TEST RUNNER
Courgette Test Runner is an
extension of Cucumber-JVM with
added capabilities to run Cucumber
tests in parallel on a feature level or
on a scenario level.
CODE
WHAT AN AUTOTEST LOOKS LIKE
Feature: River project test feature

  Scenario: Check Alpha feed
    Given I check Alpha name field is correct
    And I check Alpha views field is correct
    And I check Alpha xViews field is correct
    And I check Alpha yViews field is correct
    And I check Alpha otherViews field is correct
    And I check Alpha reportDate field is correct

  Scenario: Check Beta feed
    Given I check Beta passName field is correct
    And I check Beta views field is correct
    And I check Beta channelName field is correct
    And I check Beta reportDate field is correct
WHAT THE CODE LOOKS LIKE
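// Step definition: binds the Gherkin step to the service call
@Given("^I check Alpha views field is correct$")
fun assertAlphaViewsField() {
    service.checkAlphaViewsField()
}

// Service method: runs the check query for the field
fun checkAlphaViewsField() =
    execCheckCountQuery(ALPHA_VIEWS_FIELD)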
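The slides do not show what `execCheckCountQuery` does. One plausible shape, offered purely as an assumption, is a helper that runs a query counting rows that violate a rule and fails when the count is not zero. In the sketch below, `CountQueryRunner`, `FeedCheckService`, and the zero-violations convention are all hypothetical, and the warehouse client is replaced by a stub.

```kotlin
// Hypothetical abstraction over the warehouse client; not from the talk.
fun interface CountQueryRunner {
    fun count(sql: String): Long
}

class FeedCheckService(private val runner: CountQueryRunner) {
    // Assumed convention: the query counts rows that violate the rule,
    // so any non-zero result fails the check.
    fun execCheckCountQuery(sql: String) {
        val violations = runner.count(sql)
        check(violations == 0L) { "Check failed: $violations violating rows for query: $sql" }
    }
}

fun main() {
    // Stub runner standing in for a real database call.
    val service = FeedCheckService(CountQueryRunner { 0L })
    service.execCheckCountQuery("SELECT COUNT(*) FROM alpha WHERE views < 0")
    println("Check passed")
}
```

Hiding the database behind a small interface keeps the step definitions thin and lets the check logic be exercised with a stub, in line with the mocks/stubs approach described earlier.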
WHAT THE RUNNER LOOKS LIKE
@RunWith(Courgette::class)
@CourgetteOptions(
    threads = 4,
    runLevel = CourgetteRunLevel.FEATURE,
    rerunFailedScenarios = false,
    cucumberOptions = CucumberOptions(
        features = arrayOf("resources/features"),
        glue = arrayOf("com.dsobko.test"),
        tags = arrayOf("@Ready", "~@Bug"),
        plugin = arrayOf("pretty", "html:build/cucumber-report")
    )
)
object CucumberFeaturesRunner
TEST REPORT
ALTERNATIVE SOLUTIONS
LINKS
https://labs.spotify.com/2016/03/10/spotifys-event-delivery-the-road-to-the-cloud-part-iii/
https://kotlinlang.org/
https://spring.io/projects/spring-boot
https://cucumber.io/
THANKS

QA Fest 2019. Dmitriy Sobko. Testing Big Data solutions fast and furiously
