Scalable Whole-Exome Sequence Data
Processing Using Workflow On A Cloud
Paolo Missier, Jacek Cała, Yaobo Xu, Eldarina Wijaya
School of Computing Science and Institute of Genetic Medicine
Newcastle University, Newcastle upon Tyne, UK
FGCS Forum
Roma, April 24, 2016
The challenge
• Port an existing WES/WGS pipeline
• From HPC to a (public) cloud
• While achieving more flexibility and better abstraction
• With better performance than the equivalent HPC deployment
Scripted NGS data processing pipeline
• Alignment: aligns each sample's sequence to the HG19 reference genome using the BWA aligner
• Cleaning and duplicate elimination: Picard tools
• Recalibration (GATK): corrects for systematic bias on the quality scores assigned by the sequencer
• Coverage (GATK): computes the coverage of each read
• Variant calling: operates on multiple samples simultaneously; splits samples into chunks; the haplotype caller detects both SNVs and longer indels
• Variant recalibration: attempts to reduce the false-positive rate from the caller
• VCF subsetting: by filtering, e.g. non-exomic variants
• Annotation: Annovar functional annotations (e.g. MAF, synonymy, SNPs…) followed by in-house annotations
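For orientation, a minimal command-line sketch of what these steps look like is given below. The commands, file names and flags are illustrative assumptions only; the slide does not reproduce the exact tool versions or parameters used in the pipeline.
# Illustrative commands only; file names and flags are assumed, not taken from the pipeline
bwa mem hg19.fa sample_R1.fastq.gz sample_R2.fastq.gz > sample.sam                                          # align to HG19 with BWA
java -jar picard.jar MarkDuplicates INPUT=sample.sorted.bam OUTPUT=sample.dedup.bam METRICS_FILE=dup.txt    # clean / remove duplicates
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R hg19.fa -I sample.dedup.bam -knownSites dbsnp.vcf -o recal.table   # quality-score recalibration
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa -I sample.recal.bam -o sample.vcf              # call SNVs and indels
table_annovar.pl sample.vcf humandb/ -buildver hg19 -out sample.annotated -protocol refGene -operation g -vcfinput   # Annovar functional annotation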
The original implementation
echo Preparing directories $PICARD_OUTDIR and $PICARD_TEMP
mkdir -p $PICARD_OUTDIR
mkdir -p $PICARD_TEMP
echo Starting PICARD to clean BAM files...
$Picard_CleanSam INPUT=$SORTED_BAM_FILE OUTPUT=$SORTED_BAM_FILE_CLEANED
echo Starting PICARD to remove duplicates...
$Picard_NoDups INPUT=$SORTED_BAM_FILE_CLEANED OUTPUT=$SORTED_BAM_FILE_NODUPS_NO_RG \
    METRICS_FILE=$PICARD_LOG REMOVE_DUPLICATES=true ASSUME_SORTED=true
echo Adding read group information to bam file...
$Picard_AddRG INPUT=$SORTED_BAM_FILE_NODUPS_NO_RG OUTPUT=$SORTED_BAM_FILE_NODUPS \
    RGID=$READ_GROUP_ID RGPL=illumina RGSM=$SAMPLE_ID \
    RGLB="${SAMPLE_ID}_${READ_GROUP_ID}" RGPU="platform_Unit_${SAMPLE_ID}_${READ_GROUP_ID}"
echo Indexing bam files...
samtools index $SORTED_BAM_FILE_NODUPS
• Pros
• simplicity – 50-100 lines of bash code
• flexibility of the bash language
• Cons
• embedded dependencies between steps
• low-level configuration
Problem scale
Data stats per sample:
• 4 files per sample (2-lane, paired-end reads)
• ≈15 GB of compressed text data (gz)
• ≈40 GB of uncompressed text data (FASTQ)
Usually 30-40 input samples per batch:
• 0.45-0.6 TB of compressed data
• 1.2-1.6 TB uncompressed
Most steps use 8-10 GB of reference data.
A small 6-sample run takes about 30 h on the IGM HPC machine (Stages 1+2).
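As a sanity check, the batch totals follow directly from the per-sample sizes; a trivial sketch using the numbers on this slide (40 is just the top of the 30-40 sample range):
SAMPLES=40; GB_GZ=15; GB_RAW=40                  # per-sample sizes quoted above
echo "$((SAMPLES * GB_GZ)) GB compressed"        # 600 GB ≈ 0.6 TB
echo "$((SAMPLES * GB_RAW)) GB uncompressed"     # 1600 GB ≈ 1.6 TB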
Scripts to workflow - Design
(Roadmap: Design → Cloud Deployment → Execution → Analysis)
• Better abstraction
• Easier to understand, share,
maintain
• Better exploit data parallelism
• Extensible by wrapping new tools
Theoretical advantages of using a workflow programming model
Workflow Design
[Diagram: the original bash script (shown on the previous slide) is mapped onto workflow blocks — "wrapper" blocks, which each invoke one external tool, and utility blocks, which import and export the files that connect the steps.]
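As a conceptual illustration of this decomposition (not e-Science Central code — blocks are configured in the e-SC workflow editor, and the import/export helper names below are invented): a wrapper block performs one tool invocation with explicitly declared inputs and outputs, while utility blocks move files between the shared data space and the engine's local filesystem.
# Conceptual sketch only; the import/export helpers are hypothetical, not the e-SC API
import_from_shared_store sample.sorted.bam                                        # utility block: stage the input locally
java -jar picard.jar CleanSam INPUT=sample.sorted.bam OUTPUT=sample.cleaned.bam   # wrapper block: one tool, declared input and output
export_to_shared_store sample.cleaned.bam                                         # utility block: publish the output for the next block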
Workflow design
[Diagram. Conceptual view: per sample, raw sequences → align → clean → recalibrate alignments → calculate coverage (coverage information), then call variants → recalibrate variants → filter variants → annotate → annotated variants, grouped into Stage 1, Stage 2 and Stage 3.]
Conceptual: the three-stage pipeline above
Actual: 11 workflows, 101 blocks, 28 tool blocks
Anatomy of a complex parallel dataflow
eScience Central: simple dataflow model…
[Diagram: the same three-stage conceptual pipeline (Stage 1, Stage 2, Stage 3), replicated once per sample in the batch.]
Sample-split: parallel processing of samples in a batch
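e-SC performs this sample split automatically by submitting one sub-workflow invocation per sample; purely as an illustration, the same idea expressed as a shell loop (the script name is hypothetical) would be:
# Illustration only: one Stage 1 sub-workflow per sample, run in parallel
for SAMPLE in sample_01 sample_02 sample_03; do
  ./stage1_align_clean_recalibrate.sh "$SAMPLE" &   # hypothetical per-sample script
done
wait   # the next stage starts only when every sample has finished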
Anatomy of a complex parallel dataflow
… with hierarchical structure
Cloud Deployment
(Roadmap: Design → Cloud Deployment → Execution → Analysis)
Scalability
• Exploiting data parallelism
• Fewer installation/deployment requirements and staff hours required
• Automated dependency management and packaging
• Configurable to make the most efficient use of a cluster
Parallelism in the pipeline
[Diagram: Stage I (align, clean, recalibrate, calculate coverage) runs in parallel per sample (Sample 1 … Sample N); a chromosome split (Chr1, Chr2, …, ChrM) then lets Stage II (variant calling and recalibration) and Stage III (variant filtering and annotation) run in parallel per chromosome, producing annotated variants.]
Per-sample parallel processing (Stage I); per-chromosome parallel processing (Stages II and III)
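The chromosome split in Stages II and III can be pictured the same way: independent variant-calling jobs, each restricted to one chromosome via GATK's interval option (file names here are assumptions, not the pipeline's actual configuration):
# Illustration of the chromosome split: one HaplotypeCaller job per chromosome
for CHR in chr1 chr2 chrM; do
  java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa \
       -I cohort.recal.bam -L "$CHR" -o "variants.${CHR}.vcf" &
done
wait   # per-chromosome VCFs are then recalibrated, filtered and annotated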
Workflow on Azure Cloud – modular configuration
[Architecture diagram: an Azure VM hosts the e-Science Central main server (Web UI, REST API, JMS queue), backed by the e-SC database and an Azure Blob store; web browsers and a rich client app submit workflow invocations; several worker-role VMs each run a workflow engine and exchange e-SC control data and workflow data with the main server and the e-SC blob store.]
Workflow engines module configuration: 3 nodes, 24 cores
Modular architecture → indefinitely scalable!
Workflow and sub-workflows execution
[Diagram: a fragment of a workflow invocation executing on one engine; each executable block that runs a sub-workflow submits it to the e-SC (JMS) queue, from which the worker-role workflow engines pick it up (same deployment as on the previous slide).]
Workflow invocation executing on one engine (fragment)
Scripts to workflow
(Roadmap: Design → Cloud Deployment → Execution → Analysis)
3. Execution
• Runtime monitoring
• Provenance collection
Performance
Configurations for the 3-VM experiments:
• HPC cluster (dedicated nodes): 3 × 8-core compute nodes, Intel Xeon E5640 2.67 GHz, 48 GiB RAM, 160 GB scratch space
• Azure workflow engines: D13 VMs with an 8-core CPU, 56 GiB of memory and a 400 GB SSD, running Ubuntu 14.04
[Chart: response time [hh:mm] (00:00-72:00) vs. number of samples (0-24), for 3 engines (24 cores), 6 engines (48 cores) and 12 engines (96 cores).]
Comparison with HPC
[Charts: left, response time [hours] (0-168) vs. number of input samples (0-24); right, system throughput [GiB/hr] (0-6) vs. size of the sample cohort [GiB] (0-400). Series: HPC (3 compute nodes), Azure (3×D13, SSD) – sync, Azure (3×D13, SSD) – chained.]
Scalability
There is little incentive to grow the VM pool beyond 6 engines
Cost
Again, a 6-engine configuration achieves near-optimal cost per sample
[Chart: cost per sample [£] and cost per GiB [£] vs. number of samples (0-24) and size of the input data [GiB] (0-350), for 3 engines (24 cores), 6 engines (48 cores) and 12 engines (96 cores).]
Lessons learnt
(Roadmap: Design → Cloud Deployment → Execution → Analysis)
✓ Better abstraction
• Easier to understand, share, maintain
✓ Better exploit data parallelism
✓ Extensible by wrapping new tools
• Scalability
✓ Fewer installation/deployment requirements, staff hours required
✓ Automated dependency management, packaging
✓ Configurable to make most efficient use of a cluster
✓ Runtime monitoring
✓ Provenance collection
✓ Reproducibility
✓ Accountability
Editor's Notes

  • #3 Objective 1: implement a cloud-based, secure, scalable computing infrastructure capable of translating the potential benefits of high-throughput sequencing into actual genetic diagnoses for health care professionals. Objective 2: a front-end tool to facilitate clinical diagnosis. A 2-year pilot project funded by the UK's National Institute for Health Research (NIHR) through the Biomedical Research Centre (BRC). Nov. 2013: cloud resources from an Azure for Research Award, one year's worth of data/network/computing resources.
  • #4 Current local implementation: a scripted pipeline → requires expertise to maintain and evolve; deployed on the local departmental cluster; difficult to scale; cost per patient unknown; unable to take advantage of the decreasing cost of commodity cloud resources. Coverage information translates into confidence in a variant call. Recalibration means quality-score recalibration: the machine produces colour coding for the four bases, along with a p-value indicating the highest-probability call; these are the Q scores. Different platforms introduce different systematic bias on Q scores, and the bias also depends on the lane: each lane gives a different systematic bias. The point of recalibration is to correct for this type of bias.
  • #5 Wrapper blocks, such as Picard-CleanSAM and Picard-MarkDuplicates, communicate via files in the local filesystem of the workflow engine, which is explicitly denoted as a connection between blocks. The workflow also includes utility blocks to import and export files, i.e. to transfer data from/to the shared data space (in this case, the Azure blob store). These were complemented by e-SC shared libraries, which provide better efficiency in running the tools, as they are installed only once and cached by the workflow engine for any future use. Libraries also promote reproducibility because they eliminate dependencies on external data and services. For instance, to access the human reference genome we built and stored in the system a shared library that included the genome data in a specific version and flavour (precisely, HG19 from UCSC).
  • #9 Same notes as #5 (wrapper blocks, utility blocks, shared libraries).
  • #15 Sync design: the subworkflows of each step are executed in parallel but synchronously over a number of samples. This means that the top-level workflow submits N subworkflow invocations for a particular step and waits for all of them to complete. The primary advantage of this synchronous design is that the structure of the pipeline is modular and clearly represented by the top-level orchestrating workflow, whilst the parallelisation is managed by e-SC automatically. The top-level workflow mainly includes blocks to run subworkflows, which are independent parts implementing only the actual work done by a particular step. The control blocks take care of the interaction with the system to submit the subworkflows and also suspend the parent invocation until all of them complete.
  • #20 The model currently used is synchronous execution.
  • #22 Each sample included 2-lane, paired-end raw sequence reads (4 files per sample). The average size of the compressed files was nearly 15 GiB per sample; file decompression was included in the pipeline as one of the initial tasks.
  • #23 3 workflow engines perform better than our HPC benchmark on larger sample sizes