Cloud BioLinux: open source, fully-customizable
 bioinformatics computing on the cloud for the
       genomics community and beyond

          BOSC 2011 - Vienna, Austria



                  Ntino Krampis, PhD
                     Asst. Professor
            J. Craig Venter Institute (JCVI)
                 agbiotec@gmail.com
Expensive sequencing and large organizations

●   large sequencing center, multi-million-dollar, broad-impact sequencing projects
●   dedicated bioinformatics department, large Sun Grid Engine cluster

Commodity sequencing and small labs

●   small form-factor, bench-top sequencer available: GS Junior by 454
●   sequencing as a standard technique in basic biology and genetics research
●   RNA-seq and ChIP-seq, and each biologist will be tackling a metagenome
Will small labs become the long tail of sequencing?

[Figure: long-tail curve. Y-axis: amount of sequencing; x-axis: number of labs. Credit: Wikimedia Commons]
“Bioinformatics nation is a land of city-states” Lincoln Stein

●   small labs building small-scale bioinformatics infrastructures
●   duplication of effort in compiling and installing software tools
●   some labs have no hardware, expertise, or time to install and run software

●   NEBC BioLinux ( tinyurl.com/BioLinux-NEBC ) 100+ pre-configured tools
●   example: glimmer, hmmer, phylip, rasmol, genespring, clustalw, EMBOSS



    how about large-scale sequence datasets ?
Cloud BioLinux
      pre-configured and on-demand bioinformatics computing on the cloud



                        ●   JCVI cloud computing research
                        ●   NEBC bioinformatics software repository
        +               ●   community effort – Hackathon / BOSC 2010 - 11
                        ●   pre-configured Virtual Machine (VM, image)

                        ●   large-scale computing independent of institutional or geographic boundaries
        =               ●   only need a desktop computer with internet access




cloudbiolinux.org
Cloud BioLinux
                 simple for end-users


sign up at aws.amazon.com, then open aws.amazon.com/console and follow the tutorial:




http://tinyurl.com/cloud-biolinux-tutorial
Amazon EC2 → Linux desktop via remote desktop client
What if I want to share my alignments with a collaborator?

save your data as a new VM

$0.10 / GB / month

at 15 GB, it costs $1.50 / month
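
The storage arithmetic on this slide can be checked with a minimal sketch. It assumes the flat rate quoted above ($0.10 per GB per month); the function and constant names are hypothetical, and actual cloud pricing may differ by region and date.

```python
# Flat storage rate quoted on the slide, in US dollars per GB per month.
# Assumption: a single flat rate with no tiers or transfer fees.
PRICE_PER_GB_MONTH = 0.10


def monthly_storage_cost(size_gb, price_per_gb=PRICE_PER_GB_MONTH):
    """Return the monthly cost in dollars of storing a saved VM of size_gb gigabytes."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    return size_gb * price_per_gb


# The slide's example: a 15 GB saved VM.
print(f"${monthly_storage_cost(15):.2f} / month")
```

Doubling the saved data to 30 GB would double the monthly cost under the same flat-rate assumption.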
“whole system snapshot exchange” (Dudley and Butte 2010)
capture the state of the computing system and data
software execution parameters and “massaged” input datasets
Cloud BioLinux developer's framework
        create cloud VM / images with standardized software configurations



●   customize Cloud BioLinux based on community requirements

●   mix and match software from NEBC or other sources (Debian Med, Scientific Linux, etc.)

●   share customized VMs with collaborators, avoiding effort duplication

●   deploy Cloud BioLinux on private and local clouds
Cloud BioLinux developer's framework

     ●   based on python-fabric auto-deployment tool

     ●   software components listed in plain text files

     ●   collaborators use files to share descriptions of cloud VM / images

     ●   start with a bare-bones VM / image

     ●   fabric downloads and installs specified software
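
The workflow in the bullets above (a plain-text software list, read and installed onto a bare-bones image) could look roughly like the sketch below. This is an illustration under stated assumptions, not the actual Cloud BioLinux scripts: the one-package-per-line file format, the function names, and the example list are all hypothetical. To stay runnable without a remote machine, the sketch passes each command to a caller-supplied `run` function; with Fabric 1.x that function would typically be `sudo` from `fabric.api`.

```python
def parse_package_list(text):
    """Return package names from a plain-text list.

    Assumed format (hypothetical): one package per line,
    blank lines ignored, '#' starts a comment.
    """
    packages = []
    for line in text.splitlines():
        name = line.split("#", 1)[0].strip()
        if name:
            packages.append(name)
    return packages


def build_install_command(packages):
    """Build a single non-interactive apt-get command for the given packages."""
    return "apt-get install -y " + " ".join(packages)


def install(run, text):
    """Parse the list and hand the install command to `run`.

    `run` is any callable taking a command string, for example
    Fabric 1.x's `sudo` (assumption) or `print` for a dry run.
    """
    packages = parse_package_list(text)
    if packages:
        run(build_install_command(packages))
    return packages


# Illustrative list only; not taken from the Cloud BioLinux repository.
EXAMPLE_LIST = """\
# sequence analysis tools
emboss
hmmer      # profile HMM searches
clustalw
"""

if __name__ == "__main__":
    install(print, EXAMPLE_LIST)  # dry run: prints the command instead of executing it
```

Because the software description is only a text file, collaborators can exchange and edit it without touching the deployment code.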




tinyurl.com/python-fabric        open.eucalyptus.com
software domains in bioinformatics: nextgen
sequencing, de novo assembly, annotation, phylogeny,
    molecular structures, gene expression analysis


        github.com/chapmanb/cloudbiolinux
Cloud BioLinux
                                  The future


●   expand community, receive feedback, add more software to the VM

●   groups.google.com/cloudbiolinux and cloudbiolinux.org

●   add data analysis pipelines that are used by sequencing centers

●   actively seeking funding to put major effort into development

●   2011 ISMB/BOSC in Vienna, Austria, http://metalab.at/

Acknowledgments & Credits

Brad Chapman     - development of the fabric scripts and community organizer
Tim Booth, Mesude Bicak, Dawn Field, Bela Tiwari – BioLinux 6.0
J. Craig Venter Inst. - time allowed to work on an open-source project
D. Gomez, E. Navarro, J. Shao, I. Singh – JCVI technology innovation
Deepak Singh and AWS - education grant supporting ISMB / BOSC workshop
Members of the Cloud BioLinux community – precious development time


       Thank you !
