Dr N A Ganai
Professor
Centre of Animal Biotechnology
SKUAST-Kashmir
Contents
 Introduction to Bioinformatics
 Complexity of life
 Size of genome
 Exponential growth in information
generation
 Why and how to handle this
information
 Definition of Bioinformatics?
 Data bases
 Tools
 Scope of Bioinformatics
 Anticipated benefits
 Ethical, Legal, and Social
Issues
DNA is not merely a molecule with a pattern;
it is a code, a language, and an information
storage mechanism
Size of Human Genome
 Each cell carries: 3.2 billion base pairs per haploid genome (about 6 billion in a diploid cell)
 A code you need to write in 500 books, each book of
500 pages
 Length of DNA in adult man:
 The total length of DNA present in one adult human is
calculated as:
 (length of 1 bp) × (number of bp per cell) × (number of cells in the body)
 = (0.34 × 10⁻⁹ m) × (6 × 10⁹) × (10¹³)
 ≈ 2.0 × 10¹³ meters
 That is the equivalent of nearly 70 trips from the earth
to the sun and back.
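The arithmetic on this slide can be checked with a short sketch. The three input figures are taken from the slide; the Earth-Sun distance (about 1.496 × 10¹¹ m) is an assumption added here for the comparison and does not appear in the original.

```python
# Rough estimate of the total DNA length in one adult human,
# using the figures quoted on the slide.
BP_LENGTH_M = 0.34e-9    # length of one base pair, in metres (from the slide)
BP_PER_CELL = 6e9        # base pairs per cell (from the slide)
CELLS_IN_BODY = 1e13     # number of cells in the body (from the slide)

# Assumption, not from the slide: mean Earth-Sun distance in metres.
EARTH_SUN_M = 1.496e11


def total_dna_length_m() -> float:
    """Length of 1 bp x bp per cell x number of cells."""
    return BP_LENGTH_M * BP_PER_CELL * CELLS_IN_BODY


def sun_round_trips(length_m: float) -> float:
    """How many Earth-Sun-Earth round trips the given length covers."""
    return length_m / (2 * EARTH_SUN_M)


if __name__ == "__main__":
    length = total_dna_length_m()
    print(f"Total DNA length: {length:.2e} m")
    print(f"Earth-Sun round trips: {sun_round_trips(length):.0f}")
```

With these inputs the result is a little over 2 × 10¹³ m, which is consistent with the "nearly 70 trips" figure.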
Human Genome Project
• HGP: International research effort
• Began 1990, completed 2003
• Biggest ever project in life
sciences
• 20 labs around the world
participated
• Next steps for ~30,000 genes
– Function and regulation of all genes
– Significance of variations between
people
– Cures, therapies, “genomic
healthcare”
From DNA to Cell Function
DNA sequence (split into genes)
→ codes for → Amino acid sequence
→ folds into → Protein
→ has → 3D structure
→ dictates → Protein function
→ determines → Cell activity
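The first step of this chain, a DNA sequence coding for an amino acid sequence, can be sketched in a few lines. This is a minimal illustration that assumes the standard genetic code and a coding strand read from its first base; the function name `translate` and the example sequence are hypothetical and not part of the original slides.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, with codons enumerated
# in T, C, A, G order at each position ("*" marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence into one-letter amino acids.

    Reads codons from the first base and stops at the first stop codon.
    Raises KeyError if a codon contains a character other than A, C, G, T.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # Hypothetical nine-base example: ATG GCC TAA
    print(translate("ATGGCCTAA"))
```

The later steps in the chain (folding, structure, function) have no comparably simple rule, which is one reason structure prediction tools exist.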
Lecture 2
Genomics
Transcriptomics
Proteomics
Metabolomics
Year Base Pairs Sequences
1982 680,338 606
1983 2,274,029 2,427
1984 3,368,765 4,175
1985 5,204,420 5,700
1986 9,615,371 9,978
1987 15,514,776 14,584
1988 23,800,000 20,579
1989 34,762,585 28,791
1990 49,179,285 39,533
1991 71,947,426 55,627
1992 101,008,486 78,608
1993 157,152,442 143,492
1994 217,102,462 215,273
1995 384,939,485 555,694
1996 651,972,984 1,021,211
1997 1,160,300,687 1,765,847
1998 2,008,761,784 2,837,897
1999 3,841,163,011 4,864,570
2000 11,101,066,288 10,106,023
2001 15,849,921,438 14,976,310
2002 28,507,990,166 22,318,883
2003 36,553,368,485 30,968,418
2004 44,575,745,176 40,604,319
2005 56,037,734,462 52,016,762
2006 69,019,290,705 64,893,747
2007 83,874,179,730 80,388,382
2008 99,116,431,942 98,868,465
Average growth in data generation:
about 1.6-fold per year (base pairs doubled roughly every 18 months, 1982 to 2008)
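An average yearly growth factor can be estimated from the first and last rows of the table, treating growth as compound. The sketch below uses only the 1982 and 2008 base-pair counts from the table; the function names are illustrative.

```python
import math

# First and last rows of the table (base pairs).
FIRST_YEAR, FIRST_BP = 1982, 680_338
LAST_YEAR, LAST_BP = 2008, 99_116_431_942


def annual_growth_factor(start_value: float, end_value: float, years: int) -> float:
    """Compound growth factor per year between two values."""
    return (end_value / start_value) ** (1 / years)


def doubling_time_years(factor: float) -> float:
    """Years needed to double at a constant yearly growth factor."""
    return math.log(2) / math.log(factor)


if __name__ == "__main__":
    factor = annual_growth_factor(FIRST_BP, LAST_BP, LAST_YEAR - FIRST_YEAR)
    print(f"Overall increase: {LAST_BP / FIRST_BP:,.0f}-fold")
    print(f"Average growth factor per year: {factor:.2f}")
    print(f"Doubling time: {doubling_time_years(factor):.2f} years")
```

This is an average over 26 years; individual years in the table grow faster or slower than the average.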
Exponential Growth in Biological Databases:
High-throughput Technologies
PCR: invented in 1983 by Kary Mullis, then an employee of Cetus Corporation, a
biotechnology firm in California
 Awarded the 1993 Nobel Prize in Chemistry for the invention of PCR
 Microarray Technology
 Real-Time PCR
 DNA Chips
Sequencing
 Sanger method: 1977
 Chain Termination Method
 Maxam Gilbert : 1977
 Chemical Modification Method
 Next Generation: 1994
 High Throughput
 Parallel sequencing
 Entire genome can be sequenced
in a matter of weeks
History of DNA Sequencing
1870 Miescher: Discovers DNA
1940 Avery: Proposes DNA as ‘Genetic Material’
1953 Watson & Crick: Double Helix Structure of DNA
1965 Holley: Sequences Yeast tRNA-Ala
1970 Wu: Sequences λ Cohesive End DNA
1977 Sanger: Dideoxy Chain Termination; Gilbert: Chemical Degradation
1980 Messing: M13 Cloning
1986 Hood et al.: Partial Automation
1990 to 2002
• Cycle Sequencing
• Improved Sequencing Enzymes
• Improved Fluorescent Detection Schemes
2008
• Next Generation Sequencing
• Improved enzymes and chemistry
• Improved image processing
Efficiency (bp/person/year), in increasing order along the timeline:
1; 15; 150; 1,500; 15,000; 25,000; 50,000; 200,000; 50,000,000; then 100,000,000,000 by 2008
Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998)
The Genome Sequence
is at hand…so?
“The good news is that we have the human genome.
The bad news is it’s just a parts list”
• Gene number, exact locations, and functions
• Gene regulation
• DNA sequence organization
• Noncoding DNA types, amount, distribution, information content, and functions
• Coordination of gene expression, protein synthesis, and post-translational events
• Interaction of proteins in complex molecular machines
• Predicted vs experimentally determined gene function
• Evolutionary conservation among organisms
• Protein conservation (structure and function)
• Proteomes (total protein content and function) in organisms
• Correlation of SNPs (single-base DNA variations among individuals) with health and
disease
• Disease-susceptibility prediction based on gene sequence variation
• Genes involved in complex traits and multigene diseases
• Complex systems biology including microbial consortia useful for environmental
restoration
• Developmental genetics, genomics
What Next???
We need to know every part, its function
and application
What is Bioinformatics?
 A fast-growing specialty
in the life sciences that integrates
biotechnology and computer science.
 Computers are used to collect, analyze,
and interpret biological information
at the molecular level.
 Bioinformatics encompasses a set of
software tools that aid in:
 molecular sequence analysis,
 structural analysis
 functional analysis
of genes & genomes and their
corresponding products
Goal of Bioinformatics
 Understand a living cell and how it
functions at the molecular level
 Develop databases and
computational tools
 Tools are used to mine (analyze)
databases to generate knowledge
to better understand living
systems
Biological Databases: Why
 Why?
 Store all the data (information) related to Genomics, Transcriptomics,
Proteomics, and Metabolomics in databases
 Make biological data available to scientists.
 To make biological data available in computer-readable form.
 Types of Databases
 Primary Databases: Store raw DNA/RNA and protein data
submitted by scientists
 GenBank: by NCBI, USA www.ncbi.nlm.nih.gov/genbank/
 EMBL: Europe www.ebi.ac.uk/embl/
 DDBJ: Japan www.ddbj.nig.ac.jp/
 PDB: Protein Data Bank http://www.rcsb.org/pdb/home/home.do
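As an illustration of what "computer-readable form" means in practice, the sketch below reads sequences in FASTA format, a plain-text format widely used for exchanging sequence data. The two records are made-up examples, and the helper names `parse_fasta` and `gc_content` are hypothetical; this is not the access method of any particular database.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {record id: sequence}.

    A line starting with ">" opens a new record; its id is the first
    word after ">". Following lines are joined to form the sequence.
    """
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""
            records[current] = ""
        elif current is not None:
            records[current] += line.upper()
    return records


def gc_content(seq: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical records, for illustration only.
EXAMPLE = """>seq1 made-up example record
ATGGCC
TAA
>seq2 another made-up record
GGGCCC
"""

if __name__ == "__main__":
    for name, seq in parse_fasta(EXAMPLE.splitlines()).items():
        print(name, len(seq), f"GC={gc_content(seq):.2f}")
```

Once data is in a structured text form like this, simple statistics such as length or GC content can be computed for any number of records without manual handling.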
Databases … cont.
Secondary databases: Contain computationally processed or
manually curated information based on primary databases.
 Swiss-Prot: Curated protein database www.ebi.ac.uk/swissprot
 TrEMBL: Translated nucleic acid sequences in EMBL
 PIR: annotated protein sequences
 UniProt: Combined database of Swiss-Prot, TrEMBL, PIR
 Prosite
 PRINTS
 BLOCKS
 Pfam
Specialized databases: cater to a particular research interest
 FlyBase
 HIV Sequence database
 Ribosome database
 OMIM
 Microarray Gene expression database
 ExPASy etc.
We need
Bioinformatics Tools…
To mine (analyze) databases to generate knowledge to
better understand the living systems
 Search/compare databases
 Sequence Analysis
 Genomics
 Phylogenetics
 Structure Prediction
 Molecular Modelling
 Microarrays
 Packages, Misc Apps, Graphics, Scripts
Examples of Bioinformatics Tools
 Database interfaces (Search Tools)
 GenBank/EMBL/DDBJ, Medline, SwissProt, PDB, …
 Sequence alignment
 BLAST, FASTA (Fast All)
 Multiple sequence alignment
 Clustal, MultAlin, DiAlign
 Gene finding
 Genscan, GenomeScan, GeneMark, GRAIL
 Protein Domain analysis and identification
 Pfam, BLOCKS, ProDom
 Pattern Identification/Characterization
 Gibbs Sampler, AlignACE, MEME
 Protein Folding prediction
 PredictProtein, SWISS-MODEL
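To give a feel for what a sequence alignment tool computes, here is a minimal sketch of a global alignment score using the Needleman-Wunsch dynamic programming approach. It is not the algorithm used by BLAST or FASTA, which rely on faster heuristic searches; the scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of a and b (Needleman-Wunsch).

    score[i][j] holds the best score for aligning a[:i] with b[:j].
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = score[i - 1][j] + gap   # a[i-1] aligned to a gap
            gap_in_a = score[i][j - 1] + gap   # b[j-1] aligned to a gap
            score[i][j] = max(diagonal, gap_in_b, gap_in_a)

    return score[-1][-1]


if __name__ == "__main__":
    # Made-up short sequences, for illustration only.
    print(global_alignment_score("ACGT", "AGT"))
```

Real tools add substitution matrices, affine gap penalties, and indexing so that a query can be compared against an entire database rather than a single sequence.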
Five websites
that all biologists should bookmark
 NCBI (The National Center for Biotechnology Information)
 http://www.ncbi.nlm.nih.gov/
 EBI (The European Bioinformatics Institute)
 http://www.ebi.ac.uk/
 The Canadian Bioinformatics Resource
 http://www.cbr.nrc.ca/
 SwissProt/ExPASy (Swiss Bioinformatics Resource)
 http://expasy.cbr.nrc.ca/sprot/
 PDB (The Protein Databank)
 http://www.rcsb.org/PDB/
Anticipated Benefits of
Genome Research & Bioinformatics
Molecular Medicine: Gene Testing, Pharmacogenomics, Gene Therapy
 improve diagnosis of disease
 detect genetic predispositions to disease
 create drugs based on molecular information
 use gene therapy and control systems as drugs
 design “custom drugs” (pharmacogenomics) based on
individual genetic profiles
Microbial Genomics
 rapidly detect and treat pathogens in clinical practice
 develop new energy sources (biofuels)
 monitor environments to detect pollutants
 protect citizenry from biological and chemical warfare
 clean up toxic waste safely and efficiently
DNA Identification (Forensics)
 identify potential suspects whose DNA may
match evidence left at crime scenes
 exonerate persons wrongly accused of
crimes
 establish paternity and other family
relationships
 identify endangered and protected species
as an aid to wildlife officials (could be
 detect bacteria and other organisms that
may pollute air, water, soil, and food
 match organ donors with recipients in
transplant programs
 determine pedigree for seed or livestock
breeds
Benefits …continued
Agriculture, Livestock Breeding, and
Bioprocessing
 grow disease-, insect-, and drought-resistant crops
 breed healthier, more productive, disease-resistant
farm animals
 grow more nutritious produce
 develop biopesticides
 incorporate edible vaccines into food
products
 develop new environmental cleanup uses for
plants like tobacco
ELSI: Ethical, Legal,
and Social Issues
• Privacy and confidentiality of genetic information.
• Fairness in the use of genetic information by insurers, employers,
courts, schools, adoption agencies, and the military, among others.
• Psychological impact, stigmatization, and discrimination due to an
individual’s genetic differences.
• Reproductive issues including adequate and informed consent and
use of genetic information in reproductive decision making.
• Clinical issues including the education of doctors and other health-
service providers, people identified with genetic conditions, and the
general public about capabilities, limitations, and social risks; and
implementation of standards and quality-control measures.
• Health and environmental issues concerning genetically modified (GM) foods
and microbes.
• Commercialization of products including property rights (patents,
copyrights, and trade secrets) and accessibility of data and materials.
Common Questions
of a Student of biology
Bioinformatics: issues and challenges (presentation at S P College)
