Protein sequence databases
Introduction:
The Protein database is a collection of sequences from several sources, including translations from
annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt, PIR,
PRF, and PDB. Protein sequences are the fundamental determinants of biological structure and
function.
SWISS-PROT
– Manually curated
– High-quality annotations, less data
GenPept/TREMBL
– Translated coding sequences from GenBank/EMBL
– Few annotations, more up to date
PIR
– Phylogenetic-based annotations
All three have now combined their efforts to form UniProt (http://www.uniprot.org)
PDB (Protein Data Bank)
– Stores 3-dimensional atomic coordinates for biological molecules, including proteins and
nucleic acids
– Data obtained by X-ray crystallography, NMR, or computer modelling
http://www.rcsb.org/pdb/
MMDB (Molecular Modeling Database)
Over 28,000 3D macromolecular structures, including proteins and
polynucleotides (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure)
SCOP (Structural Classification of Proteins)
Classification of proteins according to structural and evolutionary relationships
SWISS-PROT
Introduction:
SWISS-PROT is an annotated protein sequence database that was created at the
Department of Medical Biochemistry of the University of Geneva and, since 1987, has been a
collaborative effort of the Department and the European Molecular Biology Laboratory (EMBL).
SWISS-PROT is now an equal partnership between the EMBL and the Swiss Institute of
Bioinformatics (SIB). The EMBL activities are carried out by its Hinxton Outstation, the European
Bioinformatics Institute (EBI). The SWISS-PROT protein sequence database consists of sequence
entries. Sequence entries are composed of different line types, each with its own format.
The SWISS-PROT database distinguishes itself from other protein sequence databases by three
distinct criteria:
(i) annotations,
(ii) minimal redundancy, and
(iii) integration with other databases.
Annotations
CORE DATA
• The sequence data
• The citation information (bibliographical references)
• The taxonomic data (description of the biological source of the protein)
Annotation - Additional Data
• Descriptions include:
• Function(s) of the protein
• Posttranslational modification(s) such as carbohydrates, phosphorylation, acetylation and
GPI-anchor
• Domains and sites, for example, calcium-binding regions, ATP-binding sites, zinc fingers,
homeoboxes, and SH2 and SH3 domains
• Secondary structure, e.g. alpha helix, beta sheet
• Quaternary structure, e.g. homodimer, heterotrimer, etc.
• Similarities to other proteins
• Disease(s) associated with any number of deficiencies in the protein
• Sequence conflicts, variants, etc.
Minimal Redundancy
• Much of the data comes from more than one literature report
• Data are condensed and merged so that each entry is concise and coherent
• Conflicts in data are listed for each entry
Integration with other databases
• 50+ databases for cross-reference
• Nucleic acid sequences, protein tertiary structure, protein 3-D models, etc.
• Allows SWISS-PROT to play a major role as the focal point for biomolecular
interconnectivity
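As a rough illustration of how these cross-references can be used, the sketch below groups the identifiers found in DR (database cross-reference) lines by the database they point to. It assumes the usual flat-file layout, in which the data start at column 6 and fields are separated by semicolons with the database name first. The function name `cross_references` and the sample lines (with placeholder identifiers such as X00000 and 0XXX) are hypothetical and are not taken from a real entry.

```python
def cross_references(dr_lines):
    """Group primary identifiers by the database named in each DR line."""
    refs = {}
    for line in dr_lines:
        if not line.startswith("DR"):
            continue  # ignore any line that is not a cross-reference
        # Data start at column 6; drop the trailing period, split on ';'.
        fields = [f.strip() for f in line[5:].rstrip(".").split(";")]
        database, primary_id = fields[0], fields[1]
        refs.setdefault(database, []).append(primary_id)
    return refs


# Made-up DR lines with placeholder identifiers (assumption, not real data).
SAMPLE_DR = [
    "DR   EMBL; X00000; CAA00000.1; -; mRNA.",
    "DR   PDB; 0XXX; X-ray; 2.00 A; A=1-12.",
    "DR   PDB; 0YYY; NMR; -; A=1-12.",
]

refs = cross_references(SAMPLE_DR)
print(refs)
```

Grouping by database name makes it easy to ask, for example, whether an entry has any structure records before looking them up in PDB.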
Documentation
• All files documented and indexed
• Documentation kept up-to-date
Applications for the Knowledgebase
• Provides highly organized data and information on a wide variety of proteins
• Can be used as a starting point for protein research
• Allows searches to be conducted starting with various search strings
• Biochemical encyclopedia
SWISS-PROT Flat File format
ID - Identification.
AC - Accession number(s).
DT - Date.
DE - Description.
GN - Gene name(s).
OS - Organism species.
OG - Organelle.
OC - Organism classification.
RN - Reference number.
RP - Reference position.
RC - Reference comments.
RX - Reference cross-references.
RA - Reference authors.
RL - Reference location.
CC - Comments or notes.
DR - Database cross-references.
KW - Keywords.
FT - Feature table data.
SQ - Sequence header.
(blanks) - Sequence data.
// - Termination line.
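The line codes above can be read with a very small parser. The following is a minimal sketch, not an official tool: it assumes the two-letter code occupies columns 1-2, the data start at column 6, sequence lines carry a blank code, and each entry ends with `//`. The function name `parse_entries` and the sample entry (placeholder accession P00000, invented sequence) are hypothetical and serve only to exercise the code.

```python
from collections import defaultdict

# Invented entry for illustration only; no real protein is described here.
SAMPLE_ENTRY = """\
ID   EXAMPLE_HUMAN           Reviewed;          12 AA.
AC   P00000;
DE   RecName: Full=Hypothetical example protein;
OS   Homo sapiens (Human).
KW   Example.
SQ   SEQUENCE   12 AA;  1300 MW;  0000000000000000 CRC64;
     MKTAYIAKQR QI
//
"""


def parse_entries(text):
    """Yield one dict per entry, mapping each line code to its line contents."""
    entry = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):  # termination line closes the entry
            if entry:
                yield dict(entry)
            entry = defaultdict(list)
            continue
        if not line.strip():
            continue  # skip empty lines between entries
        code, content = line[:2], line[5:].strip()
        if code == "  ":
            # Blank code: sequence data; remove the spacing between blocks.
            entry["sequence"].append(content.replace(" ", ""))
        else:
            entry[code].append(content)


entries = list(parse_entries(SAMPLE_ENTRY))
first = entries[0]
sequence = "".join(first["sequence"])
print(first["AC"], sequence)
```

Keeping every line type as a list preserves multi-line fields such as DE, CC, or FT; a fuller parser would go on to interpret each line type according to its own format.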
