Protein Databases
Post last modified: 2023-12-09

Biological information about proteins is available as sequences and structures. A sequence is one-dimensional (a linear string of amino acids), whereas a structure holds the three-dimensional coordinates of the folded chain. Protein databases are collections of data about proteins, which may include a protein’s amino acid sequence, conformation, structure, and features such as active sites.

Protein databases

Primary databases hold protein sequences inferred by conceptual translation of nucleotide sequences. These sequences are not derived experimentally; they come from interpreting nucleotide sequence data. Consequently, they must be treated as potentially containing errors of interpretation.
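To make "conceptual translation" concrete, here is a minimal Python sketch that translates a coding sequence with the standard genetic code. The function name translate and the example sequences are illustrative assumptions, not part of any database's pipeline. Real annotation also has to choose the reading frame, the start codon, and the intron boundaries, which is where misinterpretation can creep in.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence in frame 1, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unrecognized codon
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCAAATAA"))
```

Shifting the start position by one base would give a completely different protein, which illustrates why translated entries deserve some caution.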

 

Protein sequence databases

UniProt (Universal Protein Resource) is a comprehensive resource for protein sequence and annotation data including functional information.  The databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc).  UniProt is a collaboration between the European Bioinformatics Institute (EMBL-EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR) of Georgetown University Medical Center.  The combined effort replaced the databases previously produced by these institutes: Swiss-Prot, TrEMBL (Translated EMBL Nucleotide Sequence Data Library), and Protein Sequence Database (PIR-PSD).
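Sequences downloaded from UniProtKB in FASTA format carry a header line that records which section an entry comes from: sp for Swiss-Prot (reviewed) or tr for TrEMBL (unreviewed). The sketch below parses such a header, assuming the commonly seen layout ">db|accession|entry_name description OS=... OX=... GN=... PE=... SV=...". The function name parse_uniprot_header is hypothetical and the sample header is only an illustration.

```python
import re

HEADER_RE = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s+(.*)$")
FIELD_RE = re.compile(r"(OS|OX|GN|PE|SV)=(.*?)(?= (?:OS|OX|GN|PE|SV)=|$)")


def parse_uniprot_header(line: str) -> dict:
    """Split a UniProtKB FASTA header into its main parts."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("not a UniProtKB-style FASTA header")
    db, accession, entry_name, rest = match.groups()
    return {
        "section": "Swiss-Prot" if db == "sp" else "TrEMBL",
        "accession": accession,
        "entry_name": entry_name,
        "protein_name": rest.split(" OS=")[0],
        "fields": dict(FIELD_RE.findall(rest)),  # OS, OX, GN, PE, SV when present
    }


example = ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2"
print(parse_uniprot_header(example))
```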

neXtProt is a human-centric protein discovery platform developed by the Swiss Institute of Bioinformatics (SIB). It aims to be a comprehensive resource on human proteins, covering their function, subcellular location, expression, interactions, and role in diseases, and it lets users navigate seamlessly through integrated protein-related data.

DisProt is a manually curated biological database of intrinsically disordered proteins (IDPs) and regions (IDRs) collected from the literature.  It is hosted and maintained in the BioComputing UP laboratory (University of Padua, Italy).

MobiDB is a database of intrinsic protein disorder annotation from the University of Padua, Italy. It provides information about intrinsically disordered regions (IDRs) and related features from various sources and prediction tools. By combining different sources of protein disorder data into a consensus annotation, it aims to give the best possible picture of the “disorder landscape” of a given protein of interest.
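The idea of a consensus annotation can be illustrated with a simple per-residue majority vote. This is only a sketch of the general principle: the function name consensus_disorder, the "D"/"-" encoding, and the strict-majority rule are assumptions made for illustration, and MobiDB's actual consensus procedure may differ.

```python
from typing import List


def consensus_disorder(predictions: List[str]) -> str:
    """Per-residue majority vote over equal-length annotation strings.

    Each string uses 'D' for a disordered residue and '-' for an ordered one.
    A residue is called disordered when more than half of the sources agree.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must cover the same sequence length")
    consensus = []
    for column in zip(*predictions):  # one column per residue
        votes = column.count("D")
        consensus.append("D" if votes * 2 > len(column) else "-")
    return "".join(consensus)


# Three hypothetical sources annotating the same six-residue protein.
print(consensus_disorder(["DDD---", "DD----", "D--D--"]))
```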

InterPro is a resource that provides functional analysis of protein sequences by classifying them into families and predicting the presence of domains and important sites. It unifies the data from its consortium member databases and also adds relevant Gene Ontology (GO) terms.

Pfam is a database of protein families, produced by EMBL-EBI, that includes their annotations and multiple sequence alignments.

PROSITE is a database of protein families and domains.  It consists of entries describing the protein families, domains, and functional sites as well as amino acid patterns and profiles in them. These are manually curated by a team of the Swiss Institute of Bioinformatics and tightly integrated into Swiss-Prot protein annotation (now in UniProt).
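PROSITE patterns use a compact syntax: elements are separated by "-", "x" matches any residue, "[...]" lists allowed residues, "{...}" lists forbidden residues, "(n)" or "(n,m)" gives a repeat count, and "<" and ">" anchor the pattern to the sequence ends. The sketch below converts that syntax to a Python regular expression. The function names are hypothetical, and the rare case of ">" inside square brackets is not handled.

```python
import re

# Character-by-character mapping from PROSITE pattern syntax to regex syntax.
_PROSITE_TO_REGEX = str.maketrans({
    "-": "",      # element separator
    "x": ".",     # any residue
    "{": "[^",    # forbidden residues
    "}": "]",
    "(": "{",     # repeat count
    ")": "}",
    "<": "^",     # N-terminal anchor
    ">": "$",     # C-terminal anchor
})


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE pattern such as 'N-{P}-[ST]-{P}.' to a regex string."""
    return pattern.rstrip(".").translate(_PROSITE_TO_REGEX)


def find_sites(pattern: str, sequence: str) -> list:
    """Return 0-based start positions of all (possibly overlapping) matches."""
    regex = prosite_to_regex(pattern)
    return [m.start() for m in re.finditer(f"(?={regex})", sequence)]


# PS00001, the N-glycosylation site pattern, on a made-up sequence.
print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(find_sites("N-{P}-[ST]-{P}.", "AANGSAA"))
```

Short patterns like this one match by chance very often, so a hit is a hint to follow up rather than evidence of function.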

PRINTS, by The University of Manchester, is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterize a protein family; its diagnostic power is refined by iterative scanning of a Swiss-Prot/TrEMBL composite. Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space.  Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs, full diagnostic potency deriving from the mutual context provided by motif neighbors.

SUPERFAMILY is a database and search platform of structural and functional annotation for all proteins and genomes.

The NCBI (National Center for Biotechnology Information) Protein database is a collection of sequences from several sources, including translations from annotated coding regions in GenBank and RefSeq, as well as records from Swiss-Prot, Protein Information Resource (PIR), Protein Research Foundation (PRF), and Protein Data Bank (PDB).

 

Protein structure databases

The Protein Data Bank (PDB) archive of three-dimensional structural data for large biological molecules, such as proteins and nucleic acids, is managed by the Worldwide PDB (wwPDB) organization. The data are freely accessible on the Internet via the websites of the member organizations:

Protein Data Bank in Europe (PDBe).

Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB), USA.

Protein Data Bank Japan (PDBj).

Biological Magnetic Resonance Data Bank (BMRB), University of Wisconsin.

Electron Microscopy Data Bank (EMDB), first founded at EMBL-EBI.
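To show what "three-dimensional structural data" looks like in practice, the sketch below reads one ATOM record from the legacy fixed-column PDB file format and extracts the atom's coordinates. The Atom type and the parse_atom_line function are illustrative names, and the sample record is built in code rather than taken from a real entry. The wwPDB members also distribute the PDBx/mmCIF format, which this sketch does not cover.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse one ATOM/HETATM record using the fixed column positions."""
    if not line.startswith(("ATOM  ", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain_id=line[21],             # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38, in angstroms
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
    )


# A made-up record, assembled field by field so the columns line up.
SAMPLE_LINE = (
    f"{'ATOM':<6}{1:>5} {' N':<4} {'MET':>3} {'A'}{1:>4}    "
    f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}{1.00:6.2f}{9.67:6.2f}"
)
print(parse_atom_line(SAMPLE_LINE))
```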

AlphaFold Protein Structure Database, developed by DeepMind and EMBL-EBI, provides open access to protein structure predictions for the human proteome and other key proteins of interest. It relies on AlphaFold, an AI system that predicts a protein’s 3D structure from its amino acid sequence.

ModelArchive, by the Swiss Institute of Bioinformatics and Biozentrum, University of Basel, contains structures generated using computational methods (in silico structures).

CATH (Class Architecture Topology Homology) database is a free, publicly available online resource that provides information on the evolutionary relationships of protein domains. It was created and is continued by the Orengo group at University College London (UCL).
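CATH identifies each domain superfamily with a four-part numeric code, one number per level of the hierarchy (class, architecture, topology, homologous superfamily). The sketch below splits such a code into its levels. The function name parse_cath_code is hypothetical, and the code 3.40.50.300 is used only as an example of the format.

```python
from typing import Dict

LEVELS = ("class", "architecture", "topology", "homologous_superfamily")


def parse_cath_code(code: str) -> Dict[str, str]:
    """Map each CATH level to the prefix of the code that identifies it."""
    parts = code.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError("expected a code of the form C.A.T.H, e.g. 3.40.50.300")
    return {level: ".".join(parts[: i + 1]) for i, level in enumerate(LEVELS)}


print(parse_cath_code("3.40.50.300"))
```

Two domains that share a longer prefix sit closer together in the hierarchy, so comparing prefixes gives a quick first look at how related they are.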

SCOP & SCOP2: Structural Classification of Proteins database, maintained by the Laboratory of Molecular Biology in Cambridge, England.

SCOPe: Structural Classification of Proteins — extended, developed at the Berkeley Lab and University of California, Berkeley.

Database of Macromolecular Movements describes the motions that occur in proteins and other macromolecules, particularly using movies. Associated with it are a variety of free software tools and servers for structural analysis.

Jena Library of Biological Macromolecules (JenaLib) is aimed at a better dissemination of information on three-dimensional biopolymer structures with an emphasis on visualization and analysis.  It provides access to all structure entries deposited at the Protein Data Bank (PDB) or at the Nucleic Acid Database (NDB).  In addition, basic information on the architecture of biopolymer structures is available.

ModBase is a database of comparative protein structure models.

OCA is a browser-database for protein structure/function.  It integrates information from several databases.

OPM (Orientations of Proteins in Membranes) database provides spatial positions of membrane protein structures with respect to the lipid bilayer.

PDBsum (Pictorial database of 3D structures in the Protein Data Bank) provides an at-a-glance overview of the contents of each 3D structure deposited in the Protein Data Bank (PDB). It shows the molecule(s) that make up the structure (i.e., protein chains, DNA, ligands, and metal ions) and schematic diagrams of the interactions between them.

PDBTM is a comprehensive selection of transmembrane proteins from the Protein Data Bank (PDB).

Proteopedia is a free, collaborative, user-friendly wiki that serves as a 3D encyclopedia of proteins and other biomolecules.

ProtCID (Protein Common Interface Database) is a database of similar protein-protein interfaces in crystal structures of homologous proteins.

NIH protein database.

Protein Lounge contains multiple databases, pathways, animations, and tutorials (paid subscription).

SWISS-MODEL Repository is a database of annotated 3D protein structure models calculated by homology modeling.

SIMAP (Similarity Matrix of Proteins) is a database of protein sequence similarities and functional annotations.

 

Protein interactions databases

BioGRID (Biological General Repository for Interaction Datasets) is a curated biological database of protein-protein interactions, genetic interactions, chemical interactions, and post-translational modifications.

RBPDB (RNA-binding Proteins Database) is a biological database of RNA-binding protein specificities that includes experimental observations of RNA-binding sites.

DIP (Database of Interacting Proteins) is a biological database which catalogs experimentally determined interactions between proteins.  It combines information from a variety of sources to create a single, consistent set of protein–protein interactions.
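Combining interaction records from several sources into one consistent set mostly comes down to treating A-B and B-A as the same interaction and remembering where each one was reported. The sketch below shows that idea with made-up source names and protein identifiers. It is an illustration of the principle, not DIP's actual curation pipeline.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple


def merge_interactions(
    sources: Dict[str, Iterable[Tuple[str, str]]]
) -> Dict[FrozenSet[str], Set[str]]:
    """Merge undirected protein pairs and record which sources report each."""
    merged: Dict[FrozenSet[str], Set[str]] = {}
    for source_name, pairs in sources.items():
        for protein_a, protein_b in pairs:
            key = frozenset((protein_a, protein_b))  # order-independent pair
            merged.setdefault(key, set()).add(source_name)
    return merged


# Hypothetical inputs: the same interaction reported in both orientations.
reported = {
    "source_1": [("P1", "P2"), ("P2", "P3")],
    "source_2": [("P2", "P1")],
}
for pair, seen_in in merge_interactions(reported).items():
    print(sorted(pair), sorted(seen_in))
```

The number of independent sources behind a pair is a simple, if rough, indicator of how well supported that interaction is.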

IntAct (Molecular Interaction Database) provides a free, open-source database system and analysis tools for molecular interaction data.

Complex Portal is a manually curated, encyclopedic resource of macromolecular complexes from a number of key model organisms. Most complexes are made up of proteins but may also include nucleic acids or small molecules.

 

Protein expression databases

The Human Protein Atlas is a Swedish-based program initiated in 2003 with the aim of mapping all the human proteins in cells, tissues, and organs by integrating various omics technologies, including antibody-based imaging, mass spectrometry-based proteomics, transcriptomics, and systems biology.

 
