The following glossary gives some bioinformatics terms frequently encountered by researchers and readers.
1KGP (The 1000 Genomes Project) is a collaboration among research groups in the US, UK, China and Germany that ran between 2008 and 2015 to produce an extensive catalog of human genetic variation.
23andMe is a publicly held personal genomics and biotechnology company best known for providing a direct-to-consumer genetic testing service to generate reports relating to the customer’s ancestry and genetic predispositions to health-related topics.
ABE = Adenine (Adenosine) Base Editor.
Accession number is a unique identifier given to a sequence when it is submitted to one of the DNA repositories (GenBank, EMBL, DDBJ).
AI = Artificial Intelligence.
Allele is one of two or more variant forms of a given gene.
Allelic heterogeneity is the phenomenon in which different mutations at the same locus lead to the same or very similar phenotypes. See also Gene polymorphism.
Aneuploidy is the presence of an abnormal number of chromosomes in a cell, for example a human cell having 45 or 47 chromosomes instead of the usual 46.
Base calling is the process of assigning nucleobases to the chromatogram peaks or electrical current changes produced by a nucleic acid sequencer.
Biocuration is the activity of organizing, representing, and making biological data accessible to both humans and computers.
BioGRID (Biological General Repository for Interaction Datasets) is a curated biological database of protein-protein interactions, genetic interactions, chemical interactions, and post-translational modifications.
BioGRID ORCS (BioGRID Open Repository of CRISPR Screens) is a curated repository of CRISPR screens.
Bioinformatics is an interdisciplinary field of science that combines biology, computer science, information engineering, mathematics and statistics to analyze and interpret biological data.
Biological databases are collections of biological data that are organized so that their contents can easily be accessed, managed, and updated.
BioMart is a freely available, open-source, federated database system that provides unified access to disparate, geographically distributed data sources. It allows databases hosted on different servers to be presented seamlessly to users, facilitating collaborative projects.
BLAST = Basic Local Alignment Search Tool. It finds regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of the matches.
BLAT (BLAST-like alignment tool) is a pairwise sequence alignment algorithm.
BMRB = Biological Magnetic Resonance Data Bank (BioMagResBank).
CASP = Critical Assessment of Protein Structure Prediction. It is a worldwide competition for protein structure prediction taking place every two years since 1994.
CATH (Class Architecture Topology Homology) is a protein structure classification database.
CBE = Cytosine Base Editor.
cDNA = complementary DNA.
Central dogma of molecular biology is an explanation of the flow of genetic information within a biological system. It is often stated as “DNA makes RNA, and RNA makes protein”.
Centromere is the specialized DNA sequence of a chromosome that links a pair of sister chromatids (a dyad). During mitosis, spindle fibers attach to the centromere.
ceRNA = competing endogenous RNA. ceRNAs compete with mRNAs for binding to the same miRNAs, thereby regulating the ability of those miRNAs to inhibit the translation of mRNAs into proteins.
cfDNA = cell-free DNA.
Chromatids (sister chromatids) are the identical copies of a chromosome formed by DNA replication, with both copies joined together by a common centromere. In other words, a sister chromatid may also be said to be ‘one-half’ of the duplicated chromosome. The two sister chromatids are separated from each other into two different cells during mitosis or during the second division of meiosis.
Chromatin is the mass of DNA and proteins that condense to form chromosomes.
Chromosomes are long DNA molecules with part or all the genetic material of an organism. Most eukaryotic chromosomes include packaging proteins called histones which, aided by chaperone proteins, bind to and condense the DNA molecule to maintain its integrity.
circRNA = circular RNA.
CNCB = China National Center for Bioinformation, includes the National Genomics Data Center (NGDC).
Comparative modeling = homology modeling.
Complex Portal is a manually curated, encyclopedic resource of macromolecular complexes from a number of key model organisms.
Consensus sequence is the calculated order of most frequent residues, either nucleotide or amino acid, found at each position in a sequence alignment; a DNA sequence common to different organisms and having a similar function in each.
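The first sense of the definition above (the most frequent residue at each alignment position) can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: the function name `consensus` and the alphabetical tie-breaking rule are assumptions made for the example.

```python
from collections import Counter


def consensus(aligned_seqs):
    """Return the most frequent residue at each column of an alignment.

    Assumes every sequence has the same length (already aligned).
    Ties are broken alphabetically; this rule is an arbitrary choice
    made only so that the example is deterministic.
    """
    if not aligned_seqs:
        return ""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")

    result = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        top = max(counts.values())
        # Pick the alphabetically first residue among the most frequent ones.
        result.append(min(res for res, n in counts.items() if n == top))
    return "".join(result)


if __name__ == "__main__":
    alignment = ["ATGCA", "ATGTA", "ACGCA"]
    print(consensus(alignment))
```

Real consensus builders usually also handle gap characters and report ambiguity codes where no residue clearly dominates; both are omitted here.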
Copy number variation (CNV) is a phenomenon in which sections of the genome are repeated and the number of repeats in the genome varies between individuals.
CRISPR = Clustered Regularly Interspaced Short Palindromic Repeats, a family of DNA sequences found in the genomes of prokaryotic organisms.
CRISPR screening is a large-scale genetic loss-of-function experimental approach designed to find a small number of important genes or genetic sequences within a massive number of genetic sequences such as the entire genome.
Crossing over (chromosomal crossover) is the exchange of genetic material during sexual reproduction between two homologous chromosomes that results in recombinant chromosomes. Crossover usually occurs when matching regions on matching chromosomes break and then reconnect to the other chromosome.
ctDNA = circulating tumor DNA.
ddPCR = droplet digital PCR (Polymerase Chain Reaction).
DIP (Database of Interacting Proteins) is a biological database which catalogs experimentally determined interactions between proteins.
DisProt is a manually curated biological database of intrinsically disordered proteins (IDPs). See Biocuration.
DNA sequencing is the process of determining the nucleic acid sequence (the types and order of nucleotides in DNA).
dPCR = digital PCR (Polymerase Chain Reaction).
dsRNA = double-stranded RNA.
Ecogenomics is the study of genetic material recovered directly from environmental samples; it is also called environmental genomics or metagenomics.
EMBL (European Molecular Biology Laboratory) is a molecular biology research institution supported by its member states. It is an intergovernmental organization, created in 1974.
EMBL-EBI (European Bioinformatics Institute) is an intergovernmental organization (IGO) which, as part of the European Molecular Biology Laboratory (EMBL) family, focuses on research and services in bioinformatics.
EMDB = Electron Microscopy Data Bank.
ENA (European Nucleotide Archive) is a repository providing free and unrestricted access to annotated DNA and RNA sequences. It is produced and maintained by the European Bioinformatics Institute and is a member of the International Nucleotide Sequence Database Collaboration (INSDC).
Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation.
Epigenetics is the science that studies heritable changes caused by the activation and deactivation of genes without any change in the underlying DNA sequence of the organism.
Epigenome is the complete description of all the chemical modifications to DNA and histone proteins that regulate the expression of genes within the genome.
Epigenomics is the study of the complete set of chemical modifications on the genetic material and histone proteins of a cell.
Euploidy is the state of a cell or organism having one or more complete sets of chromosomes, possibly excluding the sex-determining chromosomes. The chromosome count of a euploid karyotype is consequently a multiple of the haploid number, which in humans is 23.
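The arithmetic behind the euploidy and aneuploidy entries can be illustrated with a short Python sketch. The function name `classify_ploidy` is hypothetical, and the check looks only at the total chromosome count, so it would miss a karyotype in which one extra and one missing chromosome cancel out.

```python
HUMAN_HAPLOID_NUMBER = 23  # haploid number in humans, as stated in the glossary


def classify_ploidy(chromosome_count, haploid_number=HUMAN_HAPLOID_NUMBER):
    """Classify a total chromosome count as 'euploid' or 'aneuploid'.

    Simplifying assumption: a count is called euploid when it is an exact
    multiple of the haploid number. Only the total is inspected.
    """
    if chromosome_count <= 0:
        raise ValueError("chromosome count must be positive")
    if chromosome_count % haploid_number == 0:
        return "euploid"
    return "aneuploid"


if __name__ == "__main__":
    for count in (23, 45, 46, 47, 69):
        print(count, classify_ploidy(count))
```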
Exposome is a concept used to describe environmental exposures that an individual encounters throughout life, and how these exposures impact biology and health. It encompasses both external and internal factors, including chemical, physical, biological, and social factors that may influence human health.
FAST5 is an HDF5-based file format used by Oxford Nanopore Technologies sequencers to store raw signal data and, optionally, basecalled nucleotide sequences.
FASTA is a DNA and protein sequence alignment software package.
FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes.
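As a rough illustration of the FASTA format, the Python sketch below reads records in which a header line starting with `>` is followed by one or more lines of single-letter codes. The function name `parse_fasta` and the two example records are invented for this sketch; established libraries such as Biopython provide more robust parsers.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}.

    A header line starts with '>'; the sequence lines that follow it are
    concatenated until the next header. Blank lines are ignored.
    """
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Made-up example records: one nucleotide, one protein.
EXAMPLE = """>seq1 example nucleotide record
ATGCGT
ACGT
>seq2 example protein record
MKV
"""

if __name__ == "__main__":
    for name, seq in parse_fasta(EXAMPLE).items():
        print(name, seq)
```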
FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.
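A FASTQ record can be sketched as four lines: a header starting with `@`, the sequence, a separator line starting with `+`, and a quality string with one character per base. The Python sketch below assumes this simple four-line layout and the widely used Phred+33 quality encoding (score = ASCII code minus 33); the function name `parse_fastq` and the example read are hypothetical.

```python
def parse_fastq(text):
    """Parse four-line FASTQ records into (name, sequence, scores) tuples.

    Assumptions: each record occupies exactly four lines and quality
    characters use the Phred+33 encoding.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")

    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Phred+33: subtract 33 from each character's ASCII code.
        scores = [ord(ch) - 33 for ch in qual]
        records.append((header[1:], seq, scores))
    return records


# Made-up example read.
EXAMPLE = "@read1\nACGT\n+\nII5!\n"

if __name__ == "__main__":
    for name, seq, scores in parse_fastq(EXAMPLE):
        print(name, seq, scores)
```

Older data may use other quality offsets, so the encoding should be confirmed before scores are interpreted.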
FGED = Functional Genomics Data Society, 1999-2021.
gDNA = genomic DNA.
GenBank sequence database is an open access, annotated collection of all publicly available nucleotide sequences and their protein translations. It is produced and maintained by the National Center for Biotechnology Information as part of the International Nucleotide Sequence Database Collaboration (INSDC).
Gene is a basic unit of heredity and a sequence of nucleotides in DNA that encodes the synthesis of a gene product, either RNA or protein.
Gene calling is the process of identifying regions of genomic DNA that encode genes.
Gene drive is a genetic element that introduces a bias in the relative chance of inheritance between distinct versions of a set of genes, enabling one to spread rapidly in a population at the expense of others even if it is disadvantageous to the organism. Gene drives could be exploited for the genetic modification of whole populations, such as disease-carrying insects.
Gene expression is the process by which information from a gene is used to synthesize a functional gene product, either protein or non-coding RNA, which ultimately affects a phenotype.
Gene Ontology (GO) initiative provides a computational representation of our current scientific knowledge about gene products (molecular function, cellular location, and the biological process) in a way that is unified across all species.
Gene polymorphism is the occurrence of more than one allele at the same locus within a population. When different mutations at one locus lead to the same or very similar phenotypes, the phenomenon is called allelic heterogeneity.
Gene silencing is the regulation of gene expression during either transcription or translation to prevent the expression of a certain gene.
Genetic genealogy is the use of DNA testing to establish relationships between individuals and determine ancestry.
Genetics is a branch of biology concerned with the study of genes, genetic variation, and heredity in organisms.
Genome is an organism’s complete set of DNA, including all of its genes as well as its hierarchical, three-dimensional structural configuration. The genome includes both the genes (the coding regions) and the noncoding DNA, as well as mitochondrial DNA and chloroplast DNA. It consists of nucleotide sequences of DNA (or RNA in RNA viruses).
Genome annotation is the identification of the locations of genes on raw DNA sequences and determining what those genes do.
Genomics is an interdisciplinary field of biology focusing on the structure, function, evolution, mapping, and editing of genomes.
Genotype is the complete set of genetic material of an organism. The term is often used to refer to a single gene or set of genes, such as the genotype for eye color.
GEO = Gene Expression Omnibus, a public functional genomics data repository.
GIX (Gene Information eXtension) is a browser extension that allows retrieving information about a gene product directly on any webpage by double clicking an official gene name, synonym or supported accession.
GO terms and relations are vocabulary used in the Gene Ontology project.
gRNA = guide RNA, component of the CRISPR-Cas9 gene editing system.
Haplotype (haploid genotype) is a group of alleles that are inherited together from a single parent.
HapMap (short for “haplotype map”) is the nickname of the International HapMap Project, an effort to build a haplotype map of the human genome that was pursued most intensively from about 2003 to 2006.
HDF5 = Hierarchical Data Format version 5, an open-source file format that supports large, complex, heterogeneous data. Data are stored in groups and datasets in the file.
HMDB = The Human Metabolome Database.
hnRNA = heterogeneous nuclear ribonucleic acid, collective term for the primary RNA transcripts and all intermediates in the synthesis of mature mRNA molecules in the nucleus.
Homologous chromosomes are the maternal and paternal copies of a chromosome, which pair up with each other inside a cell during meiosis.
Homology modeling, also called comparative modeling or template-based modeling (TBM), refers to prediction of a target protein 3D structure using a known experimental structure of a homologous protein (the template).
Human Genome Project (HGP) was an international scientific research project with the goal of determining the base pairs that make up human DNA, and of identifying and mapping all the genes of the human genome from both a physical and a functional standpoint.
Human Protein Atlas is a program that aims to map all the human proteins in cells, tissues, and organs.
Indel = insertion or deletion of bases in the genome of an organism.
INSDC = International Nucleotide Sequence Database Collaboration. It involves the following computerized databases: DNA Data Bank of Japan (Japan), GenBank (USA) and the European Nucleotide Archive (UK). New and updated data are synchronized on a daily basis.
IntAct (Molecular Interaction Database) is a free, open-source database system and analysis tools for molecular interaction data.
InterPro is a database resource that unifies the data from the consortium members and provides functional analysis of protein sequences by classifying them into families.
JenaLib = Jena Library of Biological Macromolecules.
KEGG = Kyoto Encyclopedia of Genes and Genomes.
lncRNA (long non-coding RNA) is a transcript longer than 200 nucleotides that is not translated into protein.
Locus is a specific, fixed position on a chromosome where a particular gene or genetic marker is located.
Locus heterogeneity occurs when mutations at multiple genomic loci can produce the same phenotype, and each individual mutation is sufficient to cause the specific phenotype independently.
Meiosis (reductional division) is a special type of cell division of germ cells in sexually reproducing organisms used to produce the gametes, such as sperm or egg cells. It involves two rounds of division that ultimately result in four cells with only one copy of each chromosome (haploid).
Metabolome is the global collection of all low molecular weight metabolites that are produced by cells during metabolism; that is, the complete set of small-molecule chemicals found within a biological sample.
Metabolomics is the scientific study of the set of metabolites present within an organism, cell, or tissue.
Metagenomics is the study of a collection of genetic material (genomes) from a mixed community of organisms, usually referring to the study of microbial communities. The broad field may also be referred to as environmental genomics, ecogenomics or community genomics.
MIAME = Minimum Information About a Microarray Experiment, a standard created by the FGED Society for reporting microarray experiments.
Microbiome is all the microorganisms in a particular environment (including the body or a part of the body).
miRNA (microRNA) is a small single-stranded non-coding RNA molecule (containing about 22 nucleotides) that functions in RNA silencing and post-transcriptional regulation of gene expression.
Mitosis is a part of the cell cycle in which replicated chromosomes are separated into two new nuclei. Cell division gives rise to genetically identical cells in which the total number of chromosomes is maintained. Therefore, mitosis is also known as equational division.
MobiDB is a database of intrinsic protein disorder annotation.
ModBase is a database of comparative protein structure models.
Molecular biology is the branch of biology that seeks to understand the molecular basis of biological activity, including molecular synthesis, modification, mechanisms, and interactions. It sits at the intersection of biochemistry and genetics.
MRC = Medical Research Council, United Kingdom.
mtDNA = mitochondrial DNA.
Multiomics is a biological analysis approach in which the data sets are multiple “omes”, such as the genome, proteome, transcriptome, epigenome, metabolome, and microbiome (i.e., a meta-genome and/or meta-transcriptome, depending upon how it is sequenced). In other words, it is the use of multiple omics technologies to study life in a concerted way.
Multiplex RT-PCR = Multiplex Reverse Transcription-Polymerase Chain Reaction, amplification of different nucleic acid targets in the same reaction.
Mutagen is a physical or chemical agent that permanently changes genetic material, usually DNA, in an organism and thus increases the frequency of mutations above the natural background level.
Mutation is an alteration in the nucleotide sequence of the genome.
Natural selection is the differential survival and reproduction of individuals due to differences in phenotype.
NCBI = National Center for Biotechnology Information, part of the United States NLM.
ncRNA (non-coding RNA) is RNA that is not translated into a protein product.
NDB = Nucleic Acid Database.
neXtProt is a comprehensive human-centric protein discovery platform.
NGDC (National Genomics Data Center) is part of the China National Center for Bioinformation (CNCB). It provides open access to a suite of data resources.
NGS = Next-generation sequencing, high-throughput sequencing of DNA.
NIH = National Institutes of Health, the primary agency of the United States government responsible for biomedical and public health research.
NLM = National Library of Medicine, branch of the United States NIH.
OCA is a browser-database for protein structure/function.
Omics Discovery Index is a web site that can be used to browse and search several biological databases.
OMIM (Online Mendelian Inheritance in Man) is an online daily updated catalogue of human genes and genetic disorders, with a particular focus on the gene-phenotype relationship.
OPM = Orientations of Proteins in Membranes, a database that provides spatial positions of membrane protein structures with respect to the lipid bilayer.
Orthologous gene (ortholog) is a gene found in different species that evolved from a common ancestral gene through speciation.
Orthology is the relationship between homologous sequences in different species that descended from the same ancestral sequence.
OTU = Operational Taxonomic Unit. This is a DNA sequence that shows a given level of similarity among individuals so that they may be considered belonging to one species.
Paralog (paralogue, paralogous gene) is a gene that is related to another gene in the same organism by descent from a single ancestral gene that was duplicated and that may have a different DNA sequence and biological function.
PCR = Polymerase Chain Reaction.
PDB = Protein Data Bank.
PDBe = Protein Data Bank in Europe.
PDBj = Protein Data Bank Japan.
PDBsum = Pictorial database of 3D structures in the Protein Data Bank.
PDBTM is a transmembrane protein selection of the Protein Data Bank (PDB).
Personal genomics or consumer genetics is the branch of genomics concerned with the sequencing, analysis and interpretation of the genome of an individual.
Pfam is a database of protein families produced by EMBL-EBI.
Pharmacoepigenomics is the study of the roles drugs play in modulating epigenomic modifications.
Pharmacogenetics is the branch of pharmacology concerned with the effect of genetic factors on reactions to drugs.
Pharmacogenomics is the branch of genetics concerned with the way in which an individual’s genetic attributes affect the likely response to therapeutic drugs.
Phenome is the set of phenotypes (physical and biochemical traits) that can be produced by a given organism over the course of development and in response to genetic mutation and environmental influences.
Phenomics is the systematic study of phenotypes.
Phenotype is the observable characteristics or traits of an organism.
PIR = Protein Information Resource.
piRNA = piwi-interacting RNA, a class of small RNAs that is most abundantly expressed in animal germline.
PIR-PSD (Protein Sequence Database) was a protein database produced by PIR, now included in UniProt.
Ploidy is the number of complete sets of chromosomes in a cell, and hence the number of possible alleles for autosomal genes. Diploid means two sets. The generic term polyploid is often used to describe cells with three or more chromosome sets.
PMC = PubMed Central, a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health’s National Library of Medicine (NIH/NLM).
Polygenic inheritance (quantitative inheritance) refers to a single inherited phenotypic trait that is controlled by two or more different genes.
PRINTS is a compendium of protein fingerprints, produced by The University of Manchester.
PROSITE is a database of protein families and domains, produced by SIB.
ProtCID = Protein Common Interface Database, a database of similar protein-protein interfaces in crystal structures of homologous proteins.
Proteome is the entire complement of proteins found in an organism over its entire life cycle, or in a particular cell type at a particular time under defined environmental conditions.
Proteomics is the study of the functions, structures, and interactions of proteins; that is, the study of the proteome.
Proteopedia is a wiki, collaborative 3D-encyclopedia of proteins and other biomolecules.
qPCR = quantitative (real-time) Polymerase Chain Reaction.
RBP = RNA-Binding Protein.
RBPDB (RNA-binding Proteins Database) is a biological database of RNA-binding protein specificities.
RCSB PDB = Research Collaboratory for Structural Bioinformatics Protein Data Bank.
RdRp (RNA-dependent RNA polymerase) or RNA replicase is an enzyme that catalyzes the replication of RNA from an RNA template. Specifically, it catalyzes synthesis of the RNA strand complementary to a given RNA template.
RefSeq (NCBI Reference Sequence Database) is a comprehensive, integrated, non-redundant, well-annotated set of reference sequences including genomic, transcript, and protein.
Reverse transcriptase (RT) is an enzyme used to generate complementary DNA (cDNA) from an RNA template, a process termed reverse transcription.
Reverse transcription is DNA being synthesized using an RNA template.
RNA-induced silencing complex (RISC) is a multiprotein complex, specifically a ribonucleoprotein, which functions in gene silencing via a variety of pathways at the transcriptional and translational levels.
RNA interference (RNAi) is a biological process in which RNA molecules are involved in sequence-specific suppression of gene expression by double-stranded RNA, through translational or transcriptional repression.
RNA replication is RNA being copied from RNA.
RNA-Seq = RNA sequencing.
RT-PCR = Reverse Transcription-Polymerase Chain Reaction.
RT-qPCR = Reverse Transcription-quantitative (real-time) Polymerase Chain Reaction.
saRNAs = small activating RNAs, small double-stranded RNAs (dsRNAs) that target gene promoters to induce transcriptional gene activation in a process known as RNA activation (RNAa).
SCOP = Structural Classification of Proteins, a database.
SCOPe = Structural Classification of Proteins — extended.
scRNA-Seq = single-cell RNA sequencing.
Sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
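One classic way to arrange two sequences so that similar regions line up is global alignment by dynamic programming (the Needleman-Wunsch approach). The Python sketch below is a minimal version under assumed scoring values (match +1, mismatch -1, gap -1); the function name `global_align` and these parameters are illustrative choices, and tools such as BLAST use different, heuristic methods.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences; return (aligned_a, aligned_b, score).

    The scoring values are illustrative defaults, not a recommended scheme.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


if __name__ == "__main__":
    top, bottom, total = global_align("ACGT", "AGT")
    print(top)
    print(bottom)
    print("score:", total)
```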
SIB = Swiss Institute of Bioinformatics.
SIMAP (Similarity Matrix of Proteins) is a database of protein sequence similarities and functional annotations.
siRNA (small interfering RNA), sometimes known as short interfering RNA or silencing RNA, is a class of non-coding double-stranded RNA molecules, typically 20-24 base pairs in length, and operating within the RNA interference (RNAi) pathway.
snoRNAs (small nucleolar RNAs) are a class of small RNA molecules that primarily guide chemical modifications of other RNAs, mainly ribosomal RNAs, transfer RNAs and small nuclear RNAs.
SNP (single-nucleotide polymorphism; plural SNPs, pronounced “snips”) is a germline substitution of a single nucleotide at a specific position in the genome. Certain definitions require the substitution to be present in a sufficiently large fraction of the population (1% or more).
snRNA (small nuclear RNA) is a class of small RNA molecules, approximately 150 nucleotides, whose primary function is in the processing of pre-messenger RNA in the eukaryotic nucleus.
snRNPs (small nuclear ribonucleoproteins), often pronounced “snurps”, are snRNAs associated with a set of specific proteins.
Spatial transcriptomics is a method for assigning cell types (identified by the mRNA readouts) to their locations in the histological sections, and can also be used to determine subcellular localization of mRNA molecules.
Spliceosome is a large ribonucleoprotein (RNP) complex found primarily within the nucleus of eukaryotic cells, assembled from small nuclear ribonucleoproteins (snRNPs). It removes introns from a transcribed pre-mRNA, a process generally referred to as splicing.
SQL = Structured Query Language, a standard language for storing, manipulating and retrieving data in databases.
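The three activities named in the SQL entry (storing, manipulating, and retrieving data) can be illustrated with Python's built-in `sqlite3` module and an in-memory database. The table name `genes`, its columns, and the function name `demo_gene_table` are hypothetical; they do not correspond to the schema of any database listed in this glossary.

```python
import sqlite3


def demo_gene_table():
    """Create a toy gene table, update one row, and query it with SQL."""
    conn = sqlite3.connect(":memory:")  # temporary database held in memory
    # Storing: define a table and insert rows.
    conn.execute(
        "CREATE TABLE genes (symbol TEXT PRIMARY KEY, chromosome TEXT, note TEXT)"
    )
    conn.executemany(
        "INSERT INTO genes (symbol, chromosome, note) VALUES (?, ?, ?)",
        [
            ("BRCA1", "17", ""),
            ("BRCA2", "13", ""),
            ("TP53", "17", ""),
        ],
    )
    # Manipulating: change a value in an existing row.
    conn.execute(
        "UPDATE genes SET note = ? WHERE symbol = ?",
        ("tumor suppressor", "TP53"),
    )
    # Retrieving: select the symbols of genes located on chromosome 17.
    rows = conn.execute(
        "SELECT symbol FROM genes WHERE chromosome = ? ORDER BY symbol",
        ("17",),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(demo_gene_table())
```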
SRA (Sequence Read Archive) is a bioinformatics database that provides a public repository for DNA sequencing data, especially the “short reads” generated by high-throughput sequencing, which are typically less than 1,000 base pairs in length. The archive is part of the International Nucleotide Sequence Database Collaboration (INSDC).
sRNAs = small RNAs.
SUPERFAMILY database is a database and search platform of structural and functional annotation for all proteins and genomes.
SWISS-MODEL is a fully automated protein structure homology-modelling server.
Swiss-Prot was a protein database produced by SIB and EMBL-EBI, now in UniProtKB.
Systems biology is the computational and mathematical analysis and modeling of complex biological systems, a biology-based interdisciplinary field of study that focuses on complex interactions within biological systems.
T2T = Telomere to Telomere Consortium.
Tagmentation is a process, in the analysis of DNA, in which double-stranded DNA is cleaved and tagged. It is the initial step in library preparation.
Telomere (Greek = end part) is a region of repetitive nucleotide sequences associated with specialized proteins at the ends of linear chromosomes. They protect the terminal regions of chromosomal DNA from progressive degradation and ensure the integrity of linear chromosomes by preventing DNA repair systems from mistaking the very ends of the DNA strand for a double strand break.
Transcriptome is the collection of all RNA in a particular cell; sometimes the term refers only to the messenger RNA.
Transcriptomics is the study of an organism’s transcriptome.
Translational medicine is medical research that is concerned with facilitating the practical application of scientific discoveries to the development and implementation of new ways to prevent, diagnose, and treat disease.
Transposase is the enzyme that catalyzes the transposition of a transposon.
Transposome is the set of genetic transpositions (or of the transposases and transposons) in an organism.
Transposons, mobile DNA, or jumping genes are chromosomal segments that can undergo transposition (translocation) in the genome.
TrEMBL (Translated EMBL Nucleotide Sequence Data Library) was a protein database produced by SIB and EMBL-EBI, now in UniProtKB.
tRFs (tRNA-related fragments) are a heterogeneous class of small RNAs generated from tRNA.
UK Biobank is a large-scale biomedical database and research resource, containing in-depth genetic and health information from half a million UK participants.
UniParc = UniProt Archive, a UniProt database.
UniProt = Universal Protein Resource databases.
UniProtKB = UniProt Knowledgebase, the main UniProt database.
UniRef = UniProt Reference Clusters, a UniProt database.
Variant calling is the identification of differences or variants in a sequenced genomic DNA when compared to a reference genome.
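The core idea of comparing a sample against a reference can be shown with a deliberately naive Python sketch. It assumes the sample sequence has already been aligned to the reference and has the same length, and it reports only single-base substitutions. The function name `call_snvs` is hypothetical, and real variant callers work from many aligned reads with base qualities and statistical models rather than from a single sequence.

```python
def call_snvs(reference, sample):
    """List single-nucleotide differences between a sample and a reference.

    Simplifying assumptions: both sequences are already aligned, have equal
    length, and contain no insertions or deletions. Positions are 1-based.
    Returns a list of (position, reference_base, sample_base) tuples.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


if __name__ == "__main__":
    for pos, ref_base, alt_base in call_snvs("ACGTACGT", "ACGAACGT"):
        print(f"position {pos}: {ref_base} > {alt_base}")
```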
VEP (Variant Effect Predictor) is a powerful toolset of Ensembl that predicts the functional effects of genomic variants.
WES = Whole exome sequencing.
WGS = Whole genome sequencing.
WTS = Whole transcriptome sequencing.
wwPDB = Worldwide Protein Data Bank Organization.