High-throughput genome sequencing is changing the way we practice medicine by providing new capabilities to identify the genetic causes of disease. DNA sequencing applications often rely on curated databases for analysis and interpretation of results. During sequencing, however, it is almost impossible to avoid picking up small amounts of DNA that do not belong to the organism of interest. These contaminants come from a variety of sources, including laboratory personnel, the reagents used, and even the samples themselves. When the test samples are human, microbial contaminants can be misinterpreted as infectious agents. Conversely, when the samples are bacterial, human contaminants may have been unwittingly assembled into the reference bacterial genome sequence and thus become a source of misleading matches when those sequences are found in subsequent studies.
A recent study (Sci Rep 2022; 12:9863-9863) illustrates how this happens. Unassembled DNA sequencing “reads” were collected from almost 5,000 people and scanned to identify viruses, bacteria and archaea. Reads that did not match the human genome were collected and analyzed to obtain a picture of the human “contaminome”, which can be divided into three categories (Figure 1A): viral reads associated with the human virome (ie, the collection of all viruses found in humans); bacterial or viral reads introduced by sample collection (eg, normal microbiota at the sample site) and handling, propagation of cell lines, or the laboratory reagents and kits used for sequencing (ie, experimental contaminants); and bacterial reads that are misassigned because of human sequence contamination in bacterial genome databases (ie, computational contamination).
Figure 1: Understanding the human contaminome
The danger of the third category is that it can lead to false associations between microbes and disease. The study illustrates this with an intriguing finding: after identifying all the bacterial reads present in almost 5,000 samples of human DNA, the authors found more than 50 bacteria that were significantly more common in men than in women (gender was treated as binary). Rather than jump to the conclusion that these results reflect actual bacterial infections that are more common in men, the authors asked what would happen if the bacterial genomes were contaminated with fragments of the human Y chromosome (Figure 1B). In that case, reads derived from the Y chromosome would match (incorrectly) the bacterial genomes. Supporting this hypothesis was the status of the Y-chromosome sequence, which as of earlier this year was still incomplete. From the reads that matched bacterial genomes, the authors identified 77,647 short DNA sequences that were significantly more common among males.
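The kind of screen that surfaces a male-enriched "bacterium" can be sketched in a few lines. The example below is only an illustration and is not taken from the study: it assumes a one-sided Fisher exact test with a Bonferroni correction, and the function names, taxon labels and counts are all hypothetical.

```python
from math import comb


def fisher_exact_greater(male_pos, male_total, female_pos, female_total):
    """One-sided Fisher exact p-value that a taxon is enriched in males.

    Assumed test for illustration; the study's own statistics may differ.
    """
    total = male_total + female_total
    positives = male_pos + female_pos
    denom = comb(total, male_total)
    upper = min(positives, male_total)
    # Sum hypergeometric probabilities for outcomes at least as extreme
    # as the observed number of positive male samples.
    numer = sum(
        comb(positives, k) * comb(total - positives, male_total - k)
        for k in range(male_pos, upper + 1)
    )
    return numer / denom


def male_enriched_taxa(counts, n_male, n_female, alpha=0.05):
    """Return taxa whose p-value passes a Bonferroni-corrected threshold.

    counts maps a (hypothetical) taxon name to a pair
    (male samples with the taxon, female samples with the taxon).
    """
    if not counts:
        return []
    threshold = alpha / len(counts)
    return [
        taxon
        for taxon, (m, f) in counts.items()
        if fisher_exact_greater(m, n_male, f, n_female) < threshold
    ]


# Hypothetical counts: taxon_A is seen almost only in male samples.
example_counts = {"taxon_A": (40, 2), "taxon_B": (20, 21)}
flagged = male_enriched_taxa(example_counts, n_male=50, n_female=50)
```

A taxon flagged this way is a candidate for follow-up, not evidence of infection: as the Y-chromosome hypothesis shows, the enrichment may come from the reference database rather than from the patient.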
Fortunately, a complete sequence of the Y chromosome was published this year. The 77,647 “bacterial” sequences identified by the study were aligned to this Y chromosome sequence, and 73,691 of them (95%) were found to match, indicating that these sequences were in fact human and confirming the hypothesis. This result highlights the need for caution in interpreting the results of large DNA sequencing projects and raises the question of whether reported associations between certain types of microbes and cancer, blood or autoimmune diseases still hold. Could some of these be artifacts of computational contamination?
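The matching step and the reported percentage can be illustrated with a toy example. This is a sketch under strong assumptions: it uses exact substring matching on either strand, whereas a real analysis would use a sequence aligner that tolerates mismatches, and the reference and query strings here are invented.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def fraction_matching(sequences, reference):
    """Fraction of sequences found exactly in the reference, on either strand.

    Exact matching is a simplification standing in for real alignment.
    """
    if not sequences:
        return 0.0
    hits = sum(
        1
        for s in sequences
        if s in reference or reverse_complement(s) in reference
    )
    return hits / len(sequences)


# Invented toy data: two of the three queries occur in the reference,
# one of them only as a reverse complement.
toy_reference = "ACGTTGCAAGGCTA"
toy_queries = ["GTTGCA", "TAGCCT", "GGGGGG"]
toy_fraction = fraction_matching(toy_queries, toy_reference)

# The figure quoted in the text: 73,691 of 77,647 sequences matched.
reported_fraction = 73691 / 77647  # about 0.949, which rounds to 95%
```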
The problem of human sequence contamination in bacterial genomes extends beyond the Y chromosome. More than 3,000 microbial genomes have been reported to contain small human fragments. This complicates the clinical use of microbial genomes in the diagnosis of infectious diseases, where sequencing of patient samples is the basis for pathogen identification. This method involves comparing the DNA or RNA sequences from a patient sample with those of all known microbial genomes (viruses, bacteria, fungi and parasites) to identify the cause of the infection. In this context, distinguishing microbial reads that come from the true pathogen from those that come from contaminants is essential to avoid misdiagnosis. Despite the best efforts of researchers, computational contaminants can affect even the most robust databases.
In addition to computational contaminants, experimental contaminants can be difficult to discern, especially in low-biomass samples (which typically contain only a small amount of microbial DNA relative to host DNA), such as blood and cerebrospinal fluid. Some of the most common experimental contaminants are known pathogens, including Staphylococcus, Pseudomonas, and Mycobacterium species, to name a few. Fortunately, well-designed experimental controls can be applied to detect them, provided that the contaminants are known to exist.
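One common form of experimental control is a negative ("no-template") control that is processed alongside the patient sample. The sketch below shows one simple way such a control might be used; the relative-abundance rule, the `min_ratio` threshold, the function name and the counts are all assumptions made for illustration rather than an established protocol.

```python
def flag_likely_contaminants(sample_counts, control_counts, min_ratio=10.0):
    """Flag taxa whose relative abundance in the sample is not clearly
    higher than in the negative control.

    min_ratio is an arbitrary illustrative threshold, not a standard value.
    """
    sample_total = sum(sample_counts.values())
    control_total = sum(control_counts.values())
    if sample_total == 0 or control_total == 0:
        return []
    flagged = []
    for taxon, n in sample_counts.items():
        c = control_counts.get(taxon, 0)
        if c == 0:
            continue  # never seen in the control, so not flagged here
        sample_frac = n / sample_total
        control_frac = c / control_total
        if sample_frac < min_ratio * control_frac:
            flagged.append(taxon)
    return flagged


# Hypothetical read counts per taxon.
patient_sample = {"Pseudomonas": 50, "taxon_P": 900, "Staphylococcus": 50}
negative_control = {"Pseudomonas": 40, "taxon_X": 60}
suspects = flag_likely_contaminants(patient_sample, negative_control)
```

In this toy data only Pseudomonas is flagged, because it makes up a larger share of the control than of the patient sample. A taxon that never appears in the control is not flagged, which reflects the limitation noted above: a control can only reveal contaminants that it happens to capture.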
The reported results illustrate how every human genome sequencing project captures a variety of life forms, including DNA sequences from bacteria and viruses, and how spurious associations between infectious diseases and traits (such as gender) can arise. This work highlights the importance of complete and accurate genome sequences for avoiding computational contamination of reference sequences and improving diagnostic accuracy. It also highlights the need for standard protocols to identify the “contaminome” to ensure the fidelity of sequencing-based diagnostics and testing.
The human “contaminome” and understanding infectious diseases
Patricia J. Simner, Ph.D., and Steven L. Salzberg, Ph.D.
Department of Pathology, Division of Medical Microbiology (PJS), Department of Medicine, Division of Infectious Diseases (PJS) and Department of Biomedical Engineering (SLS), Johns Hopkins School of Medicine, Department of Computer Science and Center for Computational Biology, Whiting School of Engineering (SLS) and the Department of Biostatistics, Bloomberg School of Public Health (SLS), Johns Hopkins University, Baltimore.
N Engl J Med 2022; 387:943-946