Szamosi Jake C., Forbes Jessica D., Copeland Julia K., Knox Natalie C., Shekarriz Shahrokh, Rossi Laura, Graham Morag, Bonner Christine, Guttman David S., Van Domselaar Gary, Surette Michael G., Bernstein Charles N.
Front. Microbiol., 21 August 2020. Sec. Systems Microbiology, Volume 11 – 2020. doi: 10.3389/fmicb.2020.02028
PMID: 32973734
Background: In studies evaluating the microbiome, numerous factors can contribute to technical variability. These factors include DNA extraction methodology, sequencing protocols, and data analysis strategies. We sought to evaluate the impact these factors have on the results obtained when the sequence data are independently generated and analyzed by different laboratories.
Methods: To evaluate the effect of technical variability, we used human intestinal biopsy samples resected from individuals diagnosed with an inflammatory bowel disease (IBD), including Crohn’s disease (n = 12) and ulcerative colitis (n = 10), and those without IBD (n = 10). Matched samples from each participant were sent to three laboratories and studied using independent protocols for DNA extraction, library preparation, targeted-amplicon sequencing of a 16S rRNA gene hypervariable region, and processing of sequence data. We looked at two measures of interest – Bray–Curtis PERMANOVA R2 values and log2 fold-change estimates of the 25 most-abundant taxa – to assess variation in the results produced by each laboratory, as well as the relative contribution to variation from the different extraction, sequencing, and analysis steps used to generate these measures.
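For readers unfamiliar with the two measures named above, the following is a minimal Python sketch of how they can be computed. It is an illustration under stated assumptions, not the pipeline used by any of the three laboratories: the function names (`bray_curtis`, `permanova_r2`, `log2_fold_change`) and the pseudocount are hypothetical, the permutation test that gives PERMANOVA its p-value is omitted, and the fold-change shown is a simple ratio of group means rather than a model-based estimate.

```python
from itertools import combinations
from math import log2


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # two empty samples are treated as identical
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def permanova_r2(samples, groups):
    """Share of total variation explained by grouping: SS_between / SS_total.

    Sums of squares are computed from squared pairwise Bray-Curtis
    distances. No permutation test is performed here.
    """
    n = len(samples)
    d2 = {(i, j): bray_curtis(samples[i], samples[j]) ** 2
          for i, j in combinations(range(n), 2)}
    ss_total = sum(d2.values()) / n
    if ss_total == 0:
        raise ValueError("all samples are identical; R2 is undefined")
    ss_within = 0.0
    for g in set(groups):
        members = [i for i in range(n) if groups[i] == g]
        ss_within += sum(d2[pair] for pair in combinations(members, 2)) / len(members)
    return (ss_total - ss_within) / ss_total


def log2_fold_change(case, control, pseudocount=1.0):
    """Simplified log2 ratio of mean abundances (pseudocount is an assumption)."""
    mean_case = sum(case) / len(case)
    mean_control = sum(control) / len(control)
    return log2((mean_case + pseudocount) / (mean_control + pseudocount))


# Toy data: two taxa, four samples, two diagnosis groups (made-up values).
toy_samples = [[10, 0], [10, 0], [0, 10], [0, 10]]
toy_groups = ["IBD", "IBD", "non-IBD", "non-IBD"]
r2 = permanova_r2(toy_samples, toy_groups)
```

In this toy case the grouping separates the samples perfectly, so the grouping accounts for all of the variation; in real data, R2 values for diagnosis are typically far smaller.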
Results: The R2 values and estimated differential abundance associated with diagnosis were consistent across datasets that used different DNA extraction and sequencing protocols, and within datasets that pooled samples from multiple protocols; however, variability in bioinformatic processing of sequence data led to changes in R2 values and inconsistencies in taxonomic assignment and abundance estimates.
Conclusion: Although the contribution of DNA extraction and sequencing methods to variability was observable, we find that results can be robust to the various extraction and sequencing approaches used in our study. Differences in data processing methods have a larger impact on results, making comparison among studies less reliable and the combined analysis of bioinformatically processed samples nearly impossible. Our results highlight the importance of making raw sequence data available to facilitate combined and comparative analyses of published studies using common data processing protocols. Study methodologies should provide detailed data processing methods for validation, interpretability, reproducibility, and comparability.