Journal Search Engine

ISSN : 1226-7155(Print)
ISSN : 2287-6618(Online)

International Journal of Oral Biology Vol.46 No.3 pp.146-153
DOI : https://doi.org/10.11620/IJOB.2021.46.3.146

Trimming conditions for DADA2 analysis in QIIME2 platform

Seo-Young Lee¹, Yeuni Yu¹, Jin Chung², Hee Sam Na²*

¹Interdisplinary Program of Genomic Science, Pusan National University, Yangsan 50612, Republic of Korea
²Department of Oral Microbiology, School of Dentistry, Pusan National University, Yangsan 50612, Republic of Korea

^*Correspondence to: Hee Sam Na, E-mail: heesamy@pusan.ac.kr

Received August 27, 2021 Review September 9, 2021 Accepted September 9, 2021

Abstract

Accurate identification of microbes facilitates the prediction, prevention, and treatment of human diseases. To increase the accuracy of microbiome data analysis, a long region of the 16S rRNA is commonly sequenced via paired-end sequencing. In paired-end sequencing, a sufficient length of overlapping region is required for effective joining of the reads, and high-quality sequencing reads are needed at the overlapping region. Trimming sequences at the reads distal to a point where sequencing quality drops below a specific threshold enhance the joining process. In this study, we examined the effect of trimming conditions on the number of reads that remained after quality control and chimera removal in the Illumina paired-end reads of the V3–V4 hypervariable region. We also examined the alpha diversity and taxa assigned by each trimming condition. Optimum quality trimming increased the number of good reads and assigned more number of operational taxonomy units. The pre-analysis trimming step has a great influence on further microbiome analysis, and optimized trimming conditions should be applied for Divisive Amplicon Denoising Algorithm 2 analysis in QIIME2 platform.

Key Words : Microbiota , Electronic data processing , 16S ribosomal RNA

초록

키워드 :

This article has been cited by 0 article in crossref

Cited-By

Funding:

This is an open-access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/bync/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction

Microbial community structure can provide valuable insights not only into the workings of natural ecosystems, but increasingly into the relationship between the human host and its bacterial colonizers [1]. Rapid progress in DNA sequencing technology has provided ever-increasing outputs coupled with lowered costs, facilitating an explosion in sequencing studies [2].

Next-generation sequencing (NGS) can produce a vast number of sequences at unprecedented speed. The amplicon analysis of the 16S rRNA genes is commonly used method for microbiome studies, which has been used in big projects, such as the Human Microbiome Project [3]. The 16S rRNA gene is used as it is highly conserved among microorganisms and contains hypervariable regions that have sufficient variation to allow distinction at individual taxonomic levels [4]. The basic idea is to amplify a selected region of 16S rRNA gene using the polymerase chain reaction, sequence the amplified product, and compare the sequence with a reference database [5].

There are several factors that influences the accuracy of the results of amplicon analysis of the 16S rRNA genes such as the sequenced target region [6], the sequencing technology [7], and the reference database used for the analysis [8]. Among various strategies to increase the accuracy, sequencing a longer region of the 16S rRNA gene is most commonly used ap- proach as longer reads provide more information for taxonomy assignments [9,10]. To obtain sufficient length of sequencing reads, paired-end sequencing approach is frequently used [11]. In paired-end sequencing, sufficient length of overlapping region is required for the reads to be effectively joined and. high-quality sequencing reads are required at the overlapping region. However, one of the major problem with sequencing is the increased error rate along the distal ends [12], which adversely affects the joining step, leading to the failure of the joining. To reduce this problem, sequences are trimmed at the reads distal to a point where sequencing quality drops below a specific threshold.

QIIME is a pipeline for microbiome analysis, which is now the standard for microbiome analysis for the analysis of Illumina paired-end reads [13]. In QIIME, there are several options for microbiome analysis including DADA2 (Divisive Amplicon Denoising Algorithm 2), Vsearch and Deblur. DADA2 can perform denoising, and dereplication, of paired-end sequences including chimera removal and trimming of reads based on positional quality scores [14].

Although running DADA2 is relatively easy compared to other more complex NGS data, there are limited studies that reports optimum trimming condition for efficient microbiome analysis [15]. In this study, we studied the effect of trimming conditions in QIIME2 platform on microbiome analysis for oral microbiome study.

Materials and Methods

1. Sample data

Sequencing data set consisted of 90 human oral samples, downloaded from the SRA database (PRJNA625671) (http:// www.ncbi.nlm.nih.gov/sra). These samples were sequenced using Illumina paired-end technology, with a length of 300 bases for each read pair and targeted V3–V4 hypervariable region.

2. QIIME analysis

For QIIME2 analysis, DADA2 was applied to denoise, joined the paired reads, and removed the chimera. DADA2 [14] was originally an R package that is wrapped by the QIIME2 plugin. The first 5 bases from the forward and reverse reads were truncated. Since the length of the predicted targeted region is around 460 bps [16], the total length of forward and reverse reads was trimmed to consist minimum of 470 bps. For forward reads, bases between 240 bps and 290 bps were trimmed. For reverse reads, bases between 180 bps to 240 bps were trimmed. Default values were applied for other parameters.

Depending on trimming conditions, sequence read counts during the process as well as alpha diversity and taxa assigned by each condition was analyzed. For taxa assignment, eHOMD (V15.2 2020.8.29) reference database was used [17].

Results

1. Sequencing quality score and assembled sequences counts during pre-processing

For pre-processing of raw sequencing reads prior to the subsequent analysis, sequencing quality check was performed (Fig. 1). For both forward (F) and reverse (R) reads, there was a decrease in quality near the distal ends. Since the presence of low-quality bases towards the distal end of the sequence adversely affects the joining step, trimming was conducted between 240 to 290 bps in forward reads and between 180 to 250 bps in reverse reads.

During DADA2 analysis, filtered and denoised read count was maximum when each forward reads were joined with shortest reverse reads. For merged and non-chimeric read counts, F240/R240, F250/R230, F260/R220, F270/R210, F280/R200 and F290/R190 showed maximum read counts for each forward read, respectively. Among various trimming conditions, F270/R210 produced maximum read counts (Fig. 2).

2. Alpha diversity and taxonomic assignment

Next, the joined sequences were assigned to determine the taxonomy. When richness and evenness of the microbiome diversity were determined by Chao1 and Shannon index, conditions that produced most non-chimeric read counts also showed the highest Chao1 (Fig. 3A) and Shannon index (Fig. 3B).

To determine if trimming conditions has effect on taxa assignment, relative abundance of top 18 species were plotted (Fig. 4). When conditions with maximum non-chimeric read counts for forward reads was compared, the overall relative abundance showed similar results.

3. Trimming F270 in detail

Since forward reads trimmed at 270 bps produced most non-chimeric read counts, F270 was further trimmed with various reverse reads between 200 bps and 216 bps. When read counts during DADA2 analysis was calculated, F270/R200 passed most read counts after filtered and denoised process. However, there was dramatic read count loss at merged and non-chimeric read counts in through F270/R200 to F270/ R206. Interestingly, there was abrupt merged and non-chimeric read count increase in F270/R208 (Fig. 5A). Also, when alpha diversity was determined, there was significant difference between F270/R206 and F270/R208 (Fig. 5B).

Finally, taxa assigned at this condition was determined. The relative abundance of 15 most abundant species was plotted. There was significant difference between F270/R206 and F270/R208. Although F270/R208 produced most non-chimeric read counts, the relative abundance of some species was different from F270/R210, F270/R212 and F270/R216, which showed similar result within (Fig. 6).

Discussion

Recently, it that been noted that many common diseases are highly associated with the microbiome [18,19]. Although a significant portion (≈30%) of oral microbial species remain uncultivable (http://www.homd.org), oral microbiome is well characterized compared to other microbiome research areas [20,21]. Hundreds of microbial species reside in the human oral cavity, and composition of oral microorganisms differs across individuals [22]. The individual oral microbiomes could facilitate the prediction, prevention, and treatment of oral diseases, and the maintenance of oral health [23].

Analysis of the 16S rRNA genes is the most commonly used method for microbiome studies [24]. As longer reads provide more information for taxonomy assignments [9,10], pairedend sequencing approach is commonly used [25]. For effective joining, high-quality sequencing reads are required at the joining region. However, one of the major problem with sequencing is the increased error rate along the distal ends [12]. To reduce the failure of joining due to low-sequencing quality, sequences are trimmed at the reads distal to a point where sequencing quality drops below a specific threshold [15].

In this study, we studied the effect of trimming conditions in QIIME2 platform on microbiome analysis for oral microbiome study. First, prior to applying DADA2 pipeline in QIIME2, sequencing quality check was performed. Phred scoring has become a standard that base calling software output an error probability, in the form of a quality score, for each base call [26]. For Phred quality score 10, 15 and 20, the probability of incorrect base call are 10%, 3.15% and 1%, respectively. A gradual decrease in quality near the distal ends was observed in both forward and reverse reads. At forward 240 bps, Phred score for 25 percentile which is represented as bottom box was near 20. For forward 290 bps, the 25-percentile score was less than 10. For reverse reads, the sequencing quality was significantly lower compared to the forward read. For reverse 180 bps and 220 bps, the 25-percentile score were 20 and less than 10, respectively. Since the target region is around 460 bps [16] and minimum of 10 bps overlapping is required for the joining, trimming was conducted between 240 to 290 bps in forward reads and between 180 to 250 bps in reverse reads considering the quality score.

Next, reads counts that passed each pre-processing step was calculated. The filtered and denoised read count was maximum when each forward reads were joined with shortest reverse reads. During this step, the reads dependent on the individual single-read sequencing result. For forward reads, although F240 has the highest sequencing quality, the 25-percentile is above 15 for F270. The filtered and denoised read counts reached maximum at F270 group. For reverse reads, the shortest reverse reads which as the highest sequencing quality showed most read counts. Interestingly, the read count was similar when 25-percentile quality score reached near 20 such as R180 and R190. Thus, individual single-read sequencing quality mostly have affected the read counts during filtering and denoising step.

For merged and non-chimeric read counts, F240/R240, F250/R230, F260/R220, F270/R210, F280/R200 and F290/ R190 showed maximum read counts for each forward read, respectively. The predicted target size for V3-V4 region is 464 bps [16] and minimum overlapping conditions only provided 6 bps to overlap. Thus, although shorter reverse reads have higher sequencing quality, the merged read count did not show maximum read counts. For forward reads, F270 produced the maximum merged read count suggesting 25-percentile score around 20 can be considered as the threshold for forward read trimming site. Taken together, maximum merged read count was observed when the overlapping region was 16 bps and 25-percentile forward read sequencing quality was near 20.

When the joined sequences were assigned to determine the taxonomy, trimming conditions that produced most nonchimeric read counts also showed the highest alpha diversity. To determine if trimming conditions has effect on taxa assignment, the overall relative abundance showed similar results. Although, relatively abundant species showed similar relative proportion, assigning more read counts can provide more chance to determine minor species that can contribute to disease development or can serve as biomarker for the disease.

Finally, detailed trimming conditions was studied focusing on F270. At preprocessing step, there was dramatic merge read count increase between F270/R206 and F270/R208. Since there is very small difference in sequencing quality, the overlapping length could be the main reason for this increase. The overlapping bps is 12 bps for F270/R206 and 14 bps for F270/ R208. When taxa assigned at this condition was determined, there was significant difference between F270/R206 and F270/R208. Although F270/R208 produced most non-chimeric read counts, this specific condition is very hard to capture in practice. The relative abundance reached similar among F270/ R210, F270/R212 and F270/R216. Taken together, the recommended trimming site for forward reads is the maximum length where 25-percentile sequencing quality is near 20 and minimum overlapping length is more than 16 bps.

In conclusion, we compared the trimming conditions for microbiomes analysis using DADA2 in QIIME2 for oral samples. Our result show that trimming site for forward reads is where 25-percentile sequencing quality score of 20 is maintained and minimum overlapping length should be more than 16 bps.

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT, NRF-2017R1D1A1B03028710).

Figure

Fig. 1.

Representative box plot of quality scores over positions in sequenced reads generated using a random sampling of 10,000 out of 9,162,071 sequences without replacement. Quality distributions are shown for (A) forward reads and (B) reverse reads. The upper whisker, top of box, middle box, bottom box, and lower whisker represent 91th, 75th, 50th, 25th, and 9th percentile quality score, respectively.

Fig. 2.

Effect of trimming conditions on pre-processing. Average number of read counts that passed during filtered, denoised, merged and chimeric remove process.

Fig. 3.

Effect of trimming conditions on microbiome alpha diversity. Microbial richness and evenness was determined by Chao1 (A) and Shannon index (B).

Fig. 4.

Effect of trimming conditions on relative microbial abundance at species level. Average relative abundance of 18 most abundant species were plotted.

Fig. 5.

Effect of trimming forward 270 and various reverse read conditions. (A) Average number of read counts that passed during filtered, denoised, merged and chimeric remove process. (B) Microbial richness was determined Chao1 index.

Fig. 6.

Effect of trimming forward 270 and various reverse read conditions on relative microbial abundance at species level. Average relative abundance of 15 most abundant species were plotted.

Table

Reference

Greenhalgh K, Meyer KM, Aagaard KM, Wilmes P. The human gut microbiome in health: establishment and resilience of microbiota over a lifetime. Environ Microbiol 2016;18:2103-16.
Panek M, Čipčić Paljetak H, Barešić A, Perić M, Matijašić M, Lojkić I, Vranešić Bender D, Krznarić Ž, Verbanac D. Methodology challenges in studying human gut microbiota – effects of collection, storage, DNA extraction and next generation sequencing technologies. Sci Rep 2018;8:5143.
NIH HMP Working Group, Peterson J, Garges S, Giovanni M, McInnes P, Wang L, Schloss JA, Bonazzi V, McEwen JE, Wetterstrand KA, Deal C, Baker CC, Di Francesco V, Howcroft TK, Karp RW, Lunsford RD, Wellington CR, Belachew T, Wright M, Giblin C, David H, Mills M, Salomon R, Mullins C, Akolkar B, Begg L, Davis C, Grandison L, Humble M, Khalsa J, Little AR, Peavy H, Pontzer C, Portnoy M, Sayre MH, Starke- Reed P, Zakhari S, Read J, Watson B, Guyer M. The NIH Human Microbiome Project. Genome Res 2009;19:2317-23.
Chakravorty S, Helb D, Burday M, Connell N, Alland D. A detailed analysis of 16S ribosomal RNA gene segments for the diagnosis of pathogenic bacteria. J Microbiol Methods 2007; 69:330-9.
Clarridge JE 3rd. Impact of 16S rRNA gene sequence analysis for identification of bacteria on clinical microbiology and infectious diseases. Clin Microbiol Rev 2004;17:840-62, table of contents.
Bukin YS, Galachyants YP, Morozov IV, Bukin SV, Zakharenko AS, Zemskaya TI. The effect of 16S rRNA region choice on bacterial community metabarcoding results. Sci Data 2019;6: 190007.
Liu L, Li Y, Li S, Hu N, He Y, Pong R, Lin D, Lu L, Law M. Comparison of next-generation sequencing systems. J Biomed Biotechnol 2012;2012:251364.
Balvočiūtė M, Huson DH. SILVA, RDP, Greengenes, NCBI and OTT - how do these taxonomies compare? BMC Genomics 2017;18(Suppl 2):114.
Johnson JS, Spakowicz DJ, Hong BY, Petersen LM, Demkowicz P, Chen L, Leopold SR, Hanson BM, Agresta HO, Gerstein M, Sodergren E, Weinstock GM. Evaluation of 16S rRNA gene sequencing for species and strain-level microbiome analysis. Nat Commun 2019;10:5029.
Jeong J, Yun K, Mun S, Chung WH, Choi SY, Nam YD, Lim MY, Hong CP, Park C, Ahn YJ, Han K. The effect of taxonomic classification by full-length 16S rRNA sequencing with a synthetic long-read technology. Sci Rep 2021;11:1727.
Allaband C, McDonald D, Vázquez-Baeza Y, Minich JJ, Tripathi A, Brenner DA, Loomba R, Smarr L, Sandborn WJ, Schnabl B, Dorrestein P, Zarrinpar A, Knight R. Microbiome 101: studying, analyzing, and interpreting gut microbiome data for clinicians. Clin Gastroenterol Hepatol 2019;17:218-30.
Cox MP, Peterson DA, Biggs PJ. SolexaQA: at-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 2010;11:485.
Hall M, Beiko RG. 16S rRNA gene analysis with QIIME2. Methods Mol Biol 2018;1849:113-29.
Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJ, Holmes SP. DADA2: high-resolution sample inference from Illumina amplicon data. Nat Methods 2016;13:581-3.
Mohsen A, Park J, Chen YA, Kawashima H, Mizuguchi K. Impact of quality trimming on the efficiency of reads joining and diversity analysis of Illumina paired-end reads in the context of QIIME1 and QIIME2 microbiome analysis frameworks. BMC Bioinformatics 2019;20:581.
Klindworth A, Pruesse E, Schweer T, Peplies J, Quast C, Horn M, Glöckner FO. Evaluation of general 16S ribosomal RNA gene PCR primers for classical and next-generation sequencing- based diversity studies. Nucleic Acids Res 2013;41:e1.
Dewhirst FE, Chen T, Izard J, Paster BJ, Tanner AC, Yu WH, Lakshmanan A, Wade WG. The human oral microbiome. J Bacteriol 2010;192:5002-17.
Wei LQ, Cheong IH, Yang GH, Li XG, Kozlakidis Z, Ding L, Liu NN, Wang H. The application of high-throughput technologies for the study of microbiome and cancer. Front Genet 2021;12: 699793.
Lynch SV, Pedersen O. The human intestinal microbiome in health and disease. N Engl J Med 2016;375:2369-79.
Lloyd-Price J, Mahurkar A, Rahnavard G, Crabtree J, Orvis J, Hall AB, Brady A, Creasy HH, McCracken C, Giglio MG, McDonald D, Franzosa EA, Knight R, White O, Huttenhower C. Strains, functions and dynamics in the expanded Human Microbiome Project. Nature 2017;550:61-6.
Gao L, Xu T, Huang G, Jiang S, Gu Y, Chen F. Oral microbiomes: more and more importance in oral cavity and whole body. Protein Cell 2018;9:488-500.
Wade WG. The oral microbiome in health and disease. Pharmacol Res 2013;69:137-43.
Krishnan K, Chen T, Paster BJ. A practical guide to the oral microbiome and its relation to health and disease. Oral Dis 2017;23:276-86.
Nash AK, Auchtung TA, Wong MC, Smith DP, Gesell JR, Ross MC, Stewart CJ, Metcalf GA, Muzny DM, Gibbs RA, Ajami NJ, Petrosino JF. The gut mycobiome of the Human Microbiome Project healthy cohort. Microbiome 2017;5:153.
Na HS, Kim SY, Han H, Kim HJ, Lee JY, Lee JH, Chung J. Identification of potential oral microbial biomarkers for the diagnosis of periodontitis. J Clin Med 2020;9:1549.
Ewing B, Green P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 1998;8: 186-94.