Introduction
Microbial community structure can provide valuable insights not only into the workings of natural ecosystems, but increasingly into the relationship between the human host and its bacterial colonizers [1]. Rapid progress in DNA sequencing technology has provided ever-increasing outputs coupled with lowered costs, facilitating an explosion in sequencing studies [2].
Next-generation sequencing (NGS) can produce a vast number of sequences at unprecedented speed. The amplicon analysis of the 16S rRNA genes is commonly used method for microbiome studies, which has been used in big projects, such as the Human Microbiome Project [3]. The 16S rRNA gene is used as it is highly conserved among microorganisms and contains hypervariable regions that have sufficient variation to allow distinction at individual taxonomic levels [4]. The basic idea is to amplify a selected region of 16S rRNA gene using the polymerase chain reaction, sequence the amplified product, and compare the sequence with a reference database [5].
There are several factors that influences the accuracy of the results of amplicon analysis of the 16S rRNA genes such as the sequenced target region [6], the sequencing technology [7], and the reference database used for the analysis [8]. Among various strategies to increase the accuracy, sequencing a longer region of the 16S rRNA gene is most commonly used ap- proach as longer reads provide more information for taxonomy assignments [9,10]. To obtain sufficient length of sequencing reads, paired-end sequencing approach is frequently used [11]. In paired-end sequencing, sufficient length of overlapping region is required for the reads to be effectively joined and. high-quality sequencing reads are required at the overlapping region. However, one of the major problem with sequencing is the increased error rate along the distal ends [12], which adversely affects the joining step, leading to the failure of the joining. To reduce this problem, sequences are trimmed at the reads distal to a point where sequencing quality drops below a specific threshold.
QIIME is a pipeline for microbiome analysis, which is now the standard for microbiome analysis for the analysis of Illumina paired-end reads [13]. In QIIME, there are several options for microbiome analysis including DADA2 (Divisive Amplicon Denoising Algorithm 2), Vsearch and Deblur. DADA2 can perform denoising, and dereplication, of paired-end sequences including chimera removal and trimming of reads based on positional quality scores [14].
Although running DADA2 is relatively easy compared to other more complex NGS data, there are limited studies that reports optimum trimming condition for efficient microbiome analysis [15]. In this study, we studied the effect of trimming conditions in QIIME2 platform on microbiome analysis for oral microbiome study.
Materials and Methods
1. Sample data
Sequencing data set consisted of 90 human oral samples, downloaded from the SRA database (PRJNA625671) (http:// www.ncbi.nlm.nih.gov/sra). These samples were sequenced using Illumina paired-end technology, with a length of 300 bases for each read pair and targeted V3–V4 hypervariable region.
2. QIIME analysis
For QIIME2 analysis, DADA2 was applied to denoise, joined the paired reads, and removed the chimera. DADA2 [14] was originally an R package that is wrapped by the QIIME2 plugin. The first 5 bases from the forward and reverse reads were truncated. Since the length of the predicted targeted region is around 460 bps [16], the total length of forward and reverse reads was trimmed to consist minimum of 470 bps. For forward reads, bases between 240 bps and 290 bps were trimmed. For reverse reads, bases between 180 bps to 240 bps were trimmed. Default values were applied for other parameters.
Depending on trimming conditions, sequence read counts during the process as well as alpha diversity and taxa assigned by each condition was analyzed. For taxa assignment, eHOMD (V15.2 2020.8.29) reference database was used [17].
Results
1. Sequencing quality score and assembled sequences counts during pre-processing
For pre-processing of raw sequencing reads prior to the subsequent analysis, sequencing quality check was performed (Fig. 1). For both forward (F) and reverse (R) reads, there was a decrease in quality near the distal ends. Since the presence of low-quality bases towards the distal end of the sequence adversely affects the joining step, trimming was conducted between 240 to 290 bps in forward reads and between 180 to 250 bps in reverse reads.
During DADA2 analysis, filtered and denoised read count was maximum when each forward reads were joined with shortest reverse reads. For merged and non-chimeric read counts, F240/R240, F250/R230, F260/R220, F270/R210, F280/R200 and F290/R190 showed maximum read counts for each forward read, respectively. Among various trimming conditions, F270/R210 produced maximum read counts (Fig. 2).
2. Alpha diversity and taxonomic assignment
Next, the joined sequences were assigned to determine the taxonomy. When richness and evenness of the microbiome diversity were determined by Chao1 and Shannon index, conditions that produced most non-chimeric read counts also showed the highest Chao1 (Fig. 3A) and Shannon index (Fig. 3B).
To determine if trimming conditions has effect on taxa assignment, relative abundance of top 18 species were plotted (Fig. 4). When conditions with maximum non-chimeric read counts for forward reads was compared, the overall relative abundance showed similar results.
3. Trimming F270 in detail
Since forward reads trimmed at 270 bps produced most non-chimeric read counts, F270 was further trimmed with various reverse reads between 200 bps and 216 bps. When read counts during DADA2 analysis was calculated, F270/R200 passed most read counts after filtered and denoised process. However, there was dramatic read count loss at merged and non-chimeric read counts in through F270/R200 to F270/ R206. Interestingly, there was abrupt merged and non-chimeric read count increase in F270/R208 (Fig. 5A). Also, when alpha diversity was determined, there was significant difference between F270/R206 and F270/R208 (Fig. 5B).
Finally, taxa assigned at this condition was determined. The relative abundance of 15 most abundant species was plotted. There was significant difference between F270/R206 and F270/R208. Although F270/R208 produced most non-chimeric read counts, the relative abundance of some species was different from F270/R210, F270/R212 and F270/R216, which showed similar result within (Fig. 6).
Discussion
Recently, it that been noted that many common diseases are highly associated with the microbiome [18,19]. Although a significant portion (≈30%) of oral microbial species remain uncultivable (http://www.homd.org), oral microbiome is well characterized compared to other microbiome research areas [20,21]. Hundreds of microbial species reside in the human oral cavity, and composition of oral microorganisms differs across individuals [22]. The individual oral microbiomes could facilitate the prediction, prevention, and treatment of oral diseases, and the maintenance of oral health [23].
Analysis of the 16S rRNA genes is the most commonly used method for microbiome studies [24]. As longer reads provide more information for taxonomy assignments [9,10], pairedend sequencing approach is commonly used [25]. For effective joining, high-quality sequencing reads are required at the joining region. However, one of the major problem with sequencing is the increased error rate along the distal ends [12]. To reduce the failure of joining due to low-sequencing quality, sequences are trimmed at the reads distal to a point where sequencing quality drops below a specific threshold [15].
In this study, we studied the effect of trimming conditions in QIIME2 platform on microbiome analysis for oral microbiome study. First, prior to applying DADA2 pipeline in QIIME2, sequencing quality check was performed. Phred scoring has become a standard that base calling software output an error probability, in the form of a quality score, for each base call [26]. For Phred quality score 10, 15 and 20, the probability of incorrect base call are 10%, 3.15% and 1%, respectively. A gradual decrease in quality near the distal ends was observed in both forward and reverse reads. At forward 240 bps, Phred score for 25 percentile which is represented as bottom box was near 20. For forward 290 bps, the 25-percentile score was less than 10. For reverse reads, the sequencing quality was significantly lower compared to the forward read. For reverse 180 bps and 220 bps, the 25-percentile score were 20 and less than 10, respectively. Since the target region is around 460 bps [16] and minimum of 10 bps overlapping is required for the joining, trimming was conducted between 240 to 290 bps in forward reads and between 180 to 250 bps in reverse reads considering the quality score.
Next, reads counts that passed each pre-processing step was calculated. The filtered and denoised read count was maximum when each forward reads were joined with shortest reverse reads. During this step, the reads dependent on the individual single-read sequencing result. For forward reads, although F240 has the highest sequencing quality, the 25-percentile is above 15 for F270. The filtered and denoised read counts reached maximum at F270 group. For reverse reads, the shortest reverse reads which as the highest sequencing quality showed most read counts. Interestingly, the read count was similar when 25-percentile quality score reached near 20 such as R180 and R190. Thus, individual single-read sequencing quality mostly have affected the read counts during filtering and denoising step.
For merged and non-chimeric read counts, F240/R240, F250/R230, F260/R220, F270/R210, F280/R200 and F290/ R190 showed maximum read counts for each forward read, respectively. The predicted target size for V3-V4 region is 464 bps [16] and minimum overlapping conditions only provided 6 bps to overlap. Thus, although shorter reverse reads have higher sequencing quality, the merged read count did not show maximum read counts. For forward reads, F270 produced the maximum merged read count suggesting 25-percentile score around 20 can be considered as the threshold for forward read trimming site. Taken together, maximum merged read count was observed when the overlapping region was 16 bps and 25-percentile forward read sequencing quality was near 20.
When the joined sequences were assigned to determine the taxonomy, trimming conditions that produced most nonchimeric read counts also showed the highest alpha diversity. To determine if trimming conditions has effect on taxa assignment, the overall relative abundance showed similar results. Although, relatively abundant species showed similar relative proportion, assigning more read counts can provide more chance to determine minor species that can contribute to disease development or can serve as biomarker for the disease.
Finally, detailed trimming conditions was studied focusing on F270. At preprocessing step, there was dramatic merge read count increase between F270/R206 and F270/R208. Since there is very small difference in sequencing quality, the overlapping length could be the main reason for this increase. The overlapping bps is 12 bps for F270/R206 and 14 bps for F270/ R208. When taxa assigned at this condition was determined, there was significant difference between F270/R206 and F270/R208. Although F270/R208 produced most non-chimeric read counts, this specific condition is very hard to capture in practice. The relative abundance reached similar among F270/ R210, F270/R212 and F270/R216. Taken together, the recommended trimming site for forward reads is the maximum length where 25-percentile sequencing quality is near 20 and minimum overlapping length is more than 16 bps.
In conclusion, we compared the trimming conditions for microbiomes analysis using DADA2 in QIIME2 for oral samples. Our result show that trimming site for forward reads is where 25-percentile sequencing quality score of 20 is maintained and minimum overlapping length should be more than 16 bps.