Introduction
Malocclusion is a common disease that impairs occlusal function, increases the incidence of dental caries, causes psychological discomfort, endangers overall health, and reduces quality of life. Careful treatment planning before treatment begins is essential for achieving satisfactory orthodontic outcomes. Orthodontic treatment planning is a complex process that, in most cases, relies heavily on the subjective judgment of the practicing orthodontist; the comprehensive and deliberate evaluation of numerous factors leaves no simple objective pattern to follow. For the same case, different orthodontists may propose different treatment plans depending on their clinical experience, and there is considerable variation in deciding whether to extract teeth and which teeth to extract. Therefore, researchers have attempted to make orthodontic treatment planning more objective using numerous prediction methods.
Owing to their strengths in pattern recognition and prediction, various artificial intelligence (AI) techniques have recently been used widely in orthodontics and related medical and dental applications to assist decision making, including tooth localization and numbering; the detection of dental caries, periodontal and periapical diseases, and cancerous oral lesions; the localization of cephalometric landmarks; image quality enhancement; orthodontic treatment outcome prediction; and the compensation of deformation errors in the additive manufacturing of prostheses.
Previous studies based on artificial neural networks (ANNs) have focused on the reproducibility of cephalometric landmark identification and the automatic detection of anatomical reference points from radiological images [1-12]. For example, a convolutional neural network (CNN) method has been used for investigating periapical and panoramic radiographs [13]. However, applications of AI in predicting orthodontic outcomes and related factors have remained relatively limited. Conditional generative adversarial networks (CGANs) [14] have recently gained significant attention in various image processing tasks [15-19], and their use in dental imaging is becoming increasingly common. Gong et al. [20] proposed a CGAN-based model for predicting post-orthodontic soft tissue profiles. Similarly, Kazangirler and Özcan [21] developed DentaGAN, a CGAN-based framework for generating individualized synthetic dental radiographs. Alkaabi et al. [22] enhanced CNN performance for forensic age estimation by combining CGANs with pseudo-labeling techniques.
Building upon these advances, the present study proposes CGAN-based prediction models that can predict dental images after orthodontic treatment with or without extraction. The proposed deep learning models were evaluated for predicting orthodontic treatment outcomes in extraction and non-extraction cases. The present study used the Pix2Pix program [23,24] with a CGAN to perform image-to-image prediction using paired datasets of patients' radiological images. A CGAN comprises a generator model and a discriminator model. The Pix2Pix software employs a U-net [25] for the generator and a convolutional PatchGAN classifier [26] for the discriminator. The U-net offers fast and precise segmentation while requiring less training image data because of its fully convolutional architecture. The PatchGAN classifier helps identify generated images by style rather than by content. The Pix2Pix software has the additional advantage of using the same noise function as other GANs [23,27]. In addition, the Pix2Pix software processes the learning and prediction of images in the same framework, regardless of the type of image input [23].
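To make the discriminator side of this architecture concrete, the following is a minimal sketch of a conditional PatchGAN discriminator in TensorFlow/Keras. The layer configuration and filter counts are illustrative assumptions, not the exact settings of the Pix2Pix software or of this study; the key idea is that the network outputs a grid of patch-wise real/fake judgments conditioned on the input image.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_patchgan_discriminator(img_shape=(256, 256, 1)):
    """Minimal PatchGAN discriminator: classifies local patches as real or fake."""
    src = layers.Input(shape=img_shape)   # pre-treatment (condition) image
    tgt = layers.Input(shape=img_shape)   # real or generated post-treatment image
    x = layers.Concatenate()([src, tgt])  # condition the discriminator on the input

    # Downsampling blocks; each halves the spatial resolution
    for filters in (64, 128, 256, 512):
        x = layers.Conv2D(filters, 4, strides=2, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)

    # 1-channel patch map: each output value judges one receptive-field patch
    patch_out = layers.Conv2D(1, 4, padding="same")(x)
    return tf.keras.Model([src, tgt], patch_out, name="patchgan_discriminator")

disc = build_patchgan_discriminator()
disc.summary()
```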
Materials and Methods
1. Cases collection: radiological image dataset
This study used the data of the lateral cephalograms of patients who received orthodontic treatment at the Department of Orthodontics, Gangneung-Wonju National University (GWNU) Dental Hospital between 2008 and 2019. The following inclusion criteria were used to select data for deep learning: (1) lateral cephalograms taken before and after comprehensive orthodontic treatment with fixed orthodontic appliances, (2) non-extraction treatment or extraction of more than two premolars, and (3) permanent dentition.
The exclusion criteria were (1) craniofacial deformities, (2) orthognathic surgery, (3) prosthodontic rehabilitation during orthodontic treatment, and (4) orthodontic appliances or metallic artifacts on the radiographic images. No radiographs were taken specifically for this study, and the protocol of this study was reviewed and approved by the Ethics Committee of GWNU Dental Hospital (IRB2019-006). The requirement to obtain informed consent was waived. The final datasets for deep learning consisted of radiographic images of patients in extraction (N = 390) and non-extraction (N = 209) groups according to their treatment protocols.
2. Pre-process of datasets for deep-learning model
Radiographic images were pre-processed by uniformly adjusting the gray level (degree of blackness) so that the region of interest (upper and lower soft tissues and upper and lower anterior teeth) could be identified. The radiological images were then cropped to include only the region from the tip of the nose to the tip of the chin, stored in digital number (DN) units ranging from 0 to 255 at a size of 256 × 256 pixels. The original radiological images contained full facial data with dimensions of 22,000 × 20,000. These data preserved the original information and were used as input datasets for training, validating, and applying the Pix2Pix-based prediction model. Notably, the Pix2Pix-based model was trained and tested on numerical arrays normalized to the range 0 to 1 for image-to-image translation in the Pix2Pix software, as follows:
$$X^{\mathrm{norm}}_{n,i,j} = \frac{X_{n,i,j} - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}}$$

where n is the number of datasets and Xn is the cropped image of the n-th patient. The subscripts i and j denote the row and column of the two-dimensional array of the cropped image, respectively. Max and Min denote the maximum and minimum values of the real observed data. The superimposed before- and after-treatment images were cropped to include the upper and lower dental arches, nasal tip, upper and lower lips, and chin in order to minimize unnecessary variance due to head posture, artifacts, and hair.
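As an illustration of this pre-processing step, the sketch below crops a cephalogram to the region of interest and applies the min-max normalization above; the cropping box and file handling are hypothetical and would in practice follow the landmark-based cropping described in the text.

```python
import numpy as np
from PIL import Image

def preprocess_cephalogram(path, crop_box, size=(256, 256)):
    """Crop the nose-to-chin region and min-max normalize DN values to [0, 1].

    crop_box is a hypothetical (left, upper, right, lower) pixel box chosen so
    that the crop spans the nasal tip to the chin for this patient.
    """
    img = Image.open(path).convert("L")          # 8-bit grayscale, DN 0-255
    img = img.crop(crop_box).resize(size)        # 256 x 256 region of interest
    x = np.asarray(img, dtype=np.float32)
    x_min, x_max = x.min(), x.max()              # Min/Max of the observed data
    return (x - x_min) / (x_max - x_min)         # normalized array in [0, 1]
```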
Fig. 1A shows an example of the pre-processed radiographic images. Fig. 1B shows a coordinate system based on the pre-treatment images. Pre- and post-treatment images were superimposed on the anterior cranial base, which remains relatively stable throughout orthodontic treatment [28]. The coordinate system was then applied to the post-treatment images to compensate for differences in head posture. Fig. 1C and 1D show examples of standardized images of patients before and after orthodontic treatment at a size of 256 × 256 for model training.
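The study superimposed pre- and post-treatment images on the anterior cranial base; the sketch below shows one possible way such a landmark-based rigid registration could be implemented with OpenCV. The landmark choice and the specific function are assumptions for illustration only, not the registration procedure actually used in this study.

```python
import cv2
import numpy as np

def register_post_to_pre(post_img, pre_landmarks, post_landmarks):
    """Align a post-treatment image to the pre-treatment coordinate system.

    pre_landmarks / post_landmarks: hypothetical (N, 2) arrays of corresponding
    points on the anterior cranial base digitized on each image; a rigid
    (rotation + translation + uniform scale) transform is estimated and applied
    to compensate for differences in head posture.
    """
    matrix, _ = cv2.estimateAffinePartial2D(
        np.asarray(post_landmarks, np.float32),
        np.asarray(pre_landmarks, np.float32),
    )
    h, w = post_img.shape[:2]
    return cv2.warpAffine(post_img, matrix, (w, h))
```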
3. Deep learning model - Pix2Pix model
For deep learning, Pix2Pix [24,29] uses a loss function (LPix2Pix) that combines an adversarial loss (LCGAN) and a reconstruction loss (L1), as follows [16]:
$$L_{\mathrm{Pix2Pix}} = \min_G \max_D \left\{ L_{\mathrm{CGAN}}(G, D) + \lambda\, L_1(G) \right\}$$

where G and D are the generator and discriminator models, minG maxD{·} denotes the minimax optimization between the generator and the discriminator, and λ is the trade-off parameter between the adversarial and reconstruction losses; in this study, we set λ = 1. First, the adversarial loss (LCGAN) is expressed as follows [14,23,30,31]:
$$L_{\mathrm{CGAN}}(G, D) = \mathbb{E}_{X_R, Y_R}\!\left[\log D(X_R, Y_R)\right] + \mathbb{E}_{X_R}\!\left[\log\!\left(1 - D(X_R, Y_V)\right)\right], \qquad Y_V = G(X_R)$$

where XR and YR are the pairs of real input and output data, and YV is the Pix2Pix-generated virtual output. D attempts to maximize the probability of discriminating real from virtual data across the two cross-entropy terms, while G tries to minimize the second cross-entropy term by generating virtual images that D classifies as real. The log function in the cross-entropies was introduced to compensate for the gradient insufficiency at the beginning of training [31]. Second, the reconstruction loss (L1) is a traditional standard loss, as follows [27]:
$$L_1(G) = \mathbb{E}_{X_R, Y_R}\!\left[\left\lVert Y_R - Y_V \right\rVert_1\right]$$

where the L1 loss minimizes the distance between the generated dataset (YV) and the real output dataset (YR) to reduce blurring effects.
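A minimal sketch of how these two loss terms can be combined in TensorFlow is shown below, assuming a binary cross-entropy on the PatchGAN logits; the function names and the use of Keras losses are illustrative rather than the study's exact implementation.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
LAMBDA = 1.0  # trade-off between adversarial and reconstruction losses (lambda = 1 in this study)

def generator_loss(disc_fake_output, y_real, y_virtual):
    """Adversarial term (fool the discriminator) plus the L1 reconstruction term."""
    adv = bce(tf.ones_like(disc_fake_output), disc_fake_output)
    l1 = tf.reduce_mean(tf.abs(y_real - y_virtual))
    return adv + LAMBDA * l1

def discriminator_loss(disc_real_output, disc_fake_output):
    """Cross-entropy for classifying real pairs as 1 and generated pairs as 0."""
    real_loss = bce(tf.ones_like(disc_real_output), disc_real_output)
    fake_loss = bce(tf.zeros_like(disc_fake_output), disc_fake_output)
    return real_loss + fake_loss
```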
4. Pairing the dataset for adversarial learning

Mathematically, the dataset consists of pairs of radiological images of the same patient at different observation times, as follows:
$$\mathrm{Dataset} = \left\{ \big(P_1(t_b), P_1(t_a)\big),\ \big(P_2(t_b), P_2(t_a)\big),\ \ldots,\ \big(P_n(t_b), P_n(t_a)\big) \right\}$$

where P1(tb) and P1(ta) are the real radiological images of the same patient 1 at time tb before treatment and time ta after treatment, and the subscript n is the patient number.
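The pairing can be expressed programmatically as a simple stacking of corresponding pre- and post-treatment arrays, as in the hedged sketch below (single-channel arrays are shown for simplicity; the study's predictor actually stacked two images into a 256 × 256 × 2 array, as described in section 5).

```python
import numpy as np

def build_paired_dataset(pre_images, post_images):
    """Stack pre-/post-treatment images of the same patients into (X, Y) training pairs.

    pre_images and post_images are hypothetical lists of 256 x 256 arrays already
    normalized to [0, 1], ordered so that index n corresponds to the same patient.
    """
    x = np.stack(pre_images).astype(np.float32)[..., np.newaxis]   # (N, 256, 256, 1)
    y = np.stack(post_images).astype(np.float32)[..., np.newaxis]  # (N, 256, 256, 1)
    return x, y
```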
In this study, our prediction model was iteratively trained, and the model was selected at the iteration step that showed the best correlation coefficient (CC) and minimum root mean square error (RMSE) between the real radiological images and the Pix2Pix-generated virtual radiological images. Notably, this study did not use the Pix2Pix loss values for model selection because the minimum loss did not guarantee the best quantitative results.
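A sketch of this checkpoint-selection strategy is given below, assuming a hypothetical collection of generator snapshots saved at different iteration steps; selection is driven by CC, with RMSE reported alongside.

```python
import numpy as np

def select_best_iteration(checkpoints, x_val, y_val):
    """Pick the iteration whose generator gives the highest CC between real and
    generated validation images.

    checkpoints is a hypothetical dict mapping iteration number -> trained generator.
    """
    best_iter, best_cc, best_rmse = None, -np.inf, np.inf
    for iteration, generator in checkpoints.items():
        y_pred = generator.predict(x_val)
        cc = np.corrcoef(y_val.ravel(), y_pred.ravel())[0, 1]
        rmse = np.sqrt(np.mean((y_val - y_pred) ** 2))
        if cc > best_cc:
            best_iter, best_cc, best_rmse = iteration, cc, rmse
    return best_iter, best_cc, best_rmse
```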
After model construction, the Pix2Pix prediction model generates the virtual radiological images of patients as follows:
$$P_{V,i} = G\big(P_i(t_b)\big)$$

where PV,i is the Pix2Pix-generated virtual radiological image for patient i, whose data were not used in the Pix2Pix model construction, and G is the trained generator. Thus, we can estimate a virtual image reflecting the effects of orthodontic treatment.
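For illustration, inference with a trained generator could look like the following sketch; the saved-model file name is hypothetical, and a single-channel input is shown for simplicity even though the study's predictor used a 256 × 256 × 2 array.

```python
import numpy as np
import tensorflow as tf

# Load a hypothetical trained generator (extraction or non-extraction model)
generator = tf.keras.models.load_model("pix2pix_extraction_generator.h5")

def predict_post_treatment(pre_image):
    """Generate a virtual post-treatment image from a normalized pre-treatment image."""
    x = pre_image[np.newaxis, ..., np.newaxis].astype(np.float32)  # (1, 256, 256, 1)
    virtual = generator.predict(x)[0, ..., 0]                      # back to (256, 256)
    return np.clip(virtual, 0.0, 1.0)
```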
5. Training, validation, and application

For the Pix2Pix prediction model development, the paired dataset of radiological images of the same patient at different times was used, with X as the predictor (before treatment) and Y as the predictand (after treatment). For example, X is the radiological image observed before treatment at a certain time point, whereas Y is the radiological image observed after treatment. In this study, X was constructed as an array of size 256 × 256 × 2 by stacking two images, and Y as an array of size 256 × 256 × 1 for prediction. In total, 390 pairs of extraction-case data and 209 pairs of non-extraction-case data were used to train and test the Pix2Pix prediction model.
To train the Pix2Pix prediction model, we used 330 and 170 pairs of pre-processed radiological images (256 × 256 pixels) of patients treated orthodontically with and without extraction, respectively. During this process, the generator was trained to produce virtual radiological images resembling the real patient images, while the discriminator was trained to distinguish the generated images from the real ones. For Pix2Pix model testing, 30 extraction and 20 non-extraction pairs of patient data excluded from the training datasets were used. To apply the trained and tested Pix2Pix model, another 30 and 19 pairs of datasets of the same pixel size were used, respectively. Fig. 2A illustrates the structures of G and D in the Pix2Pix prediction model. The arrows indicate the different operations and activation functions of each layer. Each blue box corresponds to a feature map; the image size and number of channels are denoted at the bottom of each box (256, 256, N). The orange boxes represent the copied feature maps. Fig. 2B shows the procedure of the Pix2Pix prediction model, including the pre-processing, model training, and post-processing steps for application.
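A hedged sketch of a single adversarial training step is shown below, reusing the generator_loss and discriminator_loss functions from the earlier loss sketch; the optimizer settings are illustrative and not the study's exact hyperparameters.

```python
import tensorflow as tf

g_optimizer = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)  # illustrative hyperparameters
d_optimizer = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)

@tf.function
def train_step(generator, discriminator, x_real, y_real):
    """One adversarial update: the generator learns to mimic post-treatment images,
    the discriminator learns to tell real image pairs from generated ones."""
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        y_virtual = generator(x_real, training=True)
        d_real = discriminator([x_real, y_real], training=True)
        d_fake = discriminator([x_real, y_virtual], training=True)
        g_loss = generator_loss(d_fake, y_real, y_virtual)      # from the loss sketch above
        d_loss = discriminator_loss(d_real, d_fake)
    g_grads = g_tape.gradient(g_loss, generator.trainable_variables)
    d_grads = d_tape.gradient(d_loss, discriminator.trainable_variables)
    g_optimizer.apply_gradients(zip(g_grads, generator.trainable_variables))
    d_optimizer.apply_gradients(zip(d_grads, discriminator.trainable_variables))
    return g_loss, d_loss
```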
The Pix2Pix model was trained, tested, and applied using TensorFlow with Python 3.7.6 on Linux Ubuntu 18.04.5, with CUDA 10.0 and cuDNN 7.6.5, running on four NVIDIA Titan-Xp GPUs and an Intel Xeon CPU. The training time of the Pix2Pix prediction model was approximately 2 hours for both the extraction and non-extraction models.
The real and Pix2Pix-predicted radiological images were statistically compared pixel-by-pixel using the CC, bias, and RMSE as follows [32]:
$$\mathrm{CC} = \frac{\sum_{i=1}^{n}\left(P_{R,i} - \bar{P}_R\right)\left(P_{V,i} - \bar{P}_V\right)}{\sqrt{\sum_{i=1}^{n}\left(P_{R,i} - \bar{P}_R\right)^2}\sqrt{\sum_{i=1}^{n}\left(P_{V,i} - \bar{P}_V\right)^2}}$$

$$\mathrm{Bias} = \frac{1}{n}\sum_{i=1}^{n}\left(P_{V,i} - P_{R,i}\right)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(P_{V,i} - P_{R,i}\right)^2}$$

where i is the index from 1 to n (the total number of pixels in the real radiological image data), PR,i is the DN value of the i-th pixel in the real radiological images, and PV,i is the DN value of the i-th pixel in the virtual radiological images. P̄R and P̄V are the mean DN values of the real and Pix2Pix-predicted virtual radiological images, respectively.
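These three measures can be computed directly from the pixel arrays, as in the following sketch (NumPy, DN-valued images assumed):

```python
import numpy as np

def compare_images(p_real, p_virtual):
    """Pixel-by-pixel CC, bias, and RMSE between a real and a predicted image (DN values)."""
    r = p_real.astype(np.float64).ravel()
    v = p_virtual.astype(np.float64).ravel()
    cc = np.corrcoef(r, v)[0, 1]               # Pearson correlation coefficient
    bias = np.mean(v - r)                      # mean signed difference
    rmse = np.sqrt(np.mean((v - r) ** 2))      # root mean square error
    return cc, bias, rmse
```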
Results

Fig. 3 shows the variations in CC and RMSE between the real and Pix2Pix-predicted radiographic images of the validation datasets. The Pix2Pix model reached a maximum CC value of 0.8767 and a minimum RMSE value of 9.0594 DN for the extraction case, and a maximum CC value of 0.8686 and a minimum RMSE value of 8.8808 DN for the non-extraction case. The most common optimal number of iterations across the validation datasets was 36,800; therefore, we adopted the Pix2Pix model trained for this number of iterations to predict virtual radiographic images for the extraction case. For the non-extraction case, we adopted the Pix2Pix prediction model at 26,200 iterations.
Fig. 4 shows the virtual images predicted for the extraction treatment. The real pre-treatment dental images were used as inputs to our extraction and non-extraction deep learning models for prediction. Fig. 4A and 4B show the real pre- and post-extraction orthodontic treatment images, respectively. Fig. 4C and 4D show the images predicted by the extraction and non-extraction Pix2Pix prediction models, respectively, using the actual patient pre-treatment image (Fig. 4A) as the input data. The large anterior overjet before treatment (Fig. 4A) was improved dramatically in both predicted images, which showed normal overjet and overbite relationships (Fig. 4C and 4D). However, a significant difference was found in the position and angulation of the lower incisors, which suggests that excessive tipping of the incisors in the non-extraction protocol may deteriorate the alveolar bone level after orthodontic treatment. Fig. 4E and 4F show the differences between the real post-extraction treatment images and the images predicted by the extraction and non-extraction models. Statistical comparisons between the real and predicted images yielded CC = 0.9014 and 0.8965, respectively, demonstrating that the Pix2Pix-based model predicted dental images with high accuracy.
Fig. 5 shows the predicted results for the non-extraction orthodontic treatment. Fig. 5A and 5B show the real images before and after non-extraction orthodontic treatment, respectively. Fig. 5C and 5D show the images predicted by the extraction and non-extraction models, respectively, using the actual patient pre-treatment image (Fig. 5A) as the input data. Although the image predicted by the extraction model (Fig. 5C) showed more retracted anterior incisors and lip profile than that from the non-extraction model (Fig. 5D), the differences between the two predicted images were clinically insignificant. Fig. 5E and 5F show the differences between the real post-treatment images and the images predicted using the extraction and non-extraction Pix2Pix models. Statistical comparisons between the real and predicted images yielded CC = 0.8989 and 0.9064, respectively, demonstrating that the Pix2Pix-based model predicted dental images with high accuracy.
Table 1 summarizes the statistically averaged results for the extraction and non-extraction cases. The CCs between predicted and actual images were 0.8767 for the extraction cases and 0.8686 for the non-extraction cases. Based on the sample sizes that were tested, the corresponding t-values are approximately 13.9 and 10.8, respectively, indicating highly statistically significant correlations (p < 0.0001). These results strongly support the accuracy and reliability of the CGAN-based prediction model.
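For reference, the t-statistic of a Pearson correlation follows from the CC value r and the sample size n; the back-calculation below, which assumes sample sizes of roughly 60 extraction and 40 non-extraction image pairs (an assumption, since the exact n is not restated here), reproduces values close to those reported.

```latex
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}, \qquad
t_{\text{ext}} \approx \frac{0.8767\sqrt{60-2}}{\sqrt{1-0.8767^{2}}} \approx 13.9, \qquad
t_{\text{non-ext}} \approx \frac{0.8686\sqrt{40-2}}{\sqrt{1-0.8686^{2}}} \approx 10.8
```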
Discussion
This study presented a system developed for predicting post-orthodontic treatment images using Pix2Pix algorithms, and established a deep learning-based prediction model for the outcomes of extraction and non-extraction orthodontic treatments. The results of the present study indicated that the facial profile may change following comprehensive orthodontic treatment with extraction of permanent teeth in some patients, and that this change can be either a deterioration or an improvement. Decision-making regarding extraction typically depends on the amount of crowding, the inclination or position of the anterior teeth, the arch length discrepancy, and the soft tissue facial profile before treatment [33]. The alignment of teeth without extraction can displace the anterior teeth forward in patients with crowding, resulting in protruding lips. In contrast, the space created by extraction therapy can help relieve crowding or retract the anterior teeth, while reducing facial convexity. Many other factors, however, also influence the decision to extract or not, and disagreements between orthodontists have been reported in borderline cases, depending on the clinicians' experience [34-36]. The present study proposed a deep learning model to predict the treatment outcomes of extraction and non-extraction protocols. The results indicated that the prediction models showed high agreement with real post-treatment radiographs, which could assist clinicians in planning orthodontic treatments.
Orthodontic treatment outcomes depend on various factors, including anchorage type, extraction site, age, sex, growth, and skeletal relationships. However, the choice between extraction and non-extraction is a primary factor affecting treatment outcomes; therefore, this study selected extraction or non-extraction as the deciding factor. Predicting the results before choosing an extraction or non-extraction protocol would be extremely useful in clinical practice, allowing orthodontists to modify their treatment strategies as treatment progresses.
The present study has a few limitations. First, no consideration was given to the patients’ skeletal growth patterns or sex. Second, the long-term skeletal effects of orthodontic treatment were not evaluated. Third, images showing significant growth in the maxillofacial area during the period of orthodontic treatment were excluded.
Despite a few limitations, this Pix2Pix-based prediction model can assist orthodontists in determining the necessity of tooth extraction if the model is trained using sufficient datasets. Cephalometric radiographs are crucial diagnostic tools in orthodontics for treatment planning and evaluation. Few studies have used AI techniques in lateral cephalograms, and previous AI-based research has focused solely on recognizing landmarks or measuring linear or angular variables [5,37].
Compared to these previous studies, which utilized ANN or CNN methods and anatomical reference points to predict images and to evaluate extraction decisions, our Pix2Pix-based prediction model uses a two-dimensional image in an image-to-image translation method to predict treatment outcomes. The Pix2Pix model has the advantage of providing a more intuitive prediction than anatomical reference index-based ANN or CNN models, and it also provides higher prediction accuracy and a virtual image close to the real image through adversarial self-verification. The present study evaluated Pix2Pix-based models constructed specifically under extraction and non-extraction conditions. Of note, a previous CNN-based decision-making aid model [6] was applied only to extraction cases. The Pix2Pix-based model evaluated in the present study, however, can provide useful information that orthodontists can utilize to make more rational treatment plan decisions, because the predicted treatment outcomes are shown as virtual images.
The Pix2Pix-based dental image prediction model can simultaneously provide virtual images in both extraction and non-extraction cases, with an accuracy of approximately CC = 0.8686 or higher. This Pix2Pix-based prediction model, which predicts post-treatment images using image-to-image translation, provides quantitative and qualitative supplemental data. Consequently, this study can assist orthodontists' decision-making by providing more objective criteria for treatment planning.