A Deep CNN-Augmented Vision Transformer Framework for Clinical Diagnosis of X-Ray Bone Fractures

Research Paper

Abstract

In the realm of healthcare, efficient and accurate classification of bone fractures plays a pivotal role in timely diagnosis and treatment planning. This study introduces a methodology that combines sophisticated image preprocessing with a state-of-the-art deep learning (DL) architecture: a Vision Transformer (ViT) built on an EfficientNet-V2B0 backbone. The preprocessing pipeline applies a series of transformations to bone X-ray images, including resizing, histogram equalization, Gaussian blurring, median filtering, bilateral filtering, and normalization. These steps collectively enhance the quality and relevance of the input data, laying a robust foundation for subsequent classification. The heart of the proposed approach is the integration of the ViT with the EfficientNet-V2B0 backbone; this fusion of attention-based mechanisms and an efficient convolutional architecture captures intricate patterns within the images and retains discriminative information throughout the classification process. Our experimental results set a new benchmark in bone fracture classification: the proposed EfficientNet-V2B0 ViT model achieves an accuracy of 99.55%, a precision of 99.43%, a recall of 99.65%, an F1 score of 99.55%, and a specificity of 99.75%, substantially outperforming existing approaches in the field. This research advances bone fracture classification and underscores the potential of combining careful image preprocessing with advanced DL architectures to achieve high accuracy and reliability in healthcare diagnostics.
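
To make the described pipeline concrete, the following is a minimal sketch of the preprocessing stage, assuming OpenCV is used; the kernel sizes and filter parameters shown are illustrative assumptions, since the abstract does not specify them.

```python
import cv2
import numpy as np

def preprocess_xray(path, size=(224, 224)):
    """Sketch of the described pipeline: resize, histogram equalization,
    Gaussian blur, median filter, bilateral filter, normalization.
    Kernel sizes and filter parameters are illustrative assumptions."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)   # X-ray loaded as single-channel
    img = cv2.resize(img, size)                    # resizing
    img = cv2.equalizeHist(img)                    # histogram equalization
    img = cv2.GaussianBlur(img, (3, 3), 0)         # Gaussian blurring
    img = cv2.medianBlur(img, 3)                   # median filtering
    img = cv2.bilateralFilter(img, 5, 75, 75)      # bilateral filtering
    return img.astype(np.float32) / 255.0          # normalization to [0, 1]
```

Similarly, the CNN-augmented ViT can be sketched in Keras by treating the EfficientNetV2B0 feature map as a token sequence fed to a small Transformer encoder. The token width, number of heads, encoder depth, input resolution, and class count below are illustrative assumptions, not the paper's reported configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_effnetv2b0_vit(input_shape=(224, 224, 3), num_classes=2,
                         embed_dim=256, num_heads=4, depth=2):
    # EfficientNetV2-B0 as the convolutional feature extractor (backbone)
    backbone = tf.keras.applications.EfficientNetV2B0(
        include_top=False, weights="imagenet", input_shape=input_shape)

    inputs = layers.Input(shape=input_shape)
    feat = backbone(inputs)                                # (B, H', W', C) feature map
    tokens = layers.Reshape((-1, feat.shape[-1]))(feat)    # spatial grid -> token sequence
    tokens = layers.Dense(embed_dim)(tokens)               # project tokens to ViT width

    # ViT-style Transformer encoder blocks over the CNN tokens
    for _ in range(depth):
        x = layers.LayerNormalization()(tokens)
        x = layers.MultiHeadAttention(num_heads=num_heads,
                                      key_dim=embed_dim // num_heads)(x, x)
        tokens = layers.Add()([tokens, x])                 # residual: attention
        x = layers.LayerNormalization()(tokens)
        x = layers.Dense(embed_dim * 2, activation="gelu")(x)
        x = layers.Dense(embed_dim)(x)
        tokens = layers.Add()([tokens, x])                 # residual: MLP

    x = layers.GlobalAveragePooling1D()(tokens)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```

In practice, the single-channel preprocessed image would be replicated to three channels before being passed to an ImageNet-pretrained backbone such as the one assumed above.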

Authors

Pandiyaraju V; Shravan Venkatraman; Abeshek A; Aravintakshan S A; Pavan Kumar S; Kannan A