Early diagnosis of Autism Spectrum Disorder (ASD) plays a crucial role in supporting a child's development, particularly in improving social communication and language development and in addressing behavioural challenges. Early signs of autism may be observable in childhood, but a formal diagnosis often occurs much later. Behaviour-based assessments, such as the Autism Diagnostic Interview-Revised (ADI-R) and Autism Diagnostic Observation Schedule-Revised (ADOS-R), are currently used to diagnose ASD, but they are time-consuming and require trained professionals. To address these drawbacks, this study uses deep learning, which extracts features automatically from Magnetic Resonance Imaging (MRI) data and removes the reliance on subjective, pre-defined features. Automatically learned features can capture subtle information that hand-crafted features may miss, which can improve classification accuracy. The dataset comprises axial-view MRI images from the ABIDE-I dataset of the Autism Brain Imaging Data Exchange (ABIDE) database. This study proposes a dual-track feature fusion network that combines a Swin Transformer with a customised Convolutional Neural Network (CNN) for classification. The Swin Transformer captures long-range dependencies within an image, modelling relationships among distant image regions, while the CNN extracts local features; fusing the two lets the classifier consider both local and global information. On the ABIDE dataset, the proposed feature fusion network achieves an accuracy of 98.7%, a precision of 98.12%, a recall of 98.77%, and an F1-score of 98.65%.
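The four evaluation figures reported above follow the standard definitions for a two-class problem. The minimal sketch below shows how they are computed from predicted and true labels. It is an illustration only: the function name `binary_metrics`, the label encoding (1 = ASD, 0 = control), and the toy labels are assumptions, not taken from the study, and the abstract does not state whether the reported scores were averaged across classes or folds.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels.

    `positive` marks the class treated as the positive one
    (assumed here to be ASD, encoded as 1).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return BinaryMetrics(accuracy, precision, recall, f1)


# Hypothetical toy labels, not data from the study.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]
print(binary_metrics(y_true, y_pred))
```

Because F1 is the harmonic mean of precision and recall, it always lies between the two when computed on a single confusion matrix; scores averaged over classes or folds need not satisfy this.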