Introduction

Breast cancer is the most common site-specific cancer in women and is the leading cause of death from cancer for women aged 20 to 59 years. It accounts for 26% of all newly diagnosed cancers in females and is responsible for 15% of the cancer-related deaths in women. After lung cancer, breast cancer is the most common cause of cancer-related death in the world. The latest cancer statistics for the USA estimate that in 2022, 31% of cancer cases detected in women were breast cancer, with 43,250 cases resulting in death, which accounts for 15% of all cancer-related deaths [1,2]. According to the American Cancer Society, approximately 600 men and 40,000 women die of breast cancer annually [3].

About 15% of women who get breast cancer have a positive family history of it, but only 5-10% of breast cancers are inherited. The most significant risk factors for breast cancer are female gender and age [4].

Breast tumours can be classified as benign, in situ carcinoma, or invasive carcinoma. Benign tumours are not considered dangerous because they cause only a slight alteration in breast anatomy. In situ carcinomas are not dangerous if detected in the early stage and treated, because only the mammary ducts or lobules are affected and the tumour does not spread to other tissues. Invasive carcinoma is the most dangerous type because it can spread to any organ in the body [3].

Early detection is a crucial step in controlling and treating breast cancer. It aims to detect breast cancer as early as possible in asymptomatic women to offer them a better chance of cure. Early breast cancer screening through mammography, ultrasound, or magnetic resonance imaging (MRI) has played an important role in detecting breast cancer at an early stage, reducing the mortality rate, and improving prognosis; the mortality rate of breast cancer dropped by 40% from 1989 to 2017, which translates to 375,900 breast cancer deaths averted [1,2].

The prognosis of breast cancer is critically affected by early detection and treatment; for example, in 2020 more than 65% of breast cancer patients were diagnosed in the early stage of cancer and survived. There has been a significant decrease in breast cancer-related mortality in the United States between 1975 and 2000, and this is attributed to continued improvement in both screening mammography and treatment [4,5].

The clinical cure rate of breast cancer is highly optimistic and exceeds 90% if diagnosed in the early stage, and this rate decreases as the disease progresses, ranging from 50% to 70% in the middle stage, while treatment is typically not effective in the late stage [3].

Different breast cancer imaging modalities, such as mammography, ultrasound (US), MRI, and histopathological imaging, can be used to detect and analyse the key features affecting diagnosis and treatment.

Mammography is the mainstay of breast cancer screening and an important method for the detection and staging of breast cancer, evaluation of treatment efficacy, and follow-up examination. It is one of the most widely used modalities for early breast cancer detection and has been shown to decrease mortality in multiple randomised clinical trials. Despite this, its performance is often unsatisfactory, with limited sensitivity (one in 8 cancers is missed during interpretation) and a high false-positive rate (fewer than 30% of biopsies are malignant). Approximately 9-10% of the 40 million US women who undergo routine breast screening each year are recalled for additional diagnostic imaging, and only 4% to 5% of the women recalled are ultimately diagnosed with breast cancer. Because of these shortcomings of mammography, the need for adjunct imaging modalities has increased [2,6,7].
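The recall figures quoted above can be turned into absolute numbers with simple arithmetic. The following Python sketch is only an illustration of that calculation; the function and variable names are hypothetical, and the inputs are the percentages cited in the text, not new data.

```python
# Illustrative arithmetic on the screening figures quoted in the text.
# All names here are hypothetical; integer percentages keep the maths exact.

SCREENED_PER_YEAR = 40_000_000  # US women screened each year (from the text)


def recall_burden(screened, recall_pct, cancer_pct):
    """Return (recalled, diagnosed, recalled_without_cancer) as whole women."""
    recalled = screened * recall_pct // 100     # women called back for imaging
    diagnosed = recalled * cancer_pct // 100    # recalled women with cancer
    return recalled, diagnosed, recalled - diagnosed


# Lower bound: 9% recalled, 4% of those diagnosed with breast cancer
low = recall_burden(SCREENED_PER_YEAR, 9, 4)
# Upper bound: 10% recalled, 5% of those diagnosed with breast cancer
high = recall_burden(SCREENED_PER_YEAR, 10, 5)

print("lower bound:", low)
print("upper bound:", high)
```

Under these quoted percentages, between 3.6 and 4.0 million women are recalled each year, and at least 95% of them turn out not to have breast cancer.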

Radiologists view and interpret breast images produced by these modalities and use them for diagnosis. Analysing breast images remains difficult due to the high heterogeneity of breast tumours and the long hours worked by radiologists. The benefit of imaging depends on subjective human interpretation to extract all diagnostic information from the acquired images, which can lead to misjudgement and misdiagnosis, lower cancer detection sensitivity and specificity, and large inter-reader variability. Thus, new automatic methods that analyse all kinds of breast screening images are required to assist radiologists in interpreting images. To help overcome these clinical challenges, researchers have made great efforts to develop computer-aided detection and/or diagnosis (CAD) schemes for breast images to provide radiologists with decision-making support tools [4].

Artificial intelligence (AI) provides the capability of a computer system to interpret and analyse breast imaging, which can alleviate potential human errors.

AI applications have shown excellent performance in various image recognition tasks, and their use in breast cancer screening has been explored in numerous studies [3].

Methodology

A systematic literature review was conducted to compare the diagnostic performance of AI with that of conventional radiologists in detecting breast cancer using mammograms, following the PRISMA guidelines (Figure 1). A meta-analysis was performed using Review Manager (RevMan) version 5.4. Pooled sensitivity and specificity with corresponding 95% confidence intervals (CIs) were calculated for both AI and radiologists. Heterogeneity among studies was assessed using the I² statistic, with values greater than 50% indicating substantial heterogeneity. The significance of heterogeneity was tested using the χ² test, with a p-value of less than 0.10 considered significant.

Figure 1

Flow chart of the study selection process

https://www.polradiol.com/f/fulltexts/195520/PJR-90-195520-g001_min.jpg
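To make the heterogeneity statistics in the methodology concrete, the following Python sketch pools proportions (such as per-study sensitivities) on the logit scale with inverse-variance weights and derives Cochran's Q and I². It is a simplified fixed-effect illustration under stated assumptions, not the RevMan 5.4 procedure used in this review; the function name `logit_pool` and the example counts are hypothetical.

```python
import math


def logit_pool(studies):
    """Pool proportions from (events, total) pairs on the logit scale.

    Assumption: a 0.5 continuity correction is applied to every study so
    that the logit stays finite when a study has 0 or 100% events.
    """
    effects, weights = [], []
    for events, total in studies:
        a = events + 0.5              # e.g. true positives
        b = total - events + 0.5      # e.g. false negatives
        effects.append(math.log(a / b))            # logit of the proportion
        weights.append(1.0 / (1.0 / a + 1.0 / b))  # inverse of logit variance

    w_sum = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / w_sum
    se = math.sqrt(1.0 / w_sum)

    # Cochran's Q: weighted squared deviations from the pooled estimate
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(studies) - 1
    # I^2: share of total variation attributed to between-study heterogeneity
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    def expit(x):
        return 1.0 / (1.0 + math.exp(-x))

    return {
        "pooled": expit(pooled),
        "ci": (expit(pooled - 1.96 * se), expit(pooled + 1.96 * se)),
        "Q": q,
        "df": df,
        "I2": i2,
    }


# Hypothetical counts (detected cancers, all cancers) for three studies
example = logit_pool([(142, 160), (640, 739), (87, 100)])
print(example)
```

The significance test mentioned in the text corresponds to comparing Q with a χ² distribution on `df` degrees of freedom; that step is omitted here to keep the sketch free of third-party dependencies.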

Study selection, inclusion, and exclusion criteria

Comprehensive searches of PubMed, Scopus, and Web of Science used keywords such as “artificial intelligence”, “radiologists”, “breast cancer”, “mammograms”, and “diagnostic accuracy”. All studies retrieved by the search were imported into Rayyan for deduplication and were then screened by title, abstract, and full text. Inclusion and exclusion criteria were predetermined. Studies were selected if they:

  • compared AI systems with radiologists in interpreting mammograms;

  • reported sensitivity and specificity data;

  • were published in peer-reviewed journals;

  • involved human subjects.

Studies were excluded if they were non-English-language articles, duplicated studies, book chapters, conference abstracts, studies employing animal subjects, or studies with no relevant data. Extracted data included study characteristics.
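The sensitivity and specificity values required for inclusion are derived from a 2×2 table of test results against the reference standard. As a minimal sketch, assuming the four cell counts are available for a study, they could be computed as follows; the function names and the example counts are hypothetical, and the Wilson score interval is one common choice of confidence interval rather than the method stated by the authors.

```python
import math


def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    p = successes / n
    denom = 1.0 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


def diagnostic_accuracy(tp, fn, tn, fp):
    """Sensitivity and specificity with intervals from 2x2 counts.

    tp/fn: cancers correctly flagged / missed by the reader (AI or human)
    tn/fp: non-cancers correctly cleared / wrongly flagged
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": (sensitivity, wilson_ci(tp, tp + fn)),
        "specificity": (specificity, wilson_ci(tn, tn + fp)),
    }


# Hypothetical study: 100 cancers and 1000 non-cancers
result = diagnostic_accuracy(tp=85, fn=15, tn=890, fp=110)
print(result)
```

Per-study estimates of this kind are the inputs that a meta-analysis then pools across studies.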

Results

This systematic review and meta-analysis included 8 studies with data from a total of 120,950 patients (Tables 1 and 2). The pooled analysis of AI sensitivity, based on 6 studies with sensitivities ranging from 0.70 to 0.89, yielded a sensitivity of 0.85 (95% CI: [0.79–0.89]) (Figure 2). By comparison, the sensitivity of radiologists ranged from 0.63 to 0.85, with a pooled sensitivity of 0.77 (95% CI: [0.69–0.83]) (Figure 3). Significant heterogeneity was detected in both cases (I² > 90%, p < 0.01).

Table 1

Baseline table for the included studies

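| Authors, year [Ref.] | n.e | Sn AI | Sp AI | AUC AI | n.c | Sn Rad | Sp Rad | AUC Rad |
|---|---|---|---|---|---|---|---|---|
| Kim et al., 2020 [17] | 160 | 88.75 | 82 | 0.94 | | 75.27 | 71.61 | 0.81 |
| Watanabe et al., 2019 [19] | 90 | | | 0.84 | | | | 0.814 |
| Salim et al., 2020 [18] | 739 | 86.70 | 93 | | | 85.00 | 98.50 | |
| Rodriguez Ruiz et al., 2019 [19] | 2652 | | | 0.84 | | | | 0.814 |
| Rodriguez Ruiz et al., 2018 [24] | 240 | 0.86 | 79 | 0.89 | | 83.00 | 77.00 | 0.87 |
| Lee et al., 2020 [22] | 100 | 87.00 | 79 | 0.92 | | 62.90 | 68.70 | 0.748 |
| Akselrod-Balin et al., 2019 [23] | 2548 | 87.00 | 77 | 0.91 | | | | |
| Lauritzen et al., 2022 [21] | 114,421 | 69.70 | 99 | | | 70.80 | 98.10 | |

Sn: sensitivity; Sp: specificity; AUC: area under the curve; Rad: radiologists.

Table 2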

Demographic data of the included studies

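Population: Women undergoing mammography
Index tests: AI and radiologists (ref: histopath diagnosis)
Target: Early breast CA

| Study: Author, year [Ref.] | Data set / subgroup | No of pts | Sn AI | Sn Rad | Sp AI | Sp Rad | AUC AI | AUC Rad |
|---|---|---|---|---|---|---|---|---|
| Kim, 2020 [17] | | Total 320 (n = 160) | | | | | 0.959 | 0.81 |
| | South Korea data sets | | | | | | 0.97 | |
| | US data set | | | | | | 0.953 | |
| | UK data set | | | | | | 0.938 | |
| | Reader study | | 88.75 | 75.27 | 81.87 | 71.61 | 0.94 | |
| Watanabe, 2019 [19] | | Total 122 (n = 90) | | | | | 0.84 | 0.814 |
| | | | | | | | 0.6639 | 0.759 |
| Salim, 2020 [18] | Overall | Total 113,663 (n = 739) | 86.7 | 85 | 92.5 | 98.5 | | |
| | 1 | | 81.9 | 77.4 | 96.6 | 96.6 | 0.956 | |
| | 2 | | 67 | 80.1 | 96.6 | 97.2 | 0.922 | |
| | 3 | | 67.4 | | 96.7 | | 0.92 | |
| Dembrower, 2020 [13] | | Total 7364 (n = 547) | | | | | | |
| Rodriguez Ruiz, 2019 [16] | Overall | Total 1393 (n = 499) | | | | | 0.84 | 0.814 |
| | 1 | | 0.85 | 0.84 | 0.49 | | 0.783 | 0.769 (0.698, 0.840) |
| | 2 | | N/A | N/A | N/A | | 0.915 | 0.907 (0.854, 0.961) |
| | 3 | | 0.8 | 0.77 | 0.79 | | 0.879 | 0.858 (0.814, 0.901) |
| | 4 | | 0.85 | 0.77 | 0.67 | | 0.85 | 0.815 (0.767, 0.864) |
| | 5 | | 0.86 | 0.82 | 0.54 | | 0.825 | 0.787 (0.732, 0.841) |
| | 6 | | 0.81 | 0.83 | 0.51 | | 0.796 | 0.803 (0.763, 0.843) |
| | 7 | | 0.86 | 0.84 | 0.68 | | 0.852 | 0.860 (0.831, 0.889) |
| | 8 | | 0.75 | 0.76 | 0.75 | | 0.817 | 0.808 (0.752, 0.859) |
| | 9 | | 0.81 | 0.83 | 0.73 | | 0.861 | 0.841 (0.785, 0.897) |
| Sickles, 2002 [25] | | | | | | | | |
| Rodriguez Ruiz, 2018 [24] | | | 0.86 | 83 | 79 | 77 | 0.89 | 0.87 |
| Lee, 2022 [23] | Overall | Total 200 (n = 100) | 87 | 62.9 | 79 | 68.7 | 0.915 | 0.748 |
| Akselrod-Balin, 2019 [22] | | | 87 | | 77.3 | | 0.91 | |
| Lauritzen, 2022 [21] | | | 69.7 | 70.8 | 98.6 | 98.1 | | |

Figure 2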

The sensitivity of artificial intelligence (AI)

https://www.polradiol.com/f/fulltexts/195520/PJR-90-195520-g002_min.jpg

Figure 3

The sensitivity of radiologists

https://www.polradiol.com/f/fulltexts/195520/PJR-90-195520-g003_min.jpg

As for specificity, the radiologist and AI groups performed similarly. The pooled specificity of AI was 0.89 (95% CI: [0.76–0.95]), ranging from 0.77 to 0.99 in 6 studies (Figure 4), whereas for the radiologists it was 0.90 (95% CI: [0.71–0.97]), with a range of 0.68 to 0.99 in 5 studies (Figure 5). Significant heterogeneity was detected in both cases (I² > 90%, p < 0.01).

Figure 4

The pooled specificity of artificial intelligence (AI)

https://www.polradiol.com/f/fulltexts/195520/PJR-90-195520-g004_min.jpg

Figure 5

The pooled specificity of radiologists

https://www.polradiol.com/f/fulltexts/195520/PJR-90-195520-g005_min.jpg

The pooled area under the curve (AUC) favoured AI, at 0.89 (95% CI: [0.86–0.92]) with significant heterogeneity (I² = 92%, p < 0.01) (Figure 6), compared to 0.82 (95% CI: [0.80–0.83]) for the radiologists, without significant heterogeneity (I² = 48%, p = 0.10) (Figure 7).

Figure 6

The area under the curve (AUC) for artificial intelligence

https://www.polradiol.com/f/fulltexts/195520/PJR-90-195520-g006_min.jpg

Figure 7

The area under the curve (AUC) for radiologists (Rad)

https://www.polradiol.com/f/fulltexts/195520/PJR-90-195520-g007_min.jpg

Discussion

This meta-analysis of 8 studies involving 120,950 patients compared the diagnostic performance of AI and radiologists. AI exhibited a higher pooled sensitivity (0.85 vs. 0.77 for radiologists) and similar specificity (0.89 vs. 0.90 for radiologists), with significant variability in both sensitivity and specificity across studies. AI also showed a higher pooled AUC (0.89 vs. 0.82 for radiologists), indicating better overall diagnostic accuracy despite this variability and suggesting that AI has the potential to surpass human radiologists in certain contexts.

Similarly, Al-Karawi et al. [3] suggest that AI techniques, particularly deep learning (DL), are increasingly applied in various medical fields, including breast imaging, exhibiting robust performance in tasks such as image recognition. This enhances prospects for applications in in vitro diagnosis, rehabilitation, medical imaging, and prognosis. They also found that despite challenges such as multitasking limitations, ongoing advancements in DL-based systems for breast imaging, such as digital breast tomosynthesis and ultrasound, show rapid development. These systems aid in detecting, classifying, and predicting breast diseases, thereby improving diagnostic efficiency and treatment efficacy. Furthermore, AI’s capabilities offer fast computation, repeatability, and objective data analysis, reducing medical professionals’ workload and enhancing diagnostic accuracy and treatment outcomes [2,8].

In line with these results, a study conducted by Zhang et al. [9] supports the use of AI for breast cancer diagnosis through dual-modal deep polynomial networks, which proved superior to other methods in enhancing breast tumour classification performance. This suggests that AI-based techniques can approach the accuracy of breast tumour biopsies, and the authors propose a dual-modal AI framework for diagnosis. Their experimental findings highlight dual-modal deep polynomial networks as the most effective framework, indicating AI’s potential for streamlined breast tumour classification.

Similar results were found by Cè et al. [10], who highlighted significant advancements in AI-based tools for personalised patient management. The integration of AI into clinical workflows is expected to benefit women, radiologists, and healthcare systems by improving diagnostic accuracy and streamlining workload. AI facilitates the development of personalised screening protocols and aids in early lesion detection, thereby reducing overdiagnosis risks.

Gilbert et al. [11] recommend that AI systems be used to mitigate challenges related to the development, expansion, or continuation of screening programs in countries with insufficient experienced breast radiologists. In such situations, AI systems could be utilised as either a primary or secondary reader, thereby mitigating the impact of radiologist shortages.

Conversely, recent research assessing differences in diagnostic performance indicates that radiologists’ interpretations of mammography are superior to AI analyses in clinical practice, particularly when considering clinical symptoms and prior mammograms. These findings have been supported by similar studies that explored the role of AI in clinical practices. AI should support rather than replace radiologists, especially in complex cases. Additionally, appropriate validation of AI scoring thresholds is essential for effective clinical use. AI-based image analysis differs from the comprehensive evaluation performed by radiologists, which includes consideration of clinical symptoms and comparisons with previous mammograms. Radiologists detected subtle or ambiguous findings that AI often missed. Instances were observed where AI failed to identify clearly visible lesions, indicating limitations in AI algorithms and suggesting that AI’s threshold scores may not reliably indicate malignancy, highlighting the need for refined score classifications [12-15].

Nevertheless, this meta-analysis highlights the potential of using AI algorithms alongside radiologists to enhance diagnostic accuracy and efficiency in breast cancer detection, provided that the limitations of AI are carefully considered and its performance is validated in clinical practice. Further research efforts should prioritise the refinement of AI algorithms, validation of their clinical utility, and exploration of optimal strategies for integrating AI into routine clinical workflows to maximise its potential impact on patient care.

Disclosures

  1. Institutional review board statement: Not applicable.

  2. Assistance with the article: None.

  3. Financial support and sponsorship: None.

  4. Conflicts of interest: None.