
Evaluation of the performance of generative artificial intelligence in generating radiology reports


Abstract: Objective To evaluate the performance of two generative artificial intelligence (AI) models in generating abdominal radiology reports and to compare it with that of radiologists. Methods The radiology reports of 300 patients who underwent abdominal CT and MRI examinations at the Third Affiliated Hospital of Sun Yat-sen University from June 2023 to May 2024 were retrospectively studied. Two generative AI models, ERNIE 4.0 and Claude 3.5 Sonnet, were used to regenerate radiology reports from the imaging findings of the 300 patients. Five radiologists rated the comprehensiveness, accuracy, expressiveness, hallucinations, and acceptance without revision of the impressions on a five-point Likert scale (1 = strongly disagree, 5 = strongly agree). The Friedman test with Nemenyi post-hoc tests was used to compare the performance of the two models and the radiologists. Results CT and MRI reports from 300 patients were evaluated. For comprehensiveness, Claude 3.5 Sonnet was on a par with the radiologists, and both were superior to ERNIE 4.0 (4.86 ± 0.37 vs. 4.76 ± 0.46 vs. 4.40 ± 0.64; first two, P = 0.200; each of the first two vs. the third, P < 0.01). For accuracy, the radiologists outperformed both ERNIE 4.0 and Claude 3.5 Sonnet (4.96 ± 0.22 vs. 4.66 ± 0.57 vs. 4.69 ± 0.57; first vs. each of the latter two, P < 0.01). For acceptance without revision, Claude 3.5 Sonnet was on a par with the radiologists, and both were superior to ERNIE 4.0 (4.64 ± 0.53 vs. 4.69 ± 0.54 vs. 4.30 ± 0.59; first two, P = 0.595; each of the first two vs. the third, P < 0.01). Expressiveness and hallucinations showed no statistically significant differences among the three (all P > 0.05). Conclusions Claude 3.5 Sonnet generated radiology reports comparable to those of radiologists, indicating that advanced generative AI has the potential to assist radiologists, improve work efficiency, and reduce cognitive burden.
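The statistical procedure named in the Methods (a Friedman omnibus test across matched raters, followed by Nemenyi post-hoc comparisons) can be sketched in pure Python. The `scores` table and sample size below are invented for illustration only; they are not the study's data (n = 300).

```python
import math

# Hypothetical five-point Likert ratings for the same six reports, one
# tuple per report: (ERNIE 4.0, Claude 3.5 Sonnet, radiologist).
# Invented data for illustration; NOT the study's ratings.
scores = [
    (4, 5, 5),
    (4, 4, 5),
    (3, 5, 5),
    (5, 5, 5),
    (4, 5, 4),
    (4, 4, 5),
]

def friedman_statistic(rows):
    """Friedman chi-square over n blocks (rows) and k treatments
    (columns), using mid-ranks for ties within each block
    (no tie correction, so the statistic is conservative)."""
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        srt = sorted(row)
        for j, v in enumerate(row):
            lo = srt.index(v) + 1          # first rank of this value
            hi = lo + srt.count(v) - 1     # last rank of this value
            rank_sums[j] += (lo + hi) / 2  # mid-rank handles ties
    q = 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
    return q, rank_sums

q, rank_sums = friedman_statistic(scores)
n, k = len(scores), len(scores[0])

# Nemenyi post-hoc: pairs whose mean-rank gap exceeds the critical
# difference CD are significantly different; q_alpha = 2.343 is the
# standard critical value for k = 3 groups at alpha = 0.05.
cd = 2.343 * math.sqrt(k * (k + 1) / (6 * n))
mean_ranks = [r / n for r in rank_sums]
print(f"Friedman Q = {q:.2f}, mean ranks = {mean_ranks}, CD = {cd:.2f}")
```

In practice the omnibus p-value would come from `scipy.stats.friedmanchisquare`, and the Nemenyi pairwise p-values from a post-hoc package such as scikit-posthocs; the hand-rolled version above only shows the mechanics of the ranking and the critical-difference rule.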
