Latest News


Concerns about the validity of AI diagnosis: Tohoku University reveals discrepancies with physicians' findings


The development of AI for medical image diagnosis is ongoing. However, discrepancies have been found between the areas that AI focuses on for diagnosis and the areas that doctors consider important. A research team led by Assistant Professor Yuwen Zeng, Part-time Lecturer Xiaoyong Zhang (currently with the Department of Integrated Engineering at the National Institute of Technology (KOSEN), Sendai College), and Professor Noriyasu Homma of the Department of Radiological Imaging and Informatics at Tohoku University Graduate School of Medicine has highlighted these discrepancies, raising concerns about the medical validity of AI-based image diagnosis. Further examination of these concerns and of countermeasures, including the development of new training methods, is expected to enhance the clinical safety of AI applications. The findings were published in the Journal of Imaging Informatics in Medicine.

To validate the reliability of artificial intelligence in medical image diagnosis, a visualization technique was used to extract the model's "focus areas" and evaluate how well these align with the "medical findings" annotated based on a radiologist's assessments of the same images.
Provided by Tohoku University

With advances in deep learning and related technologies, many reports have shown that AI achieves performance comparable to that of medical specialists in medical image diagnosis. However, the reliability of this performance and its applicability in real-world clinical settings remain under scrutiny. In particular, there has been little validation of how well the image features identified by deep learning models align with medical findings, raising concerns about potential discrepancies with doctors' diagnoses in clinical practice.

The research group took drowning diagnosis from postmortem computed tomography images in forensic medicine as a test case to examine the medical validity of a deep learning model reported to perform well in previous studies. Specifically, the image features that the model focused on were identified with visualization techniques and defined as its regions of interest. In addition, regions annotated on the basis of a radiologist's findings were defined as medically important regions and compared with the model's regions of interest.
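The first step described above, turning a model's continuous attention map into a discrete region of interest, can be sketched as follows. This is a hedged illustration only: the heatmap source (e.g. a Grad-CAM-style saliency map) and the top-25% threshold are assumptions for the example, not the settings used in the paper.

```python
import numpy as np

def heatmap_to_roi(heatmap, keep_fraction=0.25):
    """Binarize a saliency heatmap into a region-of-interest mask.

    Keeps the most strongly activated `keep_fraction` of pixels.
    Both the heatmap source (e.g. a Grad-CAM-style map) and this
    threshold are illustrative assumptions, not the paper's settings.
    """
    threshold = np.quantile(heatmap, 1.0 - keep_fraction)
    return heatmap >= threshold

# Toy 4x4 "heatmap": activation concentrated in the top-left corner.
heatmap = np.array([
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.6, 0.1, 0.0],
    [0.1, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])
roi = heatmap_to_roi(heatmap)
print(roi.sum())  # 4 of 16 pixels retained, all in the top-left corner
```

The resulting binary mask can then be compared pixel-by-pixel against a mask drawn from the radiologist's annotations.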

The results showed that the overlap between the model's regions of interest and the medically important regions was as low as 30% in the worst case. Even when the overlap reached approximately 80%, the location of greatest importance within the region differed. Given that the model examined in this study had been reported to classify drowning with over 90% accuracy in previous studies, the discrepancy between the model's internal evidence and the clinical medical findings was unexpectedly large.
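A minimal sketch of this comparison, assuming binary masks for the model's region of interest and the radiologist-annotated region. The pixel-overlap ratio and peak-position check used here are illustrative assumptions; the paper's exact matching metric may differ. The toy example shows the phenomenon reported above: masks can overlap heavily while the point of greatest importance falls outside the annotated region.

```python
import numpy as np

def overlap_percentage(model_roi, annotated_roi):
    """Percentage of the annotated region covered by the model's ROI.

    A simple pixel-overlap ratio, used here as a stand-in for the
    paper's matching measure.
    """
    intersection = np.logical_and(model_roi, annotated_roi).sum()
    return 100.0 * intersection / annotated_roi.sum()

def peak_position(heatmap):
    """(row, col) of maximum importance within a heatmap."""
    return np.unravel_index(np.argmax(heatmap), heatmap.shape)

# Toy case: the model's region covers the annotated one completely...
annotated = np.zeros((4, 4), dtype=bool)
annotated[0:2, 0:2] = True          # radiologist's region (4 pixels)

model_heat = np.zeros((4, 4))
model_heat[0:2, 0:3] = 0.5          # model attends to a wider area...
model_heat[1, 2] = 0.9              # ...and peaks outside the annotation
model_roi = model_heat > 0

print(overlap_percentage(model_roi, annotated))  # 100.0
print(peak_position(model_heat))                 # (1, 2): outside the annotated region
```

Even perfect coverage of the annotated region does not guarantee that the model's strongest evidence lies inside it, which is why overlap alone can overstate agreement.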

Previous studies have reported that deep learning models that reach conclusions from inappropriate criteria can produce unexpected misdiagnoses, and that their performance can degrade unexpectedly when they are applied to images of cases with different characteristics. The discrepancies revealed in this study underscore the need for alternative verification methods and highlight the challenges of assessing AI performance.

The results of this study make it apparent that a new training method is needed for models to acquire medically valid evidence. Current training methods often rely on a single evaluation criterion in the medical context for the sake of mathematical simplicity and feasibility; a training method that incorporates more multifaceted medical criteria is therefore anticipated. The research group is currently developing such a method for clinical application.

Journal Information
Publication: Journal of Imaging Informatics in Medicine
Title: Inconsistency between Human Observation and Deep Learning Models: Assessing Validity of Postmortem Computed Tomography Diagnosis of Drowning
DOI: 10.1007/s10278-024-00974-6

This article has been translated by JST with permission from The Science News Ltd. Unauthorized reproduction of the article and photographs is prohibited.
