News & Perspective

HKUMed unveils AI model achieving over 90% accuracy in thyroid cancer diagnosis

19 Jul 2025

This AI model aims to alleviate the burden on clinicians by automating the extraction of vital clinical information.3 It employs advanced large language model (LLM) strategies and classifies patients according to the 8th edition of the American Joint Committee on Cancer (AJCC) staging system and the American Thyroid Association (ATA) risk categories.3,4 By integrating multiple offline open-source LLMs, including Mistral, Gemma, Llama, and Qwen, the model streamlines data extraction, significantly reducing the manual effort needed to gather relevant information from unstructured notes. 
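The article does not say how the outputs of the individual LLMs are combined. As a rough illustration only, the sketch below assumes a simple majority vote over the risk category each model proposes for a report, with ties passed to a clinician. The function name `majority_vote`, the fallback label, and the example model outputs are all hypothetical and are not taken from the study.

```python
from collections import Counter

def majority_vote(predictions, fallback="needs human review"):
    """Return the label most models agree on; flag ties or empty input for review.

    Assumption: a plain majority vote. The study's actual ensemble
    strategies are not described in this article.
    """
    ranked = Counter(predictions).most_common()
    if not ranked:
        return fallback
    # A tie between the two most frequent labels is left to a clinician.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return fallback
    return ranked[0][0]

# Hypothetical ATA risk labels proposed by four models for one report.
votes = {
    "mistral": "intermediate",
    "gemma": "intermediate",
    "llama": "high",
    "qwen": "intermediate",
}
print(majority_vote(list(votes.values())))  # prints "intermediate"
```

Routing ties to a human reviewer rather than picking arbitrarily is in keeping with the researchers' point that AI-generated outputs still need verification.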

To validate their model, the research team sourced 339 semi-structured pathology reports from the public Cancer Genome Atlas-Thyroid Cancer (TCGA-THCA) database.3 They used 50 of these reports for model development and the remaining 289 for validation.3 The development set included both papillary and follicular thyroid carcinomas in proportions that reflect real-world cases.3 Based on the AJCC staging system, the 50 patients in the development set were categorized as stage I (n=31), stage II (n=15), stage III (n=2), and stage IVB (n=2).3 The ATA risk categories (low, intermediate, and high) were also evenly represented, and expert clinicians annotated the reports to identify the key entities needed for staging and risk classification.3

The model's performance was assessed using the F1-score, which combines precision and recall into a single measure.3 In the development phase, the ensemble strategies achieved F1-scores of 100% for ATA risk classification and 94.1% for AJCC staging.3 For the 289 validation cases from the TCGA-THCA dataset, the ensemble model maintained high performance, with F1-scores ranging from 95.2% to 95.5% for ATA risk and 98.1% for AJCC staging.3 Individual LLMs also performed well, with scores ranging from 88.5% to 99.7%.3 Additionally, the model was tested on 35 pseudo-clinical cases that mirrored real-world scenarios, yielding an F1-score of 88.5% for ATA risk categorization and F1-scores between 90.4% and 92.9% for AJCC staging.3

Despite these promising results, the researchers acknowledge some limitations.3 The model occasionally struggles to differentiate between microscopic and gross extra-thyroidal extension.3 Furthermore, the limited representation of patients with advanced-stage thyroid cancer in the development set may impact performance in these cases.3 Therefore, human verification of AI-generated outputs remains crucial.3

In conclusion, this study highlights the potential of AI, particularly lightweight LLMs, to enhance the extraction and classification of critical data from unstructured clinical notes.3 By enabling local deployment through its offline capability, the AI model protects patient privacy while improving the speed and consistency of thyroid cancer staging and risk assessment.3 This innovation represents a major step toward integrating AI into clinical workflows and sets the stage for broader applications in oncology and beyond.3
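For readers unfamiliar with the F1-score reported in this article, it is the harmonic mean of precision and recall. The minimal sketch below computes it for a single class from paired true and predicted labels. The function name `f1_for_label` and the toy labels are hypothetical, and the article does not state how the study averaged scores across classes, so no averaging is shown.

```python
def f1_for_label(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy example with made-up labels (not data from the study).
y_true = ["low", "low", "high", "high", "intermediate"]
y_pred = ["low", "high", "high", "high", "intermediate"]
precision, recall, f1 = f1_for_label(y_true, y_pred, "high")
print(round(precision, 3), round(recall, 3), round(f1, 3))  # prints 0.667 1.0 0.8
```

In this toy case every true "high" case is found (recall 1.0), but one "low" case is wrongly labeled "high", which lowers precision and therefore the F1-score.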
