Prescriptive method for optimizing cost of data collection and annotation in machine learning of clinical ultrasound

Lawley, Alistair and Hampson, Rory and Worrall, Kevin and Dobie, Gordon; (2023) Prescriptive method for optimizing cost of data collection and annotation in machine learning of clinical ultrasound. In: 2023 45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). IEEE, AUS. ISBN 9798350324471 (https://doi.org/10.1109/EMBC40787.2023.10340858)

Text. Filename: Lawley_etal_EMBC_2023_data_collection_and_annotation_in_machine_learning_of_clinical_ultrasound.pdf
Accepted Author Manuscript
License: Strathprints license 1.0

Abstract

Machine learning in medical ultrasound faces a major challenge: the prohibitive cost of producing and annotating clinical data. Optimizing data collection and annotation improves model training efficiency, reducing project cost and duration. This paper prescribes a two-phase method for cost optimization based on iterative accuracy/sample-size predictions, with active learning for annotation optimization.

Methods: Using public breast, fetal, and lung ultrasound datasets, we optimize data collection by statistically predicting the accuracy achievable at a desired dataset size, and optimize labeling efficiency using active learning, in which the predictions with the lowest certainty are labeled manually in a feedback loop. A practical case study on the BUSI data demonstrates the prescribed method.

Results: From small data subsets (~10%), the relation between dataset size and final accuracy can be predicted, with diminishing returns beyond 50% usage. Active learning reduced manual annotation by ~10% by focusing the annotation effort.

Conclusion: On the BUSI dataset, this led to cost reductions of 50%-66%, depending on requirements and the initial cost model, with a negligible accuracy drop of 3.75% from the theoretical maximum.

Clinical Relevance: This work provides a methodology to optimize dataset size and manual data labeling. This allows cost-effective datasets to be generated, which is of interest to all but particularly to financially limited trials and feasibility studies, and it reduces the time burden on annotating clinicians.
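
The two phases summarized in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not state which statistical model is used for the accuracy prediction, so the inverse power-law curve acc(n) = a - b * n^(-c) is an assumption, as are the function names `fit_learning_curve`, `predict_accuracy`, and `least_confident`. The selection step follows the abstract's description of sending the lowest-certainty predictions to manual labeling, with certainty taken here (again as an assumption) to be the largest predicted class probability.

```python
def fit_learning_curve(sizes, accuracies):
    """Fit acc(n) = a - b * n**(-c) to pilot results.

    Assumed model (not taken from the paper): an inverse power law.
    The exponent c is found by grid search; for each candidate c the
    remaining parameters a and b follow from ordinary least squares.
    """
    best = None
    mean_y = sum(accuracies) / len(accuracies)
    for step in range(1, 201):
        c = step / 100.0                      # candidate exponents 0.01 .. 2.00
        xs = [n ** (-c) for n in sizes]       # transformed predictor
        mean_x = sum(xs) / len(xs)
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
        slope = sxy / sxx
        a = mean_y - slope * mean_x           # asymptotic accuracy
        b = -slope
        sse = sum((a - b * x - y) ** 2 for x, y in zip(xs, accuracies))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


def predict_accuracy(params, n):
    """Extrapolate the fitted curve to a dataset of n samples."""
    a, b, c = params
    return a - b * n ** (-c)


def least_confident(probabilities, k):
    """Return indices of the k samples whose top class probability is lowest.

    These are the samples that would be routed to a human annotator.
    """
    confidence = [max(p) for p in probabilities]
    order = sorted(range(len(probabilities)), key=lambda i: confidence[i])
    return order[:k]


if __name__ == "__main__":
    # Synthetic pilot measurements for illustration only, not results from the paper.
    pilot_sizes = [50, 100, 200, 400]
    pilot_acc = [0.62, 0.70, 0.76, 0.80]
    params = fit_learning_curve(pilot_sizes, pilot_acc)
    print("predicted accuracy at n=800:", predict_accuracy(params, 800))

    # Synthetic class probabilities from a hypothetical classifier.
    probs = [[0.9, 0.1], [0.55, 0.45], [0.7, 0.3], [0.51, 0.49]]
    print("send to annotator:", least_confident(probs, 2))
```

In a workflow of this kind, the fitted curve would be refreshed as each new batch of data arrives, and collection would stop once the predicted gain from additional samples no longer justifies their cost under the project's own cost model.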