A cost focused framework for optimizing collection and annotation of ultrasound datasets

Lawley, Alistair and Hampson, Rory and Worrall, Kevin and Dobie, Gordon (2024) A cost focused framework for optimizing collection and annotation of ultrasound datasets. Biomedical Signal Processing and Control, 92. 106048. ISSN 1746-8094 (https://doi.org/10.1016/j.bspc.2024.106048)

Text. Filename: Lawley-etal-BSPC-2024-A-cost-focused-framework-for-optimizing-collection-and-annotation-of-ultrasound-datasets.pdf
Final Published Version
License: Creative Commons Attribution 4.0


Abstract

Machine learning for medical ultrasound imaging encounters a major challenge: the prohibitive cost of producing and annotating clinical data. The trade-off between cost and sample size is well understood in the context of clinical trials. These same methods can be applied to optimize the data collection and annotation process, ultimately reducing machine learning project cost and time in feasibility studies. This paper presents a two-phase framework for quantifying the cost of data collection, using iterative accuracy/sample size predictions and active learning to guide and optimize full human annotation in medical ultrasound imaging for machine learning purposes. The paper demonstrates potential cost reductions using public breast, fetal, and lung ultrasound datasets and a practical case study on breast ultrasound. The results show that, just as with clinical trials, the relationship between dataset size and final accuracy can be predicted, with the majority of accuracy improvements occurring using only 40-50% of the data, depending on the tolerance measure. Manual annotation can be reduced further using active learning, resulting in a representative cost reduction of 66% with a tolerance measure of around 4% accuracy drop from the theoretical maximum. The significance of this work lies in its ability to quantify how much additional data and annotation will be required to achieve a specific research objective. These methods are already well understood by clinical funders and so provide a valuable and effective framework for feasibility and pilot studies where machine learning will be applied within a fixed budget to maximize predictive gains, informing resourcing and further clinical study.
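
The accuracy/sample size prediction described in the abstract can be illustrated with a minimal sketch. The abstract does not state which curve model the authors use, so the inverse power law acc(n) = a - b * n^(-c) below is an assumption (a common choice for learning curves), and the function names `fit_power_law` and `samples_needed` are hypothetical. The pilot accuracies are synthetic values generated for illustration only; they are not results from the paper.

```python
import math


def fit_power_law(sizes, accs, c_grid=None):
    """Fit acc(n) = a - b * n**(-c) to pilot results.

    Assumed model, not taken from the paper. For each candidate exponent c
    the model is linear in a and b, so a and b come from ordinary least
    squares; the c with the smallest squared error wins.
    """
    if c_grid is None:
        c_grid = [k / 100 for k in range(5, 201)]  # candidates 0.05 .. 2.00
    m = len(sizes)
    mean_acc = sum(accs) / m
    best = None
    for c in c_grid:
        x = [n ** (-c) for n in sizes]
        mean_x = sum(x) / m
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_acc) for xi, yi in zip(x, accs))
        slope = sxy / sxx              # slope of acc against n**(-c), equals -b
        a = mean_acc - slope * mean_x  # intercept: predicted plateau accuracy
        b = -slope
        sse = sum((a - b * xi - yi) ** 2 for xi, yi in zip(x, accs))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


def samples_needed(a, b, c, tolerance):
    """Smallest n whose predicted accuracy is within `tolerance` of the plateau a.

    Solves b * n**(-c) <= tolerance for n; assumes b > 0 and c > 0.
    The plateau a is not needed for this calculation.
    """
    return math.ceil((b / tolerance) ** (1.0 / c))


# Synthetic pilot measurements (illustrative only, not data from the paper).
sizes = [100, 200, 400, 800, 1600]
accs = [0.9 - 1.5 * n ** -0.5 for n in sizes]

a, b, c = fit_power_law(sizes, accs)
n_needed = samples_needed(a, b, c, tolerance=0.04)

# Hypothetical per-image annotation cost, used only to turn n into a budget.
cost_per_sample = 2.50
projected_cost = n_needed * cost_per_sample
```

In a setting like the one the abstract describes, the fit would be repeated as each new batch of annotated data arrives, so that the projected dataset size and cost are updated iteratively. A tolerance of 0.04 mirrors the roughly 4% accuracy drop mentioned in the abstract; the active learning phase is not covered by this sketch.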