Towards high-performance deep learning architecture and hardware accelerator design for robust parameters analysis in diffuse correlation spectroscopy

Zang, Zhenya and Wang, Quan and Pan, Mingliang and Zhang, Yuanzhe and Chen, Xi and Li, Xingda and Li, David Day Uei (2025) Towards high-performance deep learning architecture and hardware accelerator design for robust parameters analysis in diffuse correlation spectroscopy. Computer Methods and Programs in Biomedicine, 258. 108471. ISSN 0169-2607 (https://doi.org/10.1016/j.cmpb.2024.108471)

Text. Filename: Zhang-etal-CMPB-2024-Towards-high-performance-deep-learning-architecture-and-hardware.pdf
Final Published Version
License: Creative Commons Attribution 4.0


Abstract

This study proposes a compact deep learning (DL) architecture and a highly parallelized computing hardware platform to reconstruct the blood flow index (BFi) in diffuse correlation spectroscopy (DCS). We leveraged a rigorous analytical model to generate autocorrelation functions (ACFs) to train the DL network. We assessed the accuracy of the proposed DL network using simulated and milk phantom data. On synthetic data, our lightweight DL architecture improves the mean squared error (MSE) by 66.7% for BFi and 18.5% for the coherence factor β, compared with convolutional neural networks (CNNs). We also investigated the accuracy of the relative BFi (rBFi) across different algorithms. With subsequent hardware implementation in mind, we further simplified the DL computing primitives by using subtraction for feature extraction. We extensively explored computing parallelism and fixed-point quantization within the DL architecture. Exploiting the DL model's compact size, we employed unrolling and pipelining optimizations for computation-intensive for-loops in the DL model while storing all learned parameters in on-chip BRAMs. We also achieved pixel-wise parallelism, enabling simultaneous, real-time processing of 10 and 15 autocorrelation functions on Zynq-7000 and Zynq-Ultrascale+ field-programmable gate arrays (FPGAs), respectively. Unlike existing FPGA accelerators that produce BFi and β from autocorrelation functions on standalone hardware, our approach is an encapsulated, end-to-end on-chip process that converts intensity photon data to the temporal intensity ACF and subsequently reconstructs BFi and β. This hardware platform provides an on-chip solution that replaces post-processing and helps miniaturize modern DCS systems that use single-photon cameras. We also comprehensively compared the computational efficiency of our FPGA accelerator to CPU and GPU solutions.
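
To make the first stage of the pipeline concrete, the following is a minimal sketch of how a normalized temporal intensity ACF, g2(τ), could be estimated from binned photon counts, followed by rounding to a fixed-point grid. It is not the authors' implementation: the function names `intensity_acf` and `quantize_fixed_point`, the input format (a list of counts per time bin), the symmetric normalization, and the choice of 8 fractional bits are all assumptions made for illustration.

```python
import random


def intensity_acf(counts, max_lag):
    """Estimate the normalized intensity ACF g2 at lags 1..max_lag (in bins).

    counts: photon counts per time bin (hypothetical input format).
    Assumes the counts are not all zero in any overlapping segment.
    """
    n = len(counts)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the number of bins")
    g2 = []
    for lag in range(1, max_lag + 1):
        m = n - lag  # number of overlapping bin pairs at this lag
        # Mean lagged product <n(t) * n(t + lag)>
        prod = sum(counts[i] * counts[i + lag] for i in range(m)) / m
        # Symmetric normalization by the means of the two overlapping segments
        mean_head = sum(counts[:m]) / m
        mean_tail = sum(counts[lag:]) / m
        g2.append(prod / (mean_head * mean_tail))
    return g2


def quantize_fixed_point(x, frac_bits=8):
    """Round x to the nearest multiple of 2**-frac_bits (illustrative only)."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


if __name__ == "__main__":
    random.seed(0)
    # Synthetic, uncorrelated counts; real DCS data would show decaying g2.
    counts = [random.randint(0, 10) for _ in range(2000)]
    acf = intensity_acf(counts, max_lag=16)
    acf_q = [quantize_fixed_point(v, frac_bits=8) for v in acf]
    print(acf_q[:4])
```

For uncorrelated counts the estimate stays close to 1 at every lag, while a constant input gives exactly 1. In a DCS measurement the curve would instead decay from roughly 1 + β towards 1, which is the shape a reconstruction network takes as input.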

ORCID iDs

Zang, Zhenya, Wang, Quan, Pan, Mingliang, Zhang, Yuanzhe, Chen, Xi, Li, Xingda and Li, David Day Uei ORCID: https://orcid.org/0000-0002-6401-4263