A Deep Learning Approach for Quantifying Vocal Fold Dynamics during Connected Speech Using Laryngeal High-Speed Videoendoscopy

Authors	Yousef, Ahmed M.; Deliyski, Dimitar D.; Zacharias, Stephanie R. C.; de Alarcon, Alessandro; Orlikoff, Robert F.; Naghibolhosseini, Maryam
Title	A Deep Learning Approach for Quantifying Vocal Fold Dynamics during Connected Speech Using Laryngeal High-Speed Videoendoscopy
Source	In: Journal of Speech, Language, and Hearing Research, 65 (2022) 6, pp. 2098-2113 (16 pages)
Additional information	ORCID (Naghibolhosseini, Maryam)
Language	English
Document type	print; online; journal article
ISSN	1092-4388
Keywords	Voice Disorders; Speech; Video Technology; Equipment; Automation; Artificial Intelligence; Ohio (Cincinnati); Speaking
Abstract	Purpose: Voice disorders are best assessed by examining vocal fold dynamics in connected speech. This can be achieved using flexible laryngeal high-speed videoendoscopy (HSV), which enables us to study vocal fold mechanics with high temporal details. Analysis of vocal fold vibration using HSV requires accurate segmentation of the vocal fold edges. This article presents an automated deep-learning scheme to segment the glottal area in HSV from which the glottal edges are derived during connected speech. Method: Using a custom-built HSV system, data were obtained from a vocally healthy participant reciting the "Rainbow Passage." A deep neural network was designed for glottal area segmentation in the HSV data. A recently introduced hybrid approach by the authors was utilized as an automated labeling tool to train the network on a set of HSV frames, where the glottis region was automatically annotated during vocal fold vibrations. The network was then tested against manually segmented frames using different metrics, intersection over union (IoU), and Boundary F1 (BF) score, and its performance was assessed on various phonatory events on the HSV sequence. Results: The designed network was successfully trained using the hybrid approach, without the need for manual labeling, and tested on the manually labeled data. The performance metrics showed a mean IoU of 0.82 and a mean BF score of 0.96. In addition, the evaluation assessment of the network's performance demonstrated an accurate segmentation of the glottal edges/area even during complex nonstationary phonatory events and when vocal folds were not vibrating, thus overcoming the limitations of the previous hybrid approach that could only be applied to the vibrating vocal folds. Conclusions: The introduced automated scheme guarantees accurate glottis representation in challenging color HSV data with lower image quality and excessive laryngeal maneuvers during all instances of connected speech. This facilitates the future development of HSV-based measures to assess the running vibratory characteristics of the vocal folds in speakers with and without voice disorder. (As Provided).
Notes	American Speech-Language-Hearing Association. 2200 Research Blvd #250, Rockville, MD 20850. Tel: 301-296-5700; Fax: 301-296-8580; e-mail: slhr@asha.org; Web site: http://jslhr.pubs.asha.org
Recorded by	ERIC (Education Resources Information Center), Washington, DC
Update	2024/1/01
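
The abstract reports segmentation quality as intersection over union (IoU) between the network's glottal masks and manually segmented frames. As a rough illustration of that metric only, the sketch below computes IoU for two binary masks with NumPy. It is not the authors' evaluation code: the function name `iou`, the toy 4x4 masks, and the convention of returning 1.0 when both masks are empty (for example, a fully closed glottis) are assumptions made for this example.

```python
import numpy as np


def iou(pred_mask, true_mask):
    """Intersection over union of two binary masks of the same shape.

    Hypothetical helper for illustration; not taken from the article.
    """
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    if pred.shape != true.shape:
        raise ValueError("masks must have the same shape")
    union = np.logical_or(pred, true).sum()
    if union == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    intersection = np.logical_and(pred, true).sum()
    return float(intersection / union)


# Toy frame: the predicted region covers 4 pixels, the manual one 6 pixels,
# and the prediction lies entirely inside the manual region.
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:3] = True
true = np.zeros((4, 4), dtype=bool)
true[1:3, 1:4] = True

score = iou(pred, true)  # 4 shared pixels / 6 pixels in the union
```

A mean IoU such as the 0.82 reported in the abstract would then be the average of this per-frame score over the manually labeled test frames.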