UKSpeech-2023

Introduction

Parkinson’s disease (PD) is a condition in which parts of the brain become progressively damaged over many years, and its incidence is projected to continue to rise.

Where neurological damage affects the quality of speech, communication becomes more difficult, which can lead to withdrawal and social isolation.

The aim of the research is to help people living with PD whose speech has been affected to be understood, using Artificial Intelligence to convert their speech into text that can be displayed to the listener.

The research is focused on voice recordings from participants diagnosed with PD, which are used to re-train an existing artificial intelligence language model. The re-trained model is then used to convert speech to text, with additional interpretation from Natural Language Processing (NLP).

Research description

In June 2022, a call for participants was launched through Parkinson’s UK for face-to-face recordings.

In January 2023, a further call for participants was launched, this time including online recordings via Zoom. Because sampled speech has to be converted to 16 kHz for re-training existing language models, using Zoom recordings was expected to have minimal effect on Automatic Speech Recognition (ASR), while significantly increasing the number of participants able to contribute.
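
The conversion to 16 kHz can be sketched as below. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the function names are hypothetical, the audio is assumed to be already decoded into lists of float samples, and plain linear interpolation is used for simplicity. A production pipeline would normally use a band-limited resampler with anti-alias filtering.

```python
def stereo_to_mono(left, right):
    """Average two channels (e.g. a matched pair of microphones) into one."""
    return [(l + r) / 2.0 for l, r in zip(left, right)]


def resample_linear(samples, src_rate, dst_rate=16000):
    """Resample a list of float samples by linear interpolation.

    Assumption: linear interpolation is a simplification chosen to keep
    the sketch dependency-free; it applies no anti-alias filter.
    """
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sample rates must be positive")
    if not samples:
        return []
    if src_rate == dst_rate:
        return list(samples)

    n_out = len(samples) * dst_rate // src_rate
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate   # position in the source signal
        lo = int(pos)                   # sample at or before that position
        hi = min(lo + 1, last)          # next sample, clamped at the end
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out
```

For a 48 kHz stereo recording, the two channels would first be passed through `stereo_to_mono` and the result through `resample_linear(mono, 48000)`.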

The research methodology involves gathering sampled speech using matched pair microphones. Four recording stages are used for each participant:

  1. Standard text passages[1], often used by speech therapists for linguistic analysis, are read. This assists with labelling the speech files. 
  2. ‘Free speech’ on a subject chosen by each participant. 
  3. Selection of short sentences[2]. 
  4. Vocal exercises[3] interspersed with short read passages. 

Previous research at King’s College London[4] and at Università degli Studi di Bari ’Aldo Moro’[5] helped focus the selection of reading material used during speech recordings:

  1. North wind and the sun
  2. BNC Tech Eng Computer applications in geography
  3. Grandfather passage
  4. The Rainbow passage – Fairbanks
  5. Comma gets a cure – Jill McCullough & Barbara Somerville
  6. Arthur the Rat

Where possible, recordings are made at a specific time of the day when participants find their speech most affected. This is often late afternoon before the next dose of medication is due.

ASR Toolkit

The Nvidia ASR toolkit was selected as the platform for the research because of its low Word Error Rate (WER), that is, its higher percentage of accuracy. The recorded speech, together with its labels, is used to fine-tune or re-train an existing acoustic language model.
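
As an illustration of how labelled recordings might be prepared for fine-tuning, the sketch below writes a JSON-lines manifest with one entry per recording. It assumes the toolkit is NVIDIA NeMo, whose ASR manifests use the fields `audio_filepath`, `duration` and `text`; the source does not name the toolkit more precisely, and the helper names here are hypothetical.

```python
import json
import wave


def manifest_entry(wav_path, transcript):
    """Build one manifest record for a labelled WAV recording."""
    with wave.open(str(wav_path), "rb") as wav:
        # Duration in seconds = number of frames / frames per second.
        duration = wav.getnframes() / wav.getframerate()
    return {
        "audio_filepath": str(wav_path),
        "duration": round(duration, 3),
        "text": transcript.strip(),
    }


def write_manifest(pairs, manifest_path):
    """Write (wav_path, transcript) pairs as one JSON object per line."""
    with open(manifest_path, "w", encoding="utf-8") as out:
        for wav_path, transcript in pairs:
            out.write(json.dumps(manifest_entry(wav_path, transcript)) + "\n")
```

Reading standard passages makes this step easier, because the transcript for each file is largely known in advance and only needs checking against what was actually said.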

Results

An initial assessment of whether existing language models can perform ASR accurately is carried out by passing recorded samples through a local instance of an Nvidia-trained language model.

This incorporates NLP, which then tries to turn the recording into natural language (see the examples).

The figure below shows stereo samples of a participant’s speech, significantly impacted as a result of neurological damage, which resulted in a high WER.

Example 1 – participant recorded speech [Grandfather passage] is passed through the Nvidia language model:

Manually transcribed speech: 

He dresses himself in an old black frock coat, usually several buttons, and with a thick hood. A long bread clings to his chin, giving those who observe him a pronounced feeling of the utmost respect. 

Nvidia ASR transcribed speech:

He dresses himself in a black coat. Usually several buttons. I’m gonna think Hood. Along vehicle in his cha change. Him pronounced feeling of most respect.

Example 2 – participant recorded speech passed through an existing Nvidia trained language model:

Manually transcribed speech: 

One night the rats heard a loud, a loud noise in the loft. It was a very dreary old place. The roof let the rain come washing it in, washing the beams and rafters had all rotted through, so that the whole thing was quite unsafe.

Nvidia ASR converted speech:

On that. Loud noise in the loft. It the only ju place the roof about the rain, com in, walking in, walking the reams and ras at all right through. The whole thing was quite unsafe.
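
The WER for a pair of transcripts like those above can be computed as the word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the manual transcription. The sketch below is a minimal version of that calculation. The normalisation step (lower-casing and dropping punctuation) is an assumption, as the source does not say how the transcripts were prepared for scoring.

```python
import re


def normalise(text):
    """Lower-case the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower().replace("’", "'"))


def word_error_rate(reference, hypothesis):
    """Return WER = (substitutions + deletions + insertions) / reference words."""
    ref = normalise(reference)
    hyp = normalise(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Here the manually transcribed speech would be passed as `reference` and the Nvidia ASR output as `hypothesis`. Because insertions count as errors, the value can exceed 1.0 when the ASR output is much longer than the reference.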

Re-training a speech language model with additional audio samples and their associated labels significantly reduces the WER, and therefore improves the accuracy of the ASR.

For participants whose speech still converts with a low WER, it was still considered worthwhile to record them using the same techniques. The data provides a reference point to compare with future recordings, which helps with predicting how their speech may deteriorate in the future. That prediction could then be used to re-train the language model, keeping the WER low, without needing further speech samples.
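
One simple way to compare later recordings against a baseline is to fit a straight line to a participant's WER over time. The sketch below is only an illustration of that idea: the linear trend, the function names and the use of days since the first recording are all assumptions, and the project has not stated how deterioration will be modelled.

```python
def wer_trend(sessions):
    """Least-squares line through (days_since_baseline, wer) points.

    Returns (slope, intercept), where slope is the change in WER per day.
    Assumption: a straight line is a deliberately naive model.
    """
    if len(sessions) < 2:
        raise ValueError("at least two sessions are needed to fit a trend")
    days = [float(d) for d, _ in sessions]
    wers = [float(w) for _, w in sessions]
    mean_day = sum(days) / len(days)
    mean_wer = sum(wers) / len(wers)
    sxx = sum((d - mean_day) ** 2 for d in days)
    if sxx == 0:
        raise ValueError("sessions must be recorded on different days")
    sxy = sum((d - mean_day) * (w - mean_wer) for d, w in zip(days, wers))
    slope = sxy / sxx
    return slope, mean_wer - slope * mean_day


def projected_wer(sessions, day):
    """Extrapolate the fitted line to a future day (never below zero)."""
    slope, intercept = wer_trend(sessions)
    return max(0.0, slope * day + intercept)
```

A positive slope would suggest that the existing model is coping less well with the participant's speech over time, which is the kind of signal a baseline recording makes it possible to detect.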

Examples of visual recording of phrases read between exercises:

Conclusion & impact

The project is still in the data collection phase. Early results illustrate that neurological changes resulting from PD affect the intelligibility of people’s speech in different ways, and that existing language models are not able to accurately convert this speech to text.

We are currently looking to expand links with speech therapists and other support groups to obtain additional speech recordings from people whose speech is significantly impacted by PD and other neurological conditions. This will help expand the number of eligible participants and therefore help re-train a more diverse language model.

An unexpected outcome of recording the vocal exercises was that participants were motivated to engage with regular exercise after discussing the visual and auditory speech samples.

This is another area I am keen to explore with a goal of developing other AI tools to form part of the speech therapy toolkit. 

How you can help

Volunteers

We’re looking for speech therapists who can signpost potential volunteers with PD whose speech is significantly affected. If you can help us increase the number of speech samples, please get in touch.

Re-training models by predicting speech deterioration

We are looking to work with other researchers to investigate predicting changes in re-trained language models as a result of speech deterioration. Please get in touch if this is of interest.

References

1. Standard linguistic resources provided by the University of York for students:

https://www.york.ac.uk/media/languageandlinguistics/documents/currentstudents/linguisticsresources/Standardised-reading.pdf

2. Short sentences taken from the Google Euphonia project:

https://sites.research.google/euphonia/about/

3. Speech exercise sheet developed by the Parkinson’s Foundation:

https://www.protocolit.co.uk/wp-content/uploads/2023/06/Exercising-Your-Speech-and-Voice-System.pdf

4. Mobile Device Voice Recordings at King’s College London (MDVR-KCL) from both early and advanced Parkinson’s disease patients and healthy controls, 17 May 2019.

H. Jaeger, D. Trivedi, M. Stadtschnitzer. [DOI: 10.5281/zenodo.2867216]

5. Assessment of Speech Intelligibility in PD Using an STT System, IEEE, ISSN: 2169-3536, 17 October 2017.

G. Dimauro, V. di Nicola, V. Bevilacqua, D. Caivano, and F. Girardi. [DOI: 10.1109/ACCESS.2017.2762475]

Conference poster

Please click the link to view the poster from the UK Speech 2023 conference.