Speech Analysis

Unmet needs in areas requiring machine learning


The leading cause of speech loss in adults is stroke. A stroke happens roughly every 45 seconds across all people in the world. A stroke causes a focal lesion, which can occur in any area of the brain. When it affects Broca’s area (responsible for speech production) or Wernicke’s area (responsible for language comprehension), speech is impaired.


A major shortfall in speech-language therapy is in assessing and helping patients with autism.

Regardless of the medical problem, sensors and sensor analysis should be leveraged

As with other areas of behavioral health, there is a need for sensors for speech language therapy.

Speech signals carry a lot of temporal data, so signal processing is helpful for gleaning insights from them.

A common thread among these signals is that they have a frequency spectrum. Examples include the speech waveforms themselves, as well as signals from accelerometers that measure the movement of the face and other body parts as speech is produced.

Current state of the art in sensor analysis for speech still has opportunity for improvement

The current state of the art in sensor analysis for segmenting and predicting speech patterns in patients with autism hovers around average precision values of 0.6 to 0.7. Speech is often erratic and is recorded as time series with missing values, so this is not terribly surprising.

Below, we’ll describe how speech is typically analyzed today.

Simply put, systems process signals. An example is a low-pass filter, where you take a messy signal with a lot of local variation and smooth it out. Here is an example from cleaning an ECG signal.

[Figure: example of cleaning an ECG signal]
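To make the low-pass idea concrete, here is a minimal sketch that smooths a noisy signal with a centered moving average, one of the simplest low-pass filters. The test signal, the sample rate, the window size, and the `roughness` helper are all assumptions made for illustration; this is not the filter used in the ECG example above.

```python
import math
import random


def moving_average(signal, window=5):
    """Smooth a signal with a centered moving average (a simple low-pass filter)."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        # Average over the samples that actually exist near the edges.
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


def roughness(signal):
    """Mean absolute difference between neighboring samples (hypothetical helper)."""
    return sum(abs(b - a) for a, b in zip(signal, signal[1:])) / (len(signal) - 1)


# Hypothetical test signal: a slow 1 Hz sine wave plus Gaussian noise.
random.seed(0)
sample_rate = 100  # samples per second (assumed)
clean = [math.sin(2 * math.pi * 1.0 * n / sample_rate) for n in range(300)]
noisy = [x + random.gauss(0, 0.3) for x in clean]
smoothed = moving_average(noisy, window=9)

print(f"roughness before: {roughness(noisy):.3f}")
print(f"roughness after:  {roughness(smoothed):.3f}")
```

A wider window removes more local variation but also blurs fast changes that may matter, so the window size is a trade-off rather than a fixed choice.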

This is a typical signal processing task, and it is useful for cleaning a signal before feeding it into a machine learning algorithm that learns patterns and makes predictions on the signal. Another common tool is the fast Fourier transform (FFT), which converts the signal into a frequency spectrum in which unwanted frequency components can be identified and removed. Taking the squared magnitude of that spectrum gives a power spectrum, which can be used for follow-on analysis.
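As a rough illustration of the frequency-domain view, the sketch below computes a discrete Fourier transform and a power spectrum using only the standard library. The naive `dft` function is O(n²) and stands in for a real FFT, which computes the same result faster (for example `numpy.fft.rfft` in practice). The two-tone test signal, the sample rate, and the division by `n` as a normalization are assumptions for the example.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform; an FFT gives the same values faster."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def power_spectrum(signal):
    """Squared magnitude of each frequency bin, from 0 Hz up to the Nyquist bin."""
    n = len(signal)
    spectrum = dft(signal)
    # Dividing by n is one common normalization convention (assumed here).
    return [abs(c) ** 2 / n for c in spectrum[: n // 2 + 1]]


# Hypothetical signal: a 5 Hz tone plus a weaker 12 Hz tone, sampled at 64 Hz.
sample_rate = 64
n = 64
signal = [
    math.sin(2 * math.pi * 5 * t / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 12 * t / sample_rate)
    for t in range(n)
]

power = power_spectrum(signal)
freqs = [k * sample_rate / n for k in range(len(power))]
peak = max(range(len(power)), key=lambda k: power[k])
print(f"strongest frequency: {freqs[peak]} Hz")
```

The bins with the most power show where the signal's energy sits, which is the information a filter design or a downstream model would typically use.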

Another technique you may apply after cleaning or smoothing the signal with a band-pass filter or FFT is segmentation. In the example below, the signal from an audio recording of a child talking was extracted and segmented, allowing researchers to see which types of utterances or sounds the child is producing. Machine learning algorithms can take these segments and use them as predictive features for specific outcomes relevant to the person speaking.

[Figure: segmented audio signal of a child talking]
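One simple way to segment audio, shown here only as a sketch, is to split the signal into fixed-size frames, compute the energy of each frame, and treat runs of high-energy frames as sound. This is an assumed stand-in technique, not necessarily the method behind the example above; the frame size, the threshold, and the synthetic signal are all hypothetical.

```python
import math


def short_time_energy(signal, frame_size):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in signal[i : i + frame_size]) / frame_size
        for i in range(0, len(signal) - frame_size + 1, frame_size)
    ]


def segment_by_energy(signal, frame_size, threshold):
    """Return (start_sample, end_sample) pairs for runs of frames above threshold."""
    energies = short_time_energy(signal, frame_size)
    segments = []
    start = None
    for idx, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = idx * frame_size  # a sound segment begins
        elif energy <= threshold and start is not None:
            segments.append((start, idx * frame_size))  # the segment ends
            start = None
    if start is not None:
        # The signal ended while a segment was still open.
        segments.append((start, len(energies) * frame_size))
    return segments


def tone(length):
    """Hypothetical 'utterance': a sine wave with a period of 10 samples."""
    return [math.sin(2 * math.pi * 0.1 * t) for t in range(length)]


# Synthetic recording: silence, sound, silence, sound, silence.
recording = [0.0] * 100 + tone(200) + [0.0] * 100 + tone(100) + [0.0] * 100
segments = segment_by_energy(recording, frame_size=50, threshold=0.1)
durations = [end - start for start, end in segments]
print(segments, durations)
```

Per-segment values such as duration or mean energy could then serve as features for a machine learning model, although real recordings would need a noise-aware threshold.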

Typically, the speech signals are decomposed into sinusoidal components. These components can then be used for follow-on analysis, for example as predictive features in machine learning models that classify the speaker’s sentiment or intentions.
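The sketch below shows one way the sinusoidal components of a segment might be turned into features: it estimates the frequency and amplitude of the strongest sinusoids with a naive discrete Fourier transform. The function name `top_sinusoids`, the choice of `k`, and the synthetic segment are assumptions for illustration, and a production system would use an FFT library rather than this O(n²) loop.

```python
import cmath
import math


def top_sinusoids(signal, sample_rate, k=2):
    """Return the k strongest (frequency_hz, amplitude) pairs in the signal."""
    n = len(signal)
    components = []
    for b in range(1, n // 2):  # skip the DC bin and the Nyquist bin
        coeff = sum(
            signal[t] * cmath.exp(-2j * cmath.pi * b * t / n) for t in range(n)
        )
        # For a real signal, a sinusoid of amplitude A yields |coeff| = A * n / 2.
        components.append((b * sample_rate / n, 2 * abs(coeff) / n))
    components.sort(key=lambda pair: pair[1], reverse=True)
    return components[:k]


# Hypothetical segment: a 5 Hz tone (amplitude 1.0) plus a 12 Hz tone (amplitude 0.5).
sample_rate = 64
segment = [
    math.sin(2 * math.pi * 5 * t / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 12 * t / sample_rate)
    for t in range(64)
]

features = top_sinusoids(segment, sample_rate, k=2)
for freq, amp in features:
    print(f"{freq:.1f} Hz, amplitude {amp:.2f}")
```

The resulting list of frequency and amplitude pairs can be flattened into a fixed-length feature vector for a classifier.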