Domain-Adaptive Speech Transcription and Comprehension

Automated Speech Recognition using Nvidia's Citrinet Model, deployed on Nvidia Riva

For this project, I was tasked with developing and deploying automated speech recognition using Nvidia's Citrinet models, specifically for the medical terms and phrases used by doctors and staff, as part of my Final Year Project in poly. This was done in 2022.

I was also tasked with integrating the newly developed models into the existing Angular web application.

Training/Fine-tuning of Citrinet

  1. Generating tokenizer from wordlist

    1. Citrinet uses sub-word tokenization, so a tokenizer has to be generated before the model can be trained at all

  2. Training/Finetuning

    1. The main challenge here was tuning the hyperparameters for the best results. Since I was fine-tuning most of the time, I used a small batch size and a low learning rate, although this made training more time-consuming

    2. Other significant arguments include spec_augment, which can be reduced to further speed up convergence

    3. Evaluation of performance via Tensorboard

      1. As the model is being fine-tuned/trained, you can see how the word error rate changes over the epochs, giving you a good idea of where/what is going wrong

      Plotting graphs with Tensorboard

    4. For further evaluation, Nvidia provides a Python script for evaluating an ASR model with a variety of decoding modes.

      • Greedy WER/CER is the error rate the NeMo model achieves with the greedy decoder

      • WER/CER with beam search decoding and N-gram model is the error rate the NeMo model achieves with a language model and the Flashlight decoder

      • Oracle WER/CER is the best-case error rate for beam search decoding with the N-gram model

    My best evaluated result

Most of these scripts can be found in Nvidia's NeMo repository here
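To make the sub-word tokenization step from the training list more concrete, here is a minimal sketch in plain Python of greedy longest-match segmentation against a tiny hand-written vocabulary. It only illustrates the general idea: the `tokenize` helper, the `VOCAB` set, and the example words are made up for this post, and this is not how the NeMo tokenizer script or Citrinet's actual tokenizer works internally.

```python
def tokenize(word, vocab):
    """Split `word` into the longest vocabulary pieces, left to right.

    Characters not covered by any piece fall back to single-character tokens.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until a piece matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No piece matched: emit the single character as its own token.
            pieces.append(word[i])
            i += 1
    return pieces


# Hypothetical sub-word vocabulary, for illustration only.
VOCAB = {"hyper", "hypo", "tens", "ion", "glyc", "emia"}

print(tokenize("hypertension", VOCAB))
print(tokenize("hypoglycemia", VOCAB))
```

The point is that rare medical words do not need their own vocabulary entry as long as they can be assembled from smaller pieces, which is why the tokenizer has to exist before training starts.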

The best result I obtained was 11.3% (the goal was below 10%).
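As a reminder of how the WER/CER figures discussed in this section are computed: the error rate is the edit distance (substitutions, deletions, insertions) between the reference transcript and the model output, divided by the length of the reference, counted in words for WER and in characters for CER. Below is a minimal sketch in plain Python. The `edit_distance`, `wer`, and `cer` helpers and the sample sentences are my own illustration and are not taken from NeMo's evaluation script.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Splitting one medical term into two words costs one substitution
# plus one insertion against a four-word reference.
print(wer("the patient has hypertension", "the patient has hyper tension"))
```

Note that because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.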

Deployment and Integration

Below is a brief outline of the steps I took

  1. Starting and Setting up Riva Server

  2. Integration into existing web application

Below is a diagram of the web application stack that we were using.

Cool Diagram

Web Application Stack Overview

  • Frontend (Client)

    • AngularJS (HTML, CSS, JS)

      • frontend framework

  • Backend

    • ExpressJS & NodeJS

      • used for app routing to various CRUD operations

    • Flask & Python

      • used for various API calls

        • Riva automatic speech transcription service (ASR)

        • BERT classification models from hugging face (NLP)

    • Cosmos DB

      • hosts the MongoDB database in the cloud, which stores user information & file information (e.g. filenames, IDs)

    • Azure Blob Storage (File storage)

      • stores files in the cloud (e.g. audio recordings, PDFs, transcripts)

Some of the challenges I faced during deployment & integration

  • Streaming the audio from Angular to Python with minimal latency

    • Use of Socket.IO for low-latency bidirectional streaming and appropriate buffering

  • Hard-to-read medical terms still being recognized as garbage

    • Use of word boosting to make these medical terms more likely to be recognized

    • Likewise, negative word boosting can be used to make certain words less likely to be recognized

    • More information can be found here

  • Making the data presentable

    • Recognizing key phrases that appear in every transcript (e.g. Past medical history) and adding the appropriate tags (e.g. <h2>)
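The key-phrase tagging described in the last bullet can be sketched roughly as follows. This is a minimal illustration in plain Python, not the project's actual code: the `tag_headings` helper is hypothetical, and apart from "Past medical history" the phrases in `SECTION_PHRASES` are assumed examples.

```python
import html
import re

# Only "Past medical history" comes from the project; the rest are assumed.
SECTION_PHRASES = ["Past medical history", "Presenting complaint", "Medications"]


def tag_headings(transcript, phrases=SECTION_PHRASES, tag="h2"):
    """Wrap known section phrases in `tag` so they render as headings."""
    # Escape first so raw recognizer output cannot inject markup.
    escaped = html.escape(transcript)
    # Longest phrases first so overlapping phrases prefer the longer match.
    ordered = sorted((html.escape(p) for p in phrases), key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in ordered) + r")\b",
        re.IGNORECASE,
    )
    # Keep the matched text as transcribed; only add the surrounding tags.
    return pattern.sub(lambda m: f"<{tag}>{m.group(1)}</{tag}>", escaped)


print(tag_headings("past medical history diabetes and asthma"))
```

Matching is case-insensitive here because the casing of recognizer output may not match the phrase list; whether that is the right choice depends on how the transcripts are post-processed.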

Combined with my teammates' NLP features, here is a snippet of the final product! (sadly no demo video 😭)

Reflections

This was a hard project, mainly due to the tight time constraint (<3 months), sparse documentation at the time, and lastly a skill issue on my part. However, I was quite satisfied that my teammates and I were able to deliver a good working implementation of both the NLP and ASR models in an integrated web application, albeit not the prettiest one. This was also one of the few projects I did that was totally unrelated to cybersecurity, and it exposed me to the world of AI/ML, so for that I am grateful.

Maybe in the near future, I would like to revisit this topic.

Here are my final presentation slides, special shoutout to my teammates who made working on the project much more enjoyable 😎

You can see a bad resolution image of the whole application over here.

sigh
