Domain-Adaptive Speech Transcription and Comprehension

Automated Speech Recognition using Nvidia's Citrinet Model, deployed on Nvidia Riva

For this project, I was tasked to develop and deploy automated speech recognition via Nvidia's Citrinet models, tuned specifically for medical terms and phrases used by doctors and staff, as part of my Final Year Project in poly. This was done in 2022.

I was also tasked to integrate the newly developed models into the existing Angular web application.

Training/Fine-tuning of Citrinet

  1. Generating tokenizer from wordlist

    1. Citrinet uses sub-word tokenization, which means a tokenizer has to be generated before the model can even be trained
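As a toy illustration of why sub-word tokenization matters here (this is a greedy longest-match segmenter I sketched for this write-up, not the actual SentencePiece tokenizer that NeMo generates), a rare medical word can still be covered by smaller pieces that the model has seen:

```python
# Toy greedy sub-word tokenizer, purely for illustration.
# The real tokenizer is a SentencePiece model built from the wordlist.
def subword_tokenize(word, vocab):
    """Greedy longest-match segmentation of `word` into sub-word units."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest vocab entry that matches at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

vocab = {"bron", "chi", "tis", "itis", "card", "io"}
print(subword_tokenize("bronchitis", vocab))  # ['bron', 'chi', 'tis']
```

This is why medical terms absent from the tokenizer's training text can still be emitted by the model, as long as their pieces exist in the sub-word vocabulary.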

  2. Training/Finetuning

    1. The main challenge here was adjusting the hyperparameters for optimal results. Since I was mostly fine-tuning rather than training from scratch, I used a small batch size and a low learning rate, albeit this made training more time-consuming

    2. Another significant argument is spec_augment (SpecAugment), whose masking can be reduced to further speed up convergence
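For context, SpecAugment randomly masks out frequency bands and time steps of the input spectrogram during training, acting as a regularizer. A toy sketch of the idea (plain Python lists standing in for spectrogram tensors; the parameter names here are illustrative, not NeMo's actual config keys):

```python
import random

def spec_augment(spec, n_freq_masks=1, freq_width=2,
                 n_time_masks=1, time_width=2, rng=None):
    """Toy SpecAugment: zero out random frequency bands and time steps
    of a spectrogram given as a list of [time][freq] rows."""
    rng = rng or random.Random()
    n_time, n_freq = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # leave the input untouched
    for _ in range(n_freq_masks):
        # Pick a random band of frequencies and silence it at every time step.
        f0 = rng.randrange(max(1, n_freq - freq_width + 1))
        for t in range(n_time):
            for f in range(f0, min(f0 + freq_width, n_freq)):
                out[t][f] = 0.0
    for _ in range(n_time_masks):
        # Pick a random span of time steps and silence all frequencies in it.
        t0 = rng.randrange(max(1, n_time - time_width + 1))
        for t in range(t0, min(t0 + time_width, n_time)):
            out[t] = [0.0] * n_freq
    return out
```

Fewer or narrower masks mean milder augmentation, which is one reason dialing it down can speed up convergence when fine-tuning on a small domain dataset.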

    3. Evaluation of performance via TensorBoard

      1. As the model is being fine-tuned/trained, you can watch how the word error rate (WER) changes over the epochs, giving you a good estimate of where and what is going wrong

      Plotting graphs with Tensorboard

    4. For further evaluation, Nvidia provides a Python script for evaluating an ASR model with a variety of decoding modes.

      • Greedy WER/CER is the error rate of the NeMo model with a greedy decoder

      • WER/CER with beam search decoding and N-gram model is the error rate of the NeMo model with an N-gram language model and the Flashlight beam search decoder

      • Oracle WER/CER is the best-case error rate for beam search decoding with the N-gram model (i.e. if the best candidate in the beam were always picked)

    My best evaluated result

Most of these scripts can be found in Nvidia's NeMo repository here.

The best WER I obtained was 11.3% (the goal was below 10%).
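For reference, WER is just the word-level edit (Levenshtein) distance between the hypothesis and the reference transcript, divided by the number of reference words. A minimal implementation (my own sketch, not Nvidia's evaluation script):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance between the
    reference and hypothesis transcripts, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting i words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting j words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution ("acute" -> "a") in a four-word reference:
print(wer("patient has acute bronchitis", "patient has a bronchitis"))  # 0.25
```

So 11.3% means roughly one word in nine was substituted, inserted, or deleted relative to the reference transcript.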

Deployment and Integration

Below is a brief outline of the steps I took:

  1. Starting and Setting up Riva Server

  2. Integration into existing web application

Below is a diagram of the web application stack that we were using.


Web Application Stack Overview

  • Frontend (Client)

    • AngularJS (HTML, CSS, JS)

      • frontend framework

  • Backend

    • ExpressJS & NodeJS

      • used for app routing to various CRUD operations

    • Flask & Python

      • used for various API calls

        • Riva automatic speech transcription service (ASR)

        • BERT classification models from Hugging Face (NLP)

    • Cosmos DB

      • hosts a MongoDB instance on the cloud, which stores user information & file information (e.g. filenames, IDs)

    • Azure Blob Storage (File storage)

      • stores files on the cloud (e.g. audio recordings, PDFs, transcripts)

Some of the challenges I faced during deployment & integration

  • Streaming the audio from Angular to Python with minimal latency

  • Hard-to-read medical terms still being recognized as garbage

    • Used word boosting to increase the likelihood of these medical terms being recognized

    • Likewise, negative word boosting can be applied to decrease the chances of certain words being recognized

    • More information can be found here.

  • Making the data presentable

    • Recognizing key phrases that appear in every transcript (e.g. "Past medical history") and adding the appropriate HTML tags (e.g. <h2>)
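A minimal sketch of that tagging step (the heading list and function name here are illustrative, not the exact ones from the project):

```python
import re

# Illustrative section headings; the real list was tailored to the
# hospital's transcript format.
SECTION_HEADINGS = ["Past medical history", "Current medications", "Allergies"]

def tag_sections(transcript: str) -> str:
    """Wrap known section headings in <h2> tags so the raw ASR text
    renders as a structured document."""
    for heading in SECTION_HEADINGS:
        pattern = re.compile(re.escape(heading), re.IGNORECASE)
        transcript = pattern.sub(lambda m: f"<h2>{m.group(0)}</h2>", transcript)
    return transcript

print(tag_sections("past medical history patient reports asthma"))
# <h2>past medical history</h2> patient reports asthma
```

Matching case-insensitively matters because the ASR output is not guaranteed to capitalize headings the way the final document should.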

Combined with my teammates' NLP features, here is a snippet of the final product! (sadly no demo video 😭)

Reflections

This was a hard project, mainly due to the tight time constraints (<3 months), sparse documentation at the time, and lastly skill issues on my part. However, I was quite satisfied that my teammates and I were able to deliver a good working implementation of both NLP and ASR models in an integrated web application, albeit not the prettiest. This was also one of the few projects I did that was totally unrelated to cybersecurity, and it exposed me to the world of AI/ML, so for that I am grateful.

Maybe I will revisit this topic in the near future.

Here are my final presentation slides. Special shoutout to my teammates, who made working on the project much more enjoyable 😎

You can see a low-resolution image of the whole application over here.

sigh