Domain-Adaptive Speech Transcription and Comprehension

Automated Speech Recognition using Nvidia's Citrinet Model, deployed on Nvidia Riva

For this project, I was tasked with developing and deploying automated speech recognition using Nvidia's Citrinet models, specifically for the medical terms and phrases used by doctors and staff, as part of my Final Year Project in poly. This was done in 2022.

I was also tasked with integrating the newly developed models into the existing Angular web application.

Training/Fine-tuning of Citrinet

Generating tokenizer from wordlist
1. Citrinet uses sub-word tokenization, so a tokenizer has to be generated before the model can be trained at all
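To give a feel for what "generating a tokenizer from a wordlist" means, here is a minimal sketch of byte-pair-encoding style merge learning in plain Python. This is a toy illustration only, not the tokenizer tooling used in the project; the wordlist, the function names, and the number of merges are all made up for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_merges(wordlist, num_merges):
    """Learn up to `num_merges` merge rules by repeatedly merging the most frequent pair."""
    corpus = Counter(tuple(word) for word in wordlist)  # every word starts as characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Ties are broken by the pair itself so the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def tokenize(word, merges):
    """Split a word into sub-word tokens by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical medical wordlist; three merges are enough to learn the suffix "itis".
WORDLIST = ["hepatitis", "gastritis", "arthritis", "bronchitis"]
MERGES = learn_merges(WORDLIST, num_merges=3)
COLITIS_TOKENS = tokenize("colitis", MERGES)  # -> ['c', 'o', 'l', 'itis']
```

The point of the sketch is that a word the tokenizer has never seen ("colitis") still breaks down into pieces it knows, which is why sub-word models cope better with rare medical vocabulary than word-level ones.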
Training/Finetuning
1. The main challenge here was tuning the hyperparameters for the best results. Since I was fine-tuning most of the time, I used a small batch size and a low learning rate, although this made training more time-consuming
2. Other significant arguments include spec_augment, which can be reduced to further speed up convergence
3. Evaluation of performance via Tensorboard
  1. As the model is being fine-tuned/trained, you can see how the word error rate changes over the number of epochs, giving you a good estimate of where/what is going wrong
  Plotting graphs with Tensorboard
4. For further evaluation, Nvidia provides a Python script for evaluating an ASR model with a variety of decoding modes.
  - Greedy WER/CER is the WER/CER of the NeMo model with the greedy decoder
  - WER/CER with beam search decoding and N-gram model is the WER/CER of the NeMo model with a language model and the Flashlight decoder
  - Oracle WER/CER is the best-case WER/CER for beam search decoding with the N-gram model
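Going back to the hyperparameters in points 1 and 2, the kind of fine-tuning overrides described there can be sketched as below. This is a hedged illustration, not the configuration from the project: the dotted key names follow the layout of NeMo's Citrinet YAML configs as I understand them and should be treated as assumptions, and every numeric value is invented for the example.

```python
import copy

# Hypothetical baseline config. Key names are assumptions modelled on
# NeMo-style YAML configs; the values are made up.
BASE_CONFIG = {
    "model": {
        "train_ds": {"batch_size": 32},
        "optim": {"lr": 0.05},
        "spec_augment": {"freq_masks": 2, "time_masks": 10},
    }
}


def apply_overrides(config, overrides):
    """Return a copy of `config` with dotted-key overrides applied.

    Unknown keys raise KeyError, so a misspelled argument fails loudly
    instead of being silently ignored.
    """
    result = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = result
        for key in parents:
            node = node[key]
        if leaf not in node:
            raise KeyError(dotted_key)
        node[leaf] = value
    return result


# Fine-tuning: smaller batch, lower learning rate, lighter augmentation.
FINE_TUNE_OVERRIDES = {
    "model.train_ds.batch_size": 8,
    "model.optim.lr": 1e-4,
    "model.spec_augment.freq_masks": 1,
    "model.spec_augment.time_masks": 2,
}
FINE_TUNE_CONFIG = apply_overrides(BASE_CONFIG, FINE_TUNE_OVERRIDES)
```

Rejecting unknown keys is a deliberate choice in this sketch: with long training runs, a typo in an argument name is much cheaper to catch before training starts than after.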
My best evaluated result
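For readers unfamiliar with the metrics above, here is a minimal sketch of how WER and an "oracle" WER can be computed for a single utterance. It is not Nvidia's evaluation script; the function names and the example sentences are made up, and a real evaluation would sum errors over a whole test set before dividing.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def oracle_wer(reference, n_best):
    """Best achievable WER if the closest hypothesis in the N-best list were always chosen."""
    return min(wer(reference, hyp) for hyp in n_best)


# Hypothetical example: one reference and a two-entry beam search N-best list.
REFERENCE = "patient has acute gastritis"
N_BEST = ["patient has a cute gastritis", "patient has acute gastric"]

TOP_WER = wer(REFERENCE, N_BEST[0])          # 2 errors / 4 words = 0.5
ORACLE_WER = oracle_wer(REFERENCE, N_BEST)   # 1 error / 4 words = 0.25
```

The gap between the two numbers shows how much could be gained from better rescoring alone, since the oracle figure assumes the decoder always picks the best candidate it already produced.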

Most of these scripts can be found in Nvidia's NeMo repository here

The best result I obtained was 11.3% (the goal was below 10%).

Deployment and Integration

Below is a brief outline of the steps I took

Starting and Setting up Riva Server
Creation of Streaming Client in Python 3/Node.js via gRPC
Integration into existing web application

Below is a diagram of the web application stack that we were using.

Web Application Stack Overview

Frontend (Client)
- AngularJS (HTML, CSS, JS)
  - frontend framework
Backend
- ExpressJS & NodeJS
  - used for app routing to various CRUD operations
- Flask & Python
  - used for various API calls
    - Riva automatic speech transcription service (ASR)
    - BERT classification models from Hugging Face (NLP)
- Cosmos DB
  - hosts the MongoDB database on the cloud, which stores user information & file information (e.g. filenames, ids)
- Azure Blob Storage (File storage)
  - stores files on the cloud (e.g. audio recordings, PDFs, transcripts)

Some of the challenges I faced during deployment & integration

Streaming the audio from Angular to Python with minimal latency
- Use of Socket.IO for low-latency bidirectional streaming, with appropriate buffering
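The "appropriate buffering" part can be sketched roughly as follows: socket messages arrive in arbitrary sizes, so the server collects them and hands fixed-size frames to the recognizer. This is a minimal sketch and not the project's code; the class name is hypothetical, and the audio format (16 kHz, 16-bit mono PCM, 100 ms frames) is an assumption.

```python
class AudioChunker:
    """Re-slice an incoming byte stream into fixed-size audio frames."""

    def __init__(self, sample_rate=16000, sample_width=2, frame_ms=100):
        # Bytes per frame for mono PCM: samples per frame * bytes per sample.
        self.frame_bytes = sample_rate * sample_width * frame_ms // 1000
        self._buffer = bytearray()

    def feed(self, data):
        """Add newly received bytes and return every complete frame now available."""
        self._buffer.extend(data)
        frames = []
        while len(self._buffer) >= self.frame_bytes:
            frames.append(bytes(self._buffer[:self.frame_bytes]))
            del self._buffer[:self.frame_bytes]
        return frames

    def flush(self):
        """Return whatever is left (a final, shorter frame) and reset the buffer."""
        rest = bytes(self._buffer)
        self._buffer.clear()
        return rest
```

In a socket handler, each incoming audio message would be passed to `feed()` and the returned frames forwarded to the ASR stream, with `flush()` called when the client stops recording. Smaller frames reduce latency at the cost of more requests, so the frame length is a tuning knob.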
Hard-to-read medical terms still being recognized as garbage
- Use of word boosting to make these medical terms more likely to be recognized
- Likewise, negative word boosting can be used to make certain words less likely to be recognized
- More information can be found here
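The idea behind word boosting can be illustrated with a toy rescoring function. This is only a conceptual sketch: Riva applies boosting inside its decoder rather than as a post-processing step like this, and the function name, hypotheses, scores, and boost values below are all made up.

```python
def pick_with_boosting(n_best, boosts):
    """Pick the best hypothesis after adding a per-word bonus or penalty.

    n_best: list of (text, score) pairs where a higher score is better.
    boosts: dict mapping a lowercase word to a score adjustment
            (positive to boost a word, negative to suppress it).
    """
    def boosted_score(item):
        text, score = item
        return score + sum(boosts.get(word, 0.0) for word in text.lower().split())

    return max(n_best, key=boosted_score)[0]


# Hypothetical decoder output: the wrong split scores slightly higher.
N_BEST = [
    ("patient takes met for min", -4.0),
    ("patient takes metformin", -4.5),
]

NO_BOOST = pick_with_boosting(N_BEST, {})                  # picks the wrong split
POSITIVE = pick_with_boosting(N_BEST, {"metformin": 2.0})  # boosted term wins
NEGATIVE = pick_with_boosting(N_BEST, {"min": -1.0})       # penalized word loses
```

Boost values that are too large make the recognizer hear the boosted term everywhere, so they usually need tuning against real recordings.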
Making the data presentable
- Recognizing key phrases that appear in every transcript (e.g. Past medical history) and adding the appropriate tags (e.g. <h2>)
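A minimal sketch of that tagging step, assuming the section headings are known in advance. The heading list and the function name are hypothetical, and this is not the project's actual implementation.

```python
import html
import re

# Hypothetical list of phrases that mark the start of a section.
SECTION_HEADINGS = ["Past medical history", "Medications", "Allergies"]


def tag_headings(transcript, headings=SECTION_HEADINGS):
    """Escape the transcript for HTML, then wrap known section phrases in <h2> tags."""
    escaped = html.escape(transcript)
    if not headings:
        return escaped
    # Longest phrases first so a longer heading wins over a shorter one it contains.
    ordered = sorted(headings, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(html.escape(h)) for h in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: "<h2>" + match.group(0) + "</h2>", escaped)


TAGGED = tag_headings("past medical history diabetes medications metformin")
```

Escaping before tagging matters here because the transcript text ends up inside a web page; the match is case-insensitive since ASR output often comes back in lowercase.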

Combined with my teammates' NLP features, here is a snippet of the final product! (sadly no demo video 😭)

Reflections

This was a hard project, mainly due to the tight time constraint (<3 months), sparse documentation at the time, and lastly a skill issue on my part. However, I was quite satisfied that my teammates and I were able to deliver a good working implementation of both NLP and ASR models in an integrated web application, albeit not the prettiest one. This was also one of the few projects I did that was totally unrelated to cybersecurity, and it exposed me to the world of AI/ML, so for that I am grateful.

Maybe in the near future, I would like to revisit this topic.

Here are my final presentation slides, special shoutout to my teammates who made working on the project much more enjoyable 😎

You can see a low-resolution image of the whole application over here.

PS - I no longer have access to the source code/models and web application sadly :(

