Domain-Adaptive Speech Transcription and Comprehension
Automated Speech Recognition using Nvidia's Citrinet Model, deployed on Nvidia Riva
Last updated
For this project, I was tasked with developing and deploying automated speech recognition using Nvidia's Citrinet models, specifically for medical terms and phrases used by doctors and staff, as part of my Final Year Project in poly. This was done in 2022.
I was also tasked with integrating the newly developed models into the existing Angular web application.
Generating tokenizer from wordlist
Citrinet uses sub-word tokenization, so a tokenizer has to be generated before the model can even be trained.
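To show what "sub-word" means in practice, here is a minimal sketch that splits medical terms into pieces using greedy longest-match against a tiny hand-written vocabulary. This is only an illustration: it is not the SentencePiece-based tokenizer that NeMo builds, and the `tokenize` function and `VOCAB` set are hypothetical names made up for this example.

```python
def tokenize(word, vocab, unk="<unk>"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            # No vocabulary entry starts at this character.
            pieces.append(unk)
            start += 1
    return pieces


# Hypothetical vocabulary; a real one is learned from the wordlist/corpus.
VOCAB = {"cardio", "myo", "pathy", "hyper", "tension", "gastr", "itis", "o"}

print(tokenize("cardiomyopathy", VOCAB))
print(tokenize("hypertension", VOCAB))
```

Because rare medical terms break down into pieces the model has already seen, the vocabulary the tokenizer is built from largely decides how well those terms can be spelled out at inference time.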
Training/Finetuning
The main challenge here was tuning the hyperparameters for the best results. Since I was fine-tuning most of the time, I used a small batch size and a low learning rate, although this made training more time-consuming.
Another significant argument is spec_augment (the SpecAugment masking settings), which can be reduced to speed up convergence further.
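As a rough sketch of the fine-tuning settings described above, the snippet below flattens a nested dictionary into Hydra-style `key=value` overrides of the kind a NeMo training script accepts on the command line. The key names follow the layout of NeMo's published Citrinet configs and should be checked against the version in use; the values are illustrative assumptions, not the ones used in this project, and `to_overrides` is a hypothetical helper.

```python
def to_overrides(params):
    """Flatten a nested dict into Hydra-style 'a.b.c=value' strings."""
    out = []

    def walk(prefix, node):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if isinstance(value, dict):
                walk(path, value)
            else:
                out.append(f"{path}={value}")

    walk("", params)
    return out


# Illustrative values only (assumptions, not the project's real settings).
FINETUNE = {
    "model": {
        "train_ds": {"batch_size": 8},      # small batch size
        "optim": {"lr": 1e-4},              # low learning rate
        # Fewer SpecAugment masks means lighter augmentation.
        "spec_augment": {"freq_masks": 1, "time_masks": 2},
    },
    "trainer": {"max_epochs": 50},
}

print(" ".join(to_overrides(FINETUNE)))
```

Keeping the overrides in one dictionary makes it easier to record exactly which combination produced a given TensorBoard run.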
Evaluation of performance via Tensorboard
As the model is being fine-tuned/trained, you can see how the word error rate (WER) changes over the epochs, giving you a good estimate of where/what is going wrong.
For further evaluation, Nvidia provides a Python script for evaluating an ASR model with a variety of decoding modes.
Greedy WER/CER is the WER/CER that the NeMo model achieves with the greedy decoder
WER/CER with beam search decoding and N-gram model is the WER/CER that the NeMo model achieves with a language model and the Flashlight decoder
Oracle WER/CER is the best-case WER/CER for beam search decoding with the N-gram model
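To make these metrics concrete, here is a minimal sketch of how WER, CER and an oracle WER can be computed with a plain edit distance. It is not Nvidia's evaluation script; the function names are my own, and it assumes a non-empty reference and that "oracle" means picking the candidate from the beam that is closest to the reference.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: word-level edits divided by reference length."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref, hyp):
    """Character error rate: character-level edits over reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)


def oracle_wer(ref, candidates):
    """Best WER achievable if the ideal beam candidate were always chosen."""
    return min(wer(ref, c) for c in candidates)


reference = "patient has a history of hypertension"
beam = ["patient has history of hyper tension",
        "patient has a history of hypertension"]
print(wer(reference, beam[0]), oracle_wer(reference, beam))
```

A large gap between the beam search WER and the oracle WER suggests the right transcript is usually in the beam but is being ranked too low, which points at the language model weights rather than the acoustic model.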
Most of these scripts can be found in Nvidia's NeMo repository here.
The best result I obtained was 11.3% (the goal was below 10%).
Below is a brief outline of the steps I took:
Starting and Setting up Riva Server
Creation of a streaming client in Python 3/Node.js via gRPC
Integration into existing web application
Below is a diagram of the web application stack that we were using.
Web Application Stack Overview
Frontend (Client)
AngularJS (HTML, CSS, JS)
frontend framework
Backend
ExpressJS & NodeJS
used for app routing to various CRUD operations
Flask & Python
used for various API calls
Riva automatic speech transcription service (ASR)
BERT classification models from Hugging Face (NLP)
Cosmos DB
hosts the MongoDB database in the cloud, which stores user and file information (e.g. filenames, IDs)
Azure Blob Storage (File storage)
stores files in the cloud (e.g. audio recordings, PDFs, transcripts)
Some of the challenges I faced during deployment & integration
Streaming the audio from Angular to Python quickly, with minimal latency
Use of Socket.IO for low-latency bidirectional streaming, with appropriate buffering
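The buffering part can be sketched roughly as below: audio arrives from the browser in messages of arbitrary size, and the server regroups it into fixed-size chunks before passing it to the recognizer. This is a simplified sketch and not the original code; `ChunkBuffer` is a hypothetical class, and the audio format (16 kHz, 16-bit mono PCM, 100 ms chunks) is an assumption.

```python
class ChunkBuffer:
    """Collects variable-sized audio messages and emits fixed-size chunks."""

    def __init__(self, chunk_bytes):
        self.chunk_bytes = chunk_bytes
        self._pending = bytearray()

    def feed(self, data):
        """Add raw bytes; return a list of every full chunk now available."""
        self._pending.extend(data)
        chunks = []
        while len(self._pending) >= self.chunk_bytes:
            chunks.append(bytes(self._pending[:self.chunk_bytes]))
            del self._pending[:self.chunk_bytes]
        return chunks

    def flush(self):
        """Return whatever is left (possibly empty) when the stream ends."""
        rest = bytes(self._pending)
        self._pending.clear()
        return rest


# Assumed format: 16 kHz, 16-bit mono PCM, 100 ms of audio per chunk.
SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE // 10

buffer = ChunkBuffer(CHUNK_BYTES)
ready = buffer.feed(b"\x00" * 5000)  # one socket message of 5000 bytes
print(len(ready), len(buffer.flush()))
```

Smaller chunks lower the delay before the first partial transcript appears but increase the number of requests, so the chunk size is a trade-off worth tuning.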
Hard-to-pronounce medical terms still being transcribed as garbage
Use of word boosting to make these medical terms more likely to be recognized
Likewise, negative word boosting can be used to make certain words less likely to be recognized
More information can be found here
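The idea behind word boosting can be illustrated with a small rescoring sketch: each candidate transcript gets a bonus for every boosted word it contains and a penalty for every negatively boosted one. Riva applies boosts inside its decoder, so this is a conceptual simplification rather than how Riva implements it; `rescore`, the scores and the example sentences are all hypothetical.

```python
def rescore(hypotheses, boosts):
    """Pick the best hypothesis after adding per-word boost scores.

    hypotheses: list of (text, log_score) pairs from a decoder.
    boosts: dict mapping a lowercase word to a score; positive values
            favour the word, negative values penalise it.
    """
    def boosted(item):
        text, score = item
        return score + sum(boosts.get(w, 0.0) for w in text.lower().split())

    return max(hypotheses, key=boosted)[0]


# Made-up decoder output: the wrong transcript scores higher at first.
hyps = [("patient takes met for men", -4.0),
        ("patient takes metformin", -5.5)]

print(rescore(hyps, {}))                                  # no boosting
print(rescore(hyps, {"metformin": 3.0, "men": -1.0}))     # with boosting
```

Boost values that are too large can make the boosted word appear where it was never spoken, so they are best raised gradually while watching the WER.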
Making the data presentable
Recognizing key phrases that appear in every transcript (e.g. Past medical history) and adding the appropriate tags (e.g. <h2>)
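A minimal sketch of this tagging step is shown below, assuming the section phrases are known in advance. Only "Past medical history" comes from the project; the other phrases, the `SECTION_HEADINGS` list and the `tag_headings` function are hypothetical. The transcript is HTML-escaped first so that the only markup in the output is the tags added here.

```python
import html
import re

# "Past medical history" is from the project; the others are made up.
SECTION_HEADINGS = ["Past medical history", "Presenting complaint",
                    "Medications"]


def tag_headings(transcript, headings=SECTION_HEADINGS):
    """Escape the transcript, then wrap each known section phrase in <h2>."""
    text = html.escape(transcript)
    for heading in headings:
        # Match the phrase regardless of how the ASR output is capitalised.
        pattern = re.compile(re.escape(html.escape(heading)), re.IGNORECASE)
        text = pattern.sub(lambda m: f"<h2>{m.group(0)}</h2>", text)
    return text


print(tag_headings("past medical history diabetes medications metformin"))
```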
This was a hard project, mainly due to the tight time constraint (<3 months), the sparse documentation at the time, and lastly a skill issue on my part. However, I was quite satisfied that my teammates and I were able to deliver a good working implementation of both the NLP and ASR models in an integrated web application, albeit not the prettiest one. This was also one of the few projects I did that was totally unrelated to cybersecurity, and it exposed me to the world of AI/ML, so for that I am grateful.
Maybe in the near future, I would like to revisit this topic.
You can see a low-resolution image of the whole application over here.
PS - I no longer have access to the source code/models and web application sadly :(
Combined with my teammates' NLP features, here is a snippet of the final product! (sadly no demo video)
Here are my final presentation slides. Special shoutout to my teammates, who made working on the project much more enjoyable.