
# Domain-Adaptive Speech Transcription and Comprehension

For this project, I was tasked with developing and deploying automatic speech recognition (ASR) using [Nvidia's Citrinet](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/speechtotext_english_citrinet) models, adapted specifically to medical terms and phrases, for doctors and staff. This was my Final Year Project in polytechnic, completed in 2022.

I was also tasked with integrating the newly developed models into the existing Angular web application.

### Training/Fine-tuning of Citrinet

1. Generating a tokenizer from a wordlist
   1. Citrinet uses [sub-word tokenization](https://www.geeksforgeeks.org/subword-tokenization-in-nlp/), so a tokenizer has to be generated before the model can be trained at all
2. Training/Fine-tuning

   1. The main challenge here was tuning the hyperparameters for the best results. Since I was mostly fine-tuning, I used a small batch size and a low learning rate, although this made training slower

   2. Other significant arguments include **spec\_augment**, which can be reduced to speed up convergence further

   3. Evaluation of performance via TensorBoard

      1. As the model is being trained/fine-tuned, you can see how the word error rate (WER) changes over the epochs, which gives a good indication of where and what is going wrong

      <figure><img src="/files/0LeX2OM6TPAqWnNfAqxp" alt=""><figcaption><p>Plotting graphs with TensorBoard</p></figcaption></figure>

   4. For further evaluation, Nvidia provides a Python script that evaluates an ASR model under several decoding modes.
      * Greedy WER/CER is the error rate of the NeMo model with the greedy decoder
      * WER/CER with beam search decoding and an N-gram model is the error rate of the NeMo model with a language model and the Flashlight decoder
      * Oracle WER/CER is the best-case error rate for beam search decoding with the N-gram model, i.e. the error rate if the best candidate in the beam were always picked

   <figure><img src="/files/hK4wGf13NKQpxjVAZLy1" alt=""><figcaption><p>My best evaluated result</p></figcaption></figure>
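
As a rough illustration of step 1, here is a toy sketch of what sub-word tokenization does to a medical term. It is not the tokenizer that NeMo generates: the vocabulary `VOCAB` and the greedy longest-match rule in `tokenize` are assumptions chosen purely to show the idea.

```python
# Toy greedy longest-match sub-word tokenizer (illustration only).
# VOCAB is a made-up vocabulary, not one generated from a real wordlist.
VOCAB = {"cardio", "myo", "pathy", "hyper", "tension", "itis", "gastr"}


def tokenize(word, vocab=VOCAB):
    """Split a word into the longest vocabulary pieces, left to right.

    Characters that match no vocabulary piece fall back to
    single-character tokens, so any word can still be represented.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible piece first, shrinking until one matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No piece matched: emit one character and move on.
            tokens.append(word[i])
            i += 1
    return tokens


print(tokenize("cardiomyopathy"))  # ['cardio', 'myo', 'pathy']
print(tokenize("gastritis"))       # ['gastr', 'itis']
```

The point is that a rare term never seen as a whole word can still be spelled out from pieces the model does know, which is why the tokenizer has to exist before training starts.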

{% hint style="info" %}
Most of these scripts can be found in Nvidia's NeMo repository [here](https://github.com/NVIDIA/NeMo/tree/main).
{% endhint %}

The best result I obtained was 11.3% (the goal was below 10%).
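
For readers unfamiliar with the metrics, the following is a minimal from-scratch sketch of how WER and oracle WER are defined. It is not Nvidia's evaluation script; the function names and the example sentences are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def oracle_wer(reference, candidates):
    """Best-case WER: the lowest WER among all beam candidates."""
    return min(wer(reference, c) for c in candidates)


# Made-up example: the decoder's top candidate is listed first.
reference = "patient has a history of hypertension"
beam = [
    "patient has a history of hyper tension",
    "patient has history of hypertension",
    "patient has a history of hypertension",
]
print(round(wer(reference, beam[0]), 3))  # 0.333 (1 substitution + 1 insertion over 6 words)
print(oracle_wer(reference, beam))        # 0.0 (the third candidate is exact)
```

CER is the same calculation over characters instead of words. A large gap between the beam search WER and the oracle WER suggests the right answer is in the beam but is not being ranked first.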

### Deployment and Integration

Below is a brief outline of the steps I took:

1. Starting and setting up the [Riva Server](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/quick-start-guide.html)
2. Creating a [streaming client in Python 3/Node.js](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tutorials/asr-basics.html#transcription-with-riva-asr-apis) via gRPC
3. Integrating it into the existing web application
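
A streaming client sends audio to the server in small chunks rather than as one finished file. The sketch below shows only that chunking step, under assumed audio parameters (16 kHz, 16-bit mono, 100 ms chunks). It does not call the Riva API, and `chunk_audio` is a hypothetical helper name.

```python
def chunk_audio(pcm_bytes, sample_rate=16000, sample_width=2, chunk_ms=100):
    """Yield fixed-duration chunks of raw mono PCM audio.

    sample_width is in bytes per sample (2 for 16-bit audio).
    The final chunk may be shorter than the others.
    """
    chunk_size = sample_rate * sample_width * chunk_ms // 1000
    if chunk_size <= 0:
        raise ValueError("chunk parameters must give a positive chunk size")
    for start in range(0, len(pcm_bytes), chunk_size):
        yield pcm_bytes[start:start + chunk_size]


# One second of silence at the assumed 16 kHz, 16-bit mono format.
audio = bytes(16000 * 2)
chunks = list(chunk_audio(audio))
print(len(chunks), len(chunks[0]))  # 10 3200
```

Smaller chunks lower the delay before the first partial transcript but increase the number of messages sent, so the chunk duration is a trade-off to tune.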

Below is a diagram of the web application stack that we were using.

<figure><img src="/files/onjzgt5mCSEdZpVMGnAR" alt=""><figcaption><p>Cool Diagram</p></figcaption></figure>

**Web Application Stack Overview**

* **Frontend (Client)**
  * AngularJS (HTML, CSS, JS)
    * frontend framework
* **Backend**
  * **ExpressJS & NodeJS**
    * used for app routing to various CRUD operations
  * **Flask & Python**
    * used for various API calls
      * Riva automatic speech transcription service (ASR)
      * BERT classification models from Hugging Face (NLP)
  * **Cosmos DB**
    * hosts the MongoDB database in the cloud, which stores user and file information (e.g. filenames, IDs)
  * **Azure Blob Storage (File storage)**
    * stores files in the cloud (e.g. audio recordings, PDFs, transcripts)

Some of the challenges I faced during deployment & integration:

* Streaming the audio from Angular to Python with minimal latency
  * Used [Socket.IO](https://socket.io/) for low-latency bidirectional streaming, with appropriate buffering
* Hard-to-read medical terms were still being transcribed as garbage
  * Used word boosting to make these medical terms more likely to be recognized
  * Likewise, negative word boosting can make certain words less likely to be recognized
  * More information can be found [here](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tutorials/asr-wordboosting.html)
* Making the data presentable

  * Recognizing key phrases that appear in every transcript (e.g. "Past medical history") and wrapping them in the appropriate tags (e.g. \<h2>)
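
The last point can be sketched roughly as follows. This is not the original implementation: `tag_key_phrases` is a hypothetical helper, and apart from "Past medical history" the phrases in `KEY_PHRASES` are made-up examples.

```python
import html
import re

# Hypothetical section titles; only the first one comes from the text above.
KEY_PHRASES = ["Past medical history", "Medications", "Allergies"]


def tag_key_phrases(transcript, phrases=KEY_PHRASES, tag="h2"):
    """Wrap each key phrase in an HTML heading tag (case-insensitive)."""
    # Escape the transcript first so stray "<" or "&" cannot break the markup.
    escaped = html.escape(transcript)
    # Longest phrases first so a longer phrase wins over a shorter prefix.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(html.escape(p)) for p in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: f"<{tag}>{m.group(1)}</{tag}>", escaped)


print(tag_key_phrases("past medical history patient has asthma"))
# <h2>past medical history</h2> patient has asthma
```

Matching is case-insensitive because ASR output often has no reliable capitalization, and the matched text is kept as transcribed rather than replaced with the canonical phrase.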

Combined with my teammates' NLP features, here is a snippet of the final product! (sadly no demo video :sob:)

<figure><img src="/files/yaoPYnq9yutRVeukZ21y" alt=""><figcaption></figcaption></figure>

### Reflections

This was a hard project, mainly because of the tight time constraint (<3 months), the sparse documentation at the time, and, lastly, a skill issue on my part. However, I was quite satisfied that my teammates and I were able to deliver a good working implementation of both the NLP and ASR models in an integrated web application, albeit not the prettiest one.\
\
This was also one of the few projects I did that was totally unrelated to cybersecurity, and it exposed me to the world of AI/ML, so for that I am grateful.

I may revisit this topic in the near future.

My final presentation slides are embedded at the end of this page. Special shoutout to my teammates, who made working on the project much more enjoyable :sunglasses:

You can see a low-resolution image of the whole application here.

<figure><img src="/files/MjfJbwmqpfeLDmlnk2Pc" alt=""><figcaption><p>sigh</p></figcaption></figure>

{% hint style="warning" %}
PS - Sadly, I no longer have access to the source code, models, or web application :(
{% endhint %}

{% embed url="<https://docs.google.com/presentation/d/e/2PACX-1vSHEgH75O_1iJI8ThuOcuThFgnGhOt8M0wzY0235DzkFmmaZbs7EXom9YoSpEq4izl_WZ7X_sfUtv4c/embed?delayms=3000&loop=false&start=false>" %}
