Stabilizing and Extending enviroCar: Voice Command

Introduction

As we approach the final lap of Google Summer of Code 2023, it’s the perfect time to reflect on the journey we’ve undertaken with the enviroCar: Voice Command project. Over the past 12 weeks, my focus has been on fortifying the existing codebase while also introducing new, valuable features. This meant tackling challenges across a wide range of tech stacks, with the ultimate objective of a better and safer user experience.

Project Goals

1. Improving the accuracy of wake word detection

Wake word detection in the enviroCar: Voice Command project was initially implemented with AimyBox and the PocketSphinx speechkit, as discussed in the previous blogs. This had its own advantages and drawbacks. While AimyBox provided the necessary framework for a voice assistant, the PocketSphinx speechkit lacked the desired accuracy because its models are old and unmaintained. Low wake word detection accuracy has been a major roadblock for the project, so solving it had high priority. Adapting a PocketSphinx model also presented significant challenges.

After weighing several factors, we decided to proceed with an alternative, Vosk, which is more promising thanks to its fairly frequent model updates and better model adaptation pipelines. I performed language model adaptation on the Vosk model, which added “envirocar” as a word to the model vocabulary. This significantly improved recognition of our call phrase, “enviroCar listen”.

Although model adaptation improved accuracy, it also introduced a new obstacle: the larger size of the adapted model. While the original PocketSphinx model was around 30MB, the adapted Vosk model reached approximately 150MB, a fivefold increase. Including this model in the app package would inflate the application’s size disproportionately, which is clearly undesirable.

Language model pruning

Model pruning drops some of a model’s weights, which reduces its size at some cost in accuracy. Used in moderation, it can shrink the model without compromising its usability.
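As a toy illustration of this size-versus-accuracy trade-off, the sketch below drops entries from a small n-gram table when their probability falls under a threshold. It is not the Vosk pruning pipeline: the class name, the sample phrases, the probabilities, and the threshold are all made up for the example, and real language model pruning also has to redistribute the removed probability mass, which is omitted here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy threshold pruning over a hypothetical n-gram probability table. */
final class NgramPruningSketch {

    /** Returns a copy of the table keeping only entries with probability >= threshold. */
    static Map<String, Double> prune(Map<String, Double> ngrams, double threshold) {
        Map<String, Double> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : ngrams.entrySet()) {
            if (entry.getValue() >= threshold) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up values, for illustration only.
        Map<String, Double> ngrams = new LinkedHashMap<>();
        ngrams.put("envirocar listen", 0.20);
        ngrams.put("start recording", 0.05);
        ngrams.put("purple elephant", 0.0001);

        Map<String, Double> pruned = prune(ngrams, 0.001);
        System.out.println(ngrams.size() + " -> " + pruned.size() + " entries");
    }
}
```

A higher threshold removes more entries and shrinks the table further, but rare phrases then stop being recognized, which is the accuracy cost mentioned above.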

A Vosk model is pruned before adaptation, so in our case we had to repeat the adaptation process after pruning the model. I was able to complete the whole process, with the following results:

Model size after adaptation: 150 MB

Model size after pruning + adaptation: 80 MB

Here’s how pruning fits into the process discussed in the last blog:

Language Model Adaptation Updated Process

Pruning the model yielded fruitful results, yet the size was still not satisfactory, and shipping the model files inside the application package remained too costly. To tackle these challenges, we decided on a framework for remote delivery of the model files to the app, with a backend API and a model dashboard. More on this in the later sections.

Notable PRs

2. Stabilizing the bot with best practices

As mentioned in the previous blogs, the voice command project had known issues that needed considerable attention. Several obstacles in the module stemmed from deviations from architectural best practices and from redundant, unused code. I was able to solve all the known issues by introducing techniques like Dependency Injection and making a few architectural enhancements to the voice bot.

Although we did not use AimyBox’s UI components, the module still contained its ViewModel and other code derived from these components. Removing them simplified the module and improved code readability.

A subset of the problems was related to the handling of the AimyBox instance throughout the app. An AimyBox instance is the encompassing object responsible for wake word detection, speech recognition, Rasa API communication, and custom skills in the app. AimyBox was instantiated in each activity or fragment that used its functionality. While this might look reasonable at a glance, it led to design flaws because the instance was tied to the activity lifecycle, which resulted in multiple instantiations of AimyBox (#986). A proper solution was to make AimyBox a singleton that is available to every activity through the application scope. I used Dagger DI to structure and implement this change. It addressed high-priority issues and provided a more robust structure with less coupling and increased modularity.
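The idea can be sketched in plain Java, without Dagger, to keep the example self-contained. This is only an illustration under stated assumptions, not the code from the PR: Assistant, AppComponent, and Screen are hypothetical stand-ins for the AimyBox instance, an application-level component, and an activity. With Dagger, the same effect is typically achieved with a @Singleton-scoped provider in an application-level component.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Plain-Java sketch of an application-scoped dependency; all names are hypothetical. */
final class SharedAssistantSketch {

    /** Stand-in for the voice assistant; counts how many times it is constructed. */
    static final class Assistant {
        static final AtomicInteger CREATED = new AtomicInteger();

        Assistant() {
            CREATED.incrementAndGet();
        }
    }

    /** Lives as long as the application and hands out one shared Assistant. */
    static final class AppComponent {
        private Assistant assistant;

        synchronized Assistant assistant() {
            if (assistant == null) {
                assistant = new Assistant(); // created lazily, once per component
            }
            return assistant;
        }
    }

    /** A screen receives its dependency instead of constructing it. */
    static final class Screen {
        final Assistant assistant;

        Screen(Assistant assistant) {
            this.assistant = assistant;
        }
    }

    public static void main(String[] args) {
        AppComponent component = new AppComponent();
        Screen first = new Screen(component.assistant());
        Screen second = new Screen(component.assistant());
        System.out.println("same instance: " + (first.assistant == second.assistant));
        System.out.println("instances created: " + Assistant.CREATED.get());
    }
}
```

Because the screens no longer call the constructor themselves, recreating a screen (for example after a configuration change) cannot create a second assistant.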

Notable PRs

Dagger DI implementation for BaseAimybox, fixes Aimybox multi-instantiation bug (#995)

3. Voice model dashboard

To expand the usability of the voice command feature, we introduced the Voice Model Dashboard. Its primary objective is to reduce the app size: instead of packaging the bulky model files with the app, they are delivered remotely, which also gives the user more model options and more control. Some of the salient improvements offered by this dashboard are:

Reduced application size
Flexibility to download a model only when using the Voice Commands feature
Delivery of model updates Over-The-Air (OTA)
Choice and download of the best-suited model
Options for other language models, such as German, in the future

The dashboard UI matches the enviroCar app’s theme. It has two RecyclerViews in the center, one listing the models available for download and the other showing the downloaded ones. Here are some screen captures of the UI implemented in the app:

The dashboard and model management code makes use of Modern Android Development principles and libraries. It uses Jetpack components like ViewModels for handling configuration changes, Coroutines and ViewModelScope for running network jobs in the background, and the Retrofit HTTP client for API calls and model downloads.

Since the models are bulky, holding them in memory during a download is a serious concern that could end in an OutOfMemoryError. To avoid this, the models are streamed to the app’s internal storage. This is implemented with Retrofit’s @Streaming annotation, which enables memory-efficient transfer by handing over the response body as a raw stream instead of first converting it into a byte array.
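A minimal sketch of the idea, in plain Java and without Retrofit so that it stays self-contained: the data is read through an InputStream and written to disk through a small fixed-size buffer, so memory use does not grow with the model size. In the app, the stream would presumably come from the ResponseBody of a Retrofit call annotated with @Streaming; here a byte array stands in for it. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Copies a potentially large stream to disk using a small fixed buffer. */
final class StreamToFileSketch {

    /** Writes everything from source into target and returns the number of bytes written. */
    static long saveTo(InputStream source, Path target) throws IOException {
        byte[] buffer = new byte[8 * 1024]; // only this much is held in memory at a time
        long total = 0;
        try (InputStream in = source; OutputStream out = Files.newOutputStream(target)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
                total += read;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] fakeModel = new byte[100_000]; // stand-in for a downloaded response body
        Path target = Files.createTempFile("voice-model", ".bin");
        long written = saveTo(new ByteArrayInputStream(fakeModel), target);
        System.out.println(written + " bytes written to " + target);
        Files.delete(target);
    }
}
```

Returning the byte count makes it easy to report download progress or to verify the file against an expected size afterwards.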

Here’s a diagrammatic representation of the proposed working of the dashboard:

Notable PRs

Voice Model Dashboard (#996)

4. Model delivery API

The Voice Model Dashboard created the need to fetch bulky model files over the network. This is handled by a backend Voice Model Delivery API, designed to complement the dashboard. The API serves model-related information and provides access to the model files hosted on a remote server, all through HTTP calls.

The delivery REST API is built with the ExpressJS framework, with endpoints for downloading model files and for serving the data the model dashboard needs. It follows an asynchronous pattern and can serve multiple requests concurrently. More details and a list of all the endpoints can be found in the repository readme here. The following table briefly describes the list of available endpoints:

Furthermore, the Model Delivery API follows a microservices architecture. It is dockerized, so it can be deployed and automated in a few simple steps.

Notable Ref

enviroCar-model-delivery-api Repository

5. enviroCar Rasa Bot CI/CD pipeline

As described in the previous blogs, a code analyzer in the form of a CI/CD testing pipeline was implemented, tested, and deployed on the enviroCar-rasa-bot repository. The pipeline is a custom GitHub Actions workflow that is triggered by every new PR. It is built on top of Rasa’s official train-test-gha, so it provides a detailed report on the effect of the changes proposed in a PR, and it automatically posts that report as a comment on the same pull request.

The pipeline will deliver valuable insights for future PRs that introduce new voice command features. It also opens the door to a novel use case for Rasa test stories.

If you would like to know more about my GSoC’ 23 work, please refer to my introduction blog, mid-term blog, and commits.

Notable PRs

Future Work

As a part of GSoC’ 23, we dedicated ourselves to both stabilizing and expanding the Voice Command project inside and out, and we made significant improvements over the course of the program. There are still opportunities for further enhancements to voice commands, which we have identified for future work.

As the program concludes, my commitment to the project remains steadfast. I will continue my involvement in developing the current features, suggesting new additions, and resolving bugs. Below are some of the future improvements:

Voice Model Dashboard Enhancements
Snackbar UI Enhancements
Rasa Dialog API Optimizations
Addition of more voice commands
Writing more tests

Summary

The journey through GSoC’ 23 with enviroCar has been an incredibly enriching and fruitful one. I got the opportunity to step out of my comfort zone and dive deep into technologies I was not very familiar with, from pruning, adapting, and improving machine learning language models to working with Rasa NLU and developing GitHub Actions CI/CD pipelines. As the project was architected and formalized recently in GSoC’ 22, I got a chance to work very closely with the design and architecture of the implementation. This experience deepened my understanding of design concepts and allowed me to enhance the structure wherever necessary.

Working on an Android application with a large codebase and a mixture of legacy and modern code added to my GSoC experience. It was a little overwhelming at first, but it gradually gave me a solid understanding and appreciation of Android fundamentals. I am really grateful for this experience, which will always inspire me to build on my skills as a developer.

I would especially like to thank my mentor Dhiraj for his insights (he introduced VoiceCommands in GSoC’ 22) and for helping me out whenever I was stuck, be it on code or design concepts. His guidance really helped me elevate my development skills. We followed agile methodologies and rigorous sprints, which fostered positive development habits and improved my communication skills, both of which I will carry forward with me.

Finally, I want to extend my heartfelt gratitude to 52° North for giving me this incredible opportunity to dedicate my summer to contributing to the enviroCar Voice Command project as part of Google Summer of Code 2023.

Let’s connect 👋: LinkedIn, Twitter
