Researchers at Facebook have developed what they claim is the largest automatic speech recognition (ASR) model of its kind: a single model capable of understanding 51 languages after training on over 16,000 hours of voice recordings. In a paper describing the work, the co-authors say the system improves performance by up to 28.8% on a benchmark in comparison with baseline models.
Usually, an ASR engine understands only a single language, so a voice assistant needs multiple such models to communicate in more than one language. A single model that recognizes speech in multiple languages is highly desirable in automatic speech recognition for several reasons, not least because it simplifies the back-end production pipeline. Facebook's design puts all the languages into a single model. It uses voice data collected from public, anonymized videos on Facebook to analyze not only what someone is saying but also which language they are speaking.
Studies have shown that jointly training multilingual models on similar languages can lower the overall word error rate (WER).
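For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words. A minimal illustration (not from the paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (Levenshtein).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```

A 28.8% improvement, as reported in the paper, means a 28.8% relative reduction in this ratio.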
How Facebook's speech recognition model works
Facebook's speech recognition model, a joint sequence-to-sequence (seq2seq) model, was trained by sharing the parameters of an encoder, a decoder, and a token set across all the languages. The encoder maps input audio sequences to intermediate representations, the decoder maps those representations to output text, and the shared token set streamlines working with many languages whose sentences are sampled at different frequencies.
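The key structural idea, one model object serving every language rather than one model per language, can be sketched in a toy way. All names and stub computations below are hypothetical placeholders, not Facebook's code:

```python
class SharedSeq2Seq:
    """Toy sketch: a single encoder/decoder pair shared by all languages."""

    def __init__(self, token_set):
        self.token_set = token_set            # shared sub-word vocabulary
        self.encoder_params = {"layers": 24}  # placeholder for shared encoder weights
        self.decoder_params = {"layers": 12}  # placeholder for shared decoder weights

    def encode(self, audio_frames):
        # Map input audio to an intermediate representation (stub computation).
        return [f * 0.5 for f in audio_frames]

    def decode(self, hidden):
        # Map the intermediate representation to output tokens (stub computation).
        return [self.token_set[int(h) % len(self.token_set)] for h in hidden]

# The same model object handles utterances from any of the 51 languages;
# no per-language model needs to be loaded or maintained.
model = SharedSeq2Seq(token_set=["▁the", "▁cat", "s"])
tokens = model.decode(model.encode([2.0, 4.0]))
```

The production benefit mentioned above falls out of this shape: one set of weights to deploy, monitor, and update instead of 51.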
The researchers divided the 51 languages into distinct groups, with a separate decoder for each group. Then, for each language group, they selected 10,000 sub-word units as the token set. In the next step, they manually combined a few of the smaller language groups until they ended up with six groups in total. This prevented the group sizes from becoming overly skewed by the number of languages they contain.
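The merging step can be illustrated with a small sketch. The grouping below is hypothetical (the paper's actual groups are not listed in this article), and the greedy smallest-two merge rule is an assumption for illustration:

```python
def merge_smallest(groups, target=6):
    """Merge the two smallest language groups until `target` groups remain."""
    groups = {name: list(langs) for name, langs in groups.items()}
    while len(groups) > target:
        # Pick the two groups containing the fewest languages and combine them.
        a, b = sorted(groups, key=lambda g: len(groups[g]))[:2]
        groups[f"{a}+{b}"] = groups.pop(a) + groups.pop(b)
    return groups

# Hypothetical starting groups (ISO language codes for illustration only).
initial = {
    "germanic": ["en", "de", "nl", "sv", "no"],
    "romance": ["fr", "es", "it", "pt", "ro"],
    "slavic": ["ru", "pl", "cs", "uk"],
    "indic": ["hi", "bn", "mr"],
    "turkic": ["tr", "az"],
    "semitic": ["ar", "he"],
    "bantu": ["sw"],
    "baltic": ["lt"],
}
merged = merge_smallest(initial)  # eight groups collapse into six
```

After merging, no group is left with only one or two languages, which keeps the per-group decoders and token sets comparably sized.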
According to Facebook, the ASR model contains around a billion parameters, which makes its speech recognition far better than that of conventional models.
How does Facebook's speech recognition model compare with others so far?
The authors explain that this work by Facebook is the first to study multilingual systems at such a massive scale, showing that it is possible to train a single, massive ASR architecture. They found that the single model performs better than 51 separate monolingual baselines while taking considerably less time to tune.
Facebook's interest in creating a single model that can understand and communicate in numerous languages goes beyond academics. The company has been investing in conversational AI on several fronts. It recently launched a new open-source chatbot named 'Blender', which it claims is more advanced than rivals such as Google's Meena chatbot. Blender is designed to converse with a user on almost any subject and to display empathy. Facebook wants to keep collecting voice data from users to train its speech recognition engines, and a multilingual setup is a necessity if it wants to compete globally. Although Alexa and Google Assistant can already speak numerous languages, they have limited multilingual modes. Depending on the speaker's location, Alexa can identify and reply to users who speak English, Spanish, French, or Hindi, while Google Assistant can be bilingual, responding in English plus one other supported language.
Hence, Facebook's speech recognition model is claimed to be the first ASR system tuned to recognize and respond to such a multitude of languages.
Usage stats and parameters
The training data set was created from anonymized videos publicly shared on Facebook. These are divided into three categories:
1) High-resource languages (e.g. English, Hindi, French), with over 600 hours of training data.
2) Mid-resource languages (e.g. Bengali, Japanese, Russian), with 300-500 hours of training data.
3) Low-resource languages (e.g. Norwegian, Swahili, Lithuanian), with 100-150 hours of training data.
Following certain guidelines, these videos were transcribed, and the model's hyperparameters (the parameters whose values control the learning process) were tuned.
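Given the skew between the tiers above (600+ hours versus 100-150 hours), multilingual training typically cannot sample utterances in proportion to raw data volume, or low-resource languages would barely be seen. One common balancing trick, assumed here for illustration rather than stated in this article, is temperature-based sampling:

```python
def sampling_weights(hours, temperature=5.0):
    """Flatten a skewed data distribution: weight_i proportional to hours_i ** (1/T)."""
    scaled = {lang: h ** (1.0 / temperature) for lang, h in hours.items()}
    total = sum(scaled.values())
    return {lang: w / total for lang, w in scaled.items()}

# One hypothetical language per tier, with hours matching the categories above.
hours = {"english": 600.0, "japanese": 400.0, "swahili": 120.0}
weights = sampling_weights(hours)
# Low-resource languages end up sampled more often than their raw share of hours.
```

With temperature 1.0 this reduces to proportional sampling; larger temperatures push the distribution toward uniform, giving low-resource languages more training exposure.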
Across several experiments, the Facebook researchers report that the best-performing version of their model improved WER by an average of 9.1% for high-resource languages, 12.44% for mid-resource languages, and 28.76% for low-resource languages. The model also performed well on low-resource languages it had never seen before, including Traditional Chinese, Persian, and Telugu.