Why Our Tiny Training Set Beat Giants in Cross-Lingual Speech Retrieval | HackerNoon

Authors:

(1) Frank Palma Gomez, Boston University (work done during an internship at Google Research);

(2) Ramon Sanabria, The University of Edinburgh (work done during an internship at Google DeepMind);

(3) Yun-hsuan Sung, Google Research;

(4) Daniel Cer, Google Research;

(5) Siddharth Dalmia, Google DeepMind and Equal Advising Contributions;

(6) Gustavo Hernandez Abrego, Google Research and Equal Advising Contributions.

Table of Links

Abstract and 1 Introduction

2 Method

3 Data and Tasks

4 Model

5 Experiments

6 Related Work

7 Conclusion

8 Acknowledgements and References

A Appendix

5 Experiments

We train our DE model to perform S2T, where the task is to retrieve the corresponding transcription given a speech sample. We train on the 21 languages from CoVoST-2 and evaluate our model using the S2T portion of FLEURS in 102 languages.
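As a rough illustration of the contrastive objective typically used to train such a dual encoder, the sketch below computes a symmetric in-batch softmax loss over paired speech and text embeddings. This is a generic formulation under stated assumptions, not the paper's exact code: the encoders, the temperature value, and the batching are all assumptions.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(speech_emb, text_emb, temperature=0.05):
    """Symmetric in-batch softmax loss for a dual encoder (sketch).

    speech_emb, text_emb: (batch, dim) embeddings of paired speech
    samples and their transcriptions. Row i of each tensor is a
    matched pair; every other row in the batch acts as a negative.
    """
    speech_emb = F.normalize(speech_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = speech_emb @ text_emb.T / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Diagonal entries are the positives; average both directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```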

5.1 Speech-to-Text Retrieval

Table 1 shows the average R@1 and WER for S2T across the 102 languages from FLEURS. We compare against the mSLAM DE model from Conneau et al. (2023), a model trained on 426k hours of S2T data in 51 languages and fine-tuned on the FLEURS training data. Our model significantly outperforms the mSLAM DE baseline on both R@1 and WER despite being trained with only 1/10 of the data and having been initialized from a text-only LLM. More importantly, our model was trained only on the 21 languages in CoVoST-2 and was never fine-tuned on the FLEURS training data.
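For reference, R@1 in this setting simply measures how often the paired transcription is the nearest neighbor of a speech embedding. A minimal sketch, assuming cosine similarity over matched (speech, text) embedding matrices:

```python
import torch
import torch.nn.functional as F

def recall_at_1(speech_emb, text_emb):
    """Fraction of speech samples whose nearest text embedding
    (by cosine similarity) is the paired transcription.

    speech_emb, text_emb: (n, dim); row i of each is a matched pair.
    """
    sims = F.normalize(speech_emb, dim=-1) @ F.normalize(text_emb, dim=-1).T
    predictions = sims.argmax(dim=1)     # index of nearest transcription
    targets = torch.arange(sims.size(0))
    return (predictions == targets).float().mean().item()
```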

5.1.1 Seen-Unseen Breakdown

In Figure 2 we break down the R@1 scores by languages seen and unseen during training. We find that our model performs best on the 20 languages that appear in both the training and evaluation data, but it still performs remarkably well on the remaining 82 unseen languages. We hypothesize this is due to the vast multilingual text data our backbone LLM saw during pre-training.

Figure 2: R@1 transcription retrieval for seen and unseen languages in the training set.

Table 2: FLEURS S2T (R@1) performance broken down by language groups. Bold represents better performance. Numbers in parentheses represent the number of languages within each language group.

5.1.2 Language Group Breakdown

Table 2 shows the R@1 language group breakdown for S2T on FLEURS. We find that although we only trained on 21 languages, our model significantly outperforms mSLAM DE in 13 of the 15 language groups. These results are consistent with the experiments in Hassid et al. (2023) which explore the effect of initializing speech language models from pre-trained LLMs.

5.2 Evaluating on Cross-Modal and Cross-Lingual Tasks

We evaluate on S2TT to gauge the cross-modal and cross-lingual capabilities of our model. We show that we can further improve S2TT by simply training on a mixture of S2T and translation data without using any S2TT training data.

Figure 3: BLEU scores for FLEURS zero-shot S2TT when training on Transcripts or Transcripts + Translations for PaLM 2 DE. Combining transcript and translation data improves zero-shot S2TT retrieval.

5.2.1 Zero-Shot S2TT

Given the multilingual capabilities of our backbone language model, we explore whether these capabilities transfer after training our model contrastively on the S2T task. We hypothesize that our model should showcase cross-lingual and cross-modal capabilities due to the cross-modal training task and the cross-lingual capabilities of the backbone LLM. We evaluate S2TT in a zero-shot setting to assess our model's performance retrieving English translations given a speech sample in another language. Using the FLEURS S2TT portion, we evaluate S2TT X → En in 4 languages: German, Polish, French, and Dutch.
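A minimal sketch of how zero-shot S2TT can be framed as retrieval over a pool of English translation candidates. The pooling and scoring details here are assumptions for illustration, not the paper's exact evaluation code:

```python
import torch
import torch.nn.functional as F

def retrieve_translations(speech_emb, en_text_emb, k=1):
    """Zero-shot S2TT as retrieval (sketch): for each non-English
    speech embedding, rank a pool of English translation embeddings
    by cosine similarity and return the top-k candidate indices.
    """
    sims = F.normalize(speech_emb, dim=-1) @ F.normalize(en_text_emb, dim=-1).T
    return sims.topk(k, dim=1).indices  # (n_speech, k)
```

The retrieved candidates can then be scored against reference translations (e.g., with BLEU, as in Figure 3).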

Figure 3 shows BLEU S2TT performance when training on S2T data from CoVoST-2 in 21 languages; we call this setup Transcripts in Figure 3. Our results demonstrate that even when training our model only on speech and transcriptions, we can achieve some zero-shot S2TT performance. We find that S2TT BLEU scores are considerably higher for languages present in the S2T training data; for example, Polish was not in the S2T training data, and consequently its BLEU scores are the lowest.

5.2.2 Improving S2TT with MT Data

To further improve our model's cross-lingual performance, we add readily available translation data from Schwenk et al. (2019). For each batch, we combine 25% translation and 75% S2T data. Figure 3 shows a comparison of training only on S2T (Transcripts) against combining S2T and translation data (Transcripts + Translations). We find that combining S2T and translation data significantly improves the S2TT BLEU scores in all 4 languages without training on any S2TT data. This finding demonstrates that we can improve our model's cross-lingual performance with highly accessible translation data, without needing scarce and often expensive speech-to-text translation training data.
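A minimal sketch of the described batch mixture, assuming in-memory lists of (speech, text) transcription pairs and (text, text) translation pairs; the sampling scheme itself is an illustration, not the authors' pipeline:

```python
import random

def mixed_batch(s2t_pairs, mt_pairs, batch_size=32, mt_fraction=0.25):
    """Sample one training batch with ~25% (text, text) translation
    pairs and ~75% (speech, text) transcription pairs, mirroring
    the mixture described in Section 5.2.2.
    """
    n_mt = int(batch_size * mt_fraction)
    batch = random.sample(mt_pairs, n_mt) + \
            random.sample(s2t_pairs, batch_size - n_mt)
    random.shuffle(batch)  # interleave the two data sources
    return batch
```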

6 Related Work

The success of pre-trained LLMs has motivated the application of these models in different modalities. Lakhotia et al. (2021) transformed speech into pseudo-text units to introduce the task of generative spoken language modeling. Borsos et al. (2023) introduced a framework to generate audio with long-term consistency. Consequently, Hassid et al. (2023) showed that SpeechLMs benefit from being initialized from pre-trained LLMs, while Rubenstein et al. (2023) demonstrated that pre-trained LLMs can be adapted to various tasks that require text and speech understanding.

On the other hand, several works aim to build joint speech and text representations. Chung et al. (2021) introduced w2v-bert, which combines masked language modeling and contrastive learning to create speech representations. Bapna et al. (2022) jointly pre-train on unsupervised speech and text data. Recently, Duquenne et al. (2023) employed separate speech and text encoders to generate embeddings in over 200 languages. Nevertheless, there is still a lack of understanding of whether joint speech and text representations can be built from a single encoder. We fill this gap by using pre-trained LLMs to jointly train on speech samples and their transcriptions, showing that our approach is capable of speech-text matching in 102 languages.
