5. Ablation Study
Semantic units and prompt acoustic units form a chain of S2ST prompts for SEAMLESSEXPRESSIVELM. A natural question is how effective this prompt design is for speech LM training. We therefore evaluate alternative prompting strategies as an ablation study.
5.1 Chain-of-Thought Prompting
For comparison, we train a model without chain-of-thought prompting. This variant provides essentially the same information as CoT prompting does for model training, but uses multi-task learning in place of multi-step reasoning. In Table 3, the row “no chain-of-thought” shows drops of 9.82 and 4.38 ASR-BLEU for Es-En and Hu-En respectively, suggesting that CoT helps the model better preserve semantics in the translation process.
To quantify the importance of the semantic prompt in modeling, we experiment with another strategy that removes target semantic units from CoT prompting. Specifically, the model is trained to directly predict target acoustic units conditioned on source semantic units and prompt acoustic units.
5.1.1 Semantic Prompt
The row “no semantic prompt” in Table 3 shows semantic degradation, with drops of 10.61 ASR-BLEU in Es-En and 5.32 in Hu-En, suggesting that the semantic prompt plays a critical role in providing semantic cues for S2ST modeling.
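The ablated variants differ only in how the unit streams are laid out in the LM training sequence. The following sketch illustrates the three layouts; all token names (`src_sem`, `prompt_ac`, `"<sep>"`, etc.) are illustrative placeholders, not the model's actual vocabulary:

```python
def build_sequence(src_sem, prompt_ac, tgt_sem, tgt_ac, variant="cot"):
    """Concatenate unit streams into LM training sequences.

    Illustrative sketch of the ablation variants; separator token and
    stream names are assumptions, not the paper's real tokenization.
    """
    SEP = ["<sep>"]
    if variant == "cot":
        # Full chain-of-thought: predict target semantic units first,
        # then target acoustic units, in one multi-step sequence.
        return src_sem + SEP + prompt_ac + SEP + tgt_sem + SEP + tgt_ac
    if variant == "no_cot":
        # Same information, but as two separate tasks
        # (multi-task learning in place of multi-step reasoning).
        semantic_task = src_sem + SEP + tgt_sem
        acoustic_task = src_sem + SEP + prompt_ac + SEP + tgt_ac
        return semantic_task, acoustic_task
    if variant == "no_semantic_prompt":
        # Target semantic units removed: acoustic units are predicted
        # directly from source semantics and the acoustic prompt.
        return src_sem + SEP + prompt_ac + SEP + tgt_ac
    raise ValueError(f"unknown variant: {variant}")
```

The growing ASR-BLEU gap across the rows of Table 3 mirrors how much semantic scaffolding each layout retains before acoustic generation.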
5.2 Acoustic Prompt
Moreover, as a portion of the target speech is used as the acoustic prompt in SEAMLESSEXPRESSIVELM training, an important hyperparameter in this work is the prompt ratio. Taking Spanish-to-English as an example, we train multiple models with three prompt ratio ranges: (0.20, 0.25), (0.25, 0.30) and (0.30, 0.35). For each training sample, a prompt ratio is uniformly sampled from the given range. At inference, we apply different prompt ratios to test samples and measure how ASR-BLEU and VSim change.
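The per-sample split can be sketched as follows; the function name and default range are illustrative assumptions mirroring the setup above:

```python
import random

def split_acoustic_prompt(tgt_acoustic_units, ratio_range=(0.25, 0.30), rng=random):
    """Split target acoustic units into (prompt, continuation).

    A prompt ratio is drawn uniformly from ratio_range for each training
    sample; the leading portion of the target units serves as the
    acoustic prompt, and the model learns to generate the remainder.
    Illustrative sketch, not the paper's actual data pipeline.
    """
    ratio = rng.uniform(*ratio_range)
    cut = max(1, int(len(tgt_acoustic_units) * ratio))
    return tgt_acoustic_units[:cut], tgt_acoustic_units[cut:]
```

Sampling the ratio from a range, rather than fixing it, exposes the model to prompts of varying length, which matters because the test-time prompt ratio is itself varied in the evaluation below.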
As shown in Figure 2, the model trained with prompt range (0.25, 0.30) achieves the highest ASR-BLEU at a test prompt ratio of 0.3. A short acoustic prompt cannot provide sufficient acoustic information for translation, while a long acoustic prompt may encourage the model to copy the prompt verbatim, since it is taken from the target speech. Models trained with (0.25, 0.30) and (0.30, 0.35) achieve their best ASR-BLEU when the test prompt ratio is set to 0.30, whereas the ASR-BLEU of the model trained with (0.20, 0.25) drops as the test prompt ratio increases. VSim improves consistently with larger test prompt ratios in all three models.