World of Software
Copyright © All Rights Reserved. World of Software.

Text-to-Rap AI Turns Lyrics Into Vocals, Gestures, and Facial Expressions | HackerNoon

News Room
Last updated: 2025/08/07 at 10:26 PM
News Room Published 7 August 2025

Table of Links

Abstract and 1. Introduction

  2. Related Work

    2.1 Text to Vocal Generation

    2.2 Text to Motion Generation

    2.3 Audio to Motion Generation

  3. RapVerse Dataset

    3.1 Rap-Vocal Subset

    3.2 Rap-Motion Subset

  4. Method

    4.1 Problem Formulation

    4.2 Motion VQ-VAE Tokenizer

    4.3 Vocal2unit Audio Tokenizer

    4.4 General Auto-regressive Modeling

  5. Experiments

    5.1 Experimental Setup

    5.2 Main Results Analysis and 5.3 Ablation Study

  6. Conclusion and References

A. Appendix

Given a piece of lyric text, our goal is to generate rap-style vocals and whole-body motions, including body movements, hand gestures, and facial expressions, that resonate with the lyrics. With the help of our RapVerse dataset, we propose a novel framework that not only represents text, vocals, and motions in a unified token form but also models these tokens in a single unified model. As illustrated in Fig. 3, our model consists of multiple tokenizers for motion (Sec. 4.2) and vocal (Sec. 4.3) token conversion, as well as a general Large Text-Motion-Audio Foundation Model (Sec. 4.4) that targets audio token synthesis and motion token creation based on rap lyrics.

4.1 Problem Formulation

Figure 3: Pipeline overview. We first pre-train all tokenizers on vocal-only and motion-only data. Once the modality tokenizers are pre-trained, we can unify text, vocal, and motion in the same token space. We adopt a mixing algorithm that organizes the input tokens so that they align along the temporal axis. These mixed input tokens are fed into the large Text-Motion-Audio foundation model, which is trained on token prediction tasks guided by the encoded features of the textual input.
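The caption describes mixing tokens from different modalities so that they align along the temporal axis, but this excerpt does not give the exact algorithm. The following is a minimal sketch of one way such mixing could work, assuming each modality emits a fixed number of tokens per time step. The function name `mix_tokens`, the `per_step` rates, and the token names are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def mix_tokens(streams: Dict[str, List[str]], per_step: Dict[str, int]) -> List[str]:
    """Interleave per-modality token streams so that tokens covering the
    same time step sit next to each other in the mixed sequence.

    `per_step[name]` is the (assumed) number of tokens that modality
    `name` produces for one time step.
    """
    # Only complete time steps that exist in every stream are mixed.
    n_steps = min(len(toks) // per_step[name] for name, toks in streams.items())
    mixed: List[str] = []
    for t in range(n_steps):
        # Modalities are visited in the insertion order of `streams`.
        for name, toks in streams.items():
            k = per_step[name]
            mixed.extend(toks[t * k:(t + 1) * k])
    return mixed


# Hypothetical example: two vocal tokens and one body token per time step.
streams = {
    "vocal": ["<vocal_00001>", "<vocal_00002>", "<vocal_00003>", "<vocal_00004>"],
    "body": ["<body_00007>", "<body_00008>"],
}
mixed = mix_tokens(streams, {"vocal": 2, "body": 1})
print(mixed)
```

In this sketch the ordering of modalities within a time step is fixed, which keeps the sequence easy to split back into per-modality streams after generation.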

4.2 Motion VQ-VAE Tokenizer

4.3 Vocal2unit Audio Tokenizer

Overall, we leverage the self-supervised framework [45] from the speech resynthesis domain to learn vocal representations from audio sequences. Specifically, we train a Vocal2unit audio tokenizer to build a discrete tokenized representation of the human singing voice. The vocal tokenizer consists of three encoders and a vocoder. The three encoders are (1) the semantic encoder, (2) the F0 encoder, and (3) the singer encoder. We introduce each component of the model separately.

4.4 General Auto-regressive Modeling

After optimization with this training objective, our model learns to predict the next token, which can be decoded into features of the corresponding modality. This process resembles word generation in language models, except that a “word” in our method, such as <face_02123>, carries no explicit semantic information but can be decoded into continuous modality features.
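To make the idea of a semantics-free "word" concrete, here is a small sketch of how a token such as <face_02123> could be split into a modality name and a codebook index, and then mapped to a continuous vector. The token pattern is inferred from the single example in the text, and the toy codebook, its values, and the helper names are assumptions for illustration. In a real VQ-VAE the looked-up code vectors would still pass through a learned decoder.

```python
import re
from typing import Dict, List, Tuple

# Assumed token shape: "<modality_index>", e.g. "<face_02123>".
TOKEN_RE = re.compile(r"<([a-z]+)_(\d+)>")


def parse_token(token: str) -> Tuple[str, int]:
    """Split a modality token into (modality, codebook index)."""
    match = TOKEN_RE.fullmatch(token)
    if match is None:
        raise ValueError(f"not a modality token: {token!r}")
    return match.group(1), int(match.group(2))


def lookup_codes(
    tokens: List[str], codebooks: Dict[str, Dict[int, List[float]]]
) -> List[List[float]]:
    """Map each token to its continuous code vector via a codebook lookup."""
    return [codebooks[modality][index] for modality, index in map(parse_token, tokens)]


# Toy codebook with made-up two-dimensional entries.
codebooks = {"face": {2123: [0.1, -0.4], 17: [0.9, 0.2]}}

print(parse_token("<face_02123>"))
print(lookup_codes(["<face_02123>", "<face_00017>"], codebooks))
```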

Inference and Decoupling. In the inference stage, we use different start tokens to specify which modality to generate. The textual input is encoded as features to guide token inference. Because generating vocals and motions from lyrics is a creative process with many valid outputs, we also adopt top-k sampling and adjust the temperature to control the diversity of the generated content. After token prediction, a decoupling algorithm processes the output tokens to ensure that tokens from different modalities are separated and temporally aligned. These discrete tokens are then decoded into text-aligned vocals and motions.
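The two inference steps described above can be sketched in a few lines. The code below is an illustrative sketch, not the authors' implementation: `sample_top_k` shows standard top-k sampling with temperature over a list of logits, and `decouple` shows one simple way to separate a mixed token sequence by modality, assuming tokens are named like <face_02123>. Both function names and the example tokens are hypothetical.

```python
import math
import random
from typing import Dict, List, Optional


def sample_top_k(
    logits: List[float],
    k: int,
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample a token index from the k highest-scoring logits.

    A lower temperature sharpens the distribution (less diverse output);
    a higher temperature flattens it (more diverse output).
    """
    if k <= 0 or temperature <= 0:
        raise ValueError("k and temperature must be positive")
    rng = rng or random.Random()
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature-scaled softmax weights; subtracting the max avoids overflow.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


def decouple(tokens: List[str]) -> Dict[str, List[str]]:
    """Split a mixed sequence into per-modality streams, preserving order."""
    streams: Dict[str, List[str]] = {}
    for token in tokens:
        modality = token[1:].split("_", 1)[0]  # "<face_02123>" -> "face"
        streams.setdefault(modality, []).append(token)
    return streams


logits = [0.1, 2.5, -1.0, 1.7]
choice = sample_top_k(logits, k=2, temperature=0.8, rng=random.Random(0))
separated = decouple(["<vocal_00001>", "<face_02123>", "<vocal_00002>", "<face_00017>"])
print(choice, separated)
```

Because order is preserved within each stream, tokens that were adjacent in time in the mixed sequence keep their relative positions after separation.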

Authors:

(1) Jiaben Chen, University of Massachusetts Amherst;

(2) Xin Yan, Wuhan University;

(3) Yihang Chen, Wuhan University;

(4) Siyuan Cen, University of Massachusetts Amherst;

(5) Qinwei Ma, Tsinghua University;

(6) Haoyu Zhen, Shanghai Jiao Tong University;

(7) Kaizhi Qian, MIT-IBM Watson AI Lab;

(8) Lie Lu, Dolby Laboratories;

(9) Chuang Gan, University of Massachusetts Amherst.

