A Single Prompt Will Have This AI Rapping and Dancing | HackerNoon

Authors:

(1) Jiaben Chen, University of Massachusetts Amherst;

(2) Xin Yan, Wuhan University;

(3) Yihang Chen, Wuhan University;

(4) Siyuan Cen, University of Massachusetts Amherst;

(5) Qinwei Ma, Tsinghua University;

(6) Haoyu Zhen, Shanghai Jiao Tong University;

(7) Kaizhi Qian, MIT-IBM Watson AI Lab;

(8) Lie Lu, Dolby Laboratories;

(9) Chuang Gan, University of Massachusetts Amherst.

Table of Links

Abstract and 1. Introduction

  2. Related Work

    2.1 Text to Vocal Generation

    2.2 Text to Motion Generation

    2.3 Audio to Motion Generation

  3. RapVerse Dataset

    3.1 Rap-Vocal Subset

    3.2 Rap-Motion Subset

  4. Method

    4.1 Problem Formulation

    4.2 Motion VQ-VAE Tokenizer

    4.3 Vocal2unit Audio Tokenizer

    4.4 General Auto-regressive Modeling

  5. Experiments

    5.1 Experimental Setup

    5.2 Main Results Analysis and 5.3 Ablation Study

  6. Conclusion and References

A. Appendix

Abstract

In this work, we introduce a challenging task for simultaneously generating 3D holistic body motions and singing vocals directly from textual lyrics inputs, advancing beyond existing works that typically address these two modalities in isolation. To facilitate this, we first collect the RapVerse dataset, a large dataset containing synchronous rapping vocals, lyrics, and high-quality 3D holistic body meshes. With the RapVerse dataset, we investigate the extent to which scaling autoregressive multimodal transformers across language, audio, and motion can enhance the coherent and realistic generation of vocals and whole-body human motions. For modality unification, a vector-quantized variational autoencoder is employed to encode whole-body motion sequences into discrete motion tokens, while a vocal-to-unit model is leveraged to obtain quantized audio tokens preserving content, prosodic information and singer identity. By jointly performing transformer modeling on these three modalities in a unified way, our framework ensures a seamless and realistic blend of vocals and human motions. Extensive experiments demonstrate that our unified generation framework not only produces coherent and realistic singing vocals alongside human motions directly from textual inputs, but also rivals the performance of specialized single-modality generation systems, establishing new benchmarks for joint vocal-motion generation. The project page is available for research purposes at https://vis-www.cs.umass.edu/RapVerse.

1 Introduction

In the evolving landscape of multi-modal content generation spanning sound and motion, significant strides have been made in individual modalities, including text-to-music [54, 1, 21], text-to-vocal [32], text-to-motion [13, 69, 4, 23, 34], and audio-to-motion [68, 15, 31] generation. These developments have paved the way for creating more dynamic and interactive digital content. Despite these advancements, existing works predominantly address each modality in isolation. However, there is strong psychological evidence that, for human beings, the generation of sound and motion is highly related and coupled [28]. A unified system for joint generation allows for a more expressive and nuanced communication of emotions, intentions, and context, where the generation of one modality could guide and assist the other in a coherent and efficient way.

In this paper, we tackle a crucial problem: can a machine not only sing with emotional depth but also perform with human-like expressions and motions? We propose a novel task for simultaneously generating coherent singing vocals and whole-body human motions (including body motions, hand gestures, and facial expressions); see Fig. 1. This endeavor holds practical significance in fostering more immersive and naturalistic digital interactions, thereby elevating virtual performances, interactive gaming, and the realism of virtual avatars.

Figure 1: RapVerse. We present a unified text-conditioned multi-modality generation framework for jointly generating holistic body motions and singing vocals from textual lyrics inputs only. Note that the corresponding video frames are shown for reference only.

An important question naturally arises: what constitutes a good model for the unified generation of sound and motion? Firstly, we consider textual lyrics as the proper form of input for the unified system, since text provides a highly expressive, interpretable, and flexible means of conveying information, and can serve as a bridge between various modalities. Previous efforts explore scores [32], action commands [69, 4, 23], or audio signals [68] as inputs, which are inferior to textual inputs in terms of semantic richness, expressiveness, and flexible integration of different modalities.

Secondly, we argue that a joint generation system that produces multi-modal outputs simultaneously is preferable to a cascaded system that executes single-modality generation sequentially. A cascaded system, combining a text-to-vocal module with a vocal-to-motion module, risks accumulating errors across each stage of generation. For instance, a misinterpretation in the text-to-vocal phase can lead to inaccurate motion generation, thereby diluting the intended coherence of the output. Furthermore, cascaded architectures necessitate multiple training and inference phases across different models, substantially increasing computational demands.

To build such a joint generation system, the primary challenges include: 1) the scarcity of datasets that provide lyrics, vocals, and 3D whole-body motion annotations simultaneously; and 2) the need for a unified architecture capable of coherently synthesizing vocals and motions from text. In response to these challenges, we have curated RapVerse, a large-scale dataset featuring a comprehensive collection of lyrics, singing vocals, and 3D whole-body motions. Although datasets exist for text-to-vocal [32, 22, 8, 55], text-to-motion [44, 35, 13, 30], and audio-to-motion [3, 15, 12, 9, 5, 65] generation, the landscape lacks a unified dataset that encapsulates singing vocals, whole-body motion, and lyrics simultaneously. Most notably, large text-to-vocal datasets [22, 70] are predominantly in Chinese, limiting their applicability for English-language research, and lack any motion data. Text-to-motion datasets [44, 13, 30] typically focus on text descriptions of specific actions paired with corresponding motions, without audio data and often not covering whole-body movements. Moreover, audio-to-motion datasets [32, 33] focus primarily on speech rather than singing. A comparison of existing related datasets is given in Table 1. The RapVerse dataset is divided into two distinctive parts to cater to a broad range of research needs: 1) a Rap-Vocal subset containing a large number of pairs of vocals and lyrics, and 2) a Rap-Motion subset encompassing vocals, lyrics, and human motions. The Rap-Vocal subset contains 108.44 hours of high-quality English singing voice in the rap genre without background music. Paired lyrics and vocals from 32 singers are crawled from the Internet, with careful cleaning and post-processing. The Rap-Motion subset contains 26.8 hours of rap performance videos with 3D holistic body mesh annotations in SMPL-X parameters [42], obtained using the annotation pipeline of Motion-X [30], along with synchronous singing vocals and corresponding lyrics.
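
To make the dataset organization concrete, below is a minimal sketch of how one synchronized Rap-Motion sample could be represented in code; the class and field names are illustrative assumptions for this article, not the dataset's actual schema or loading API.

from dataclasses import dataclass
import numpy as np

@dataclass
class RapVerseSample:
    """Illustrative structure of one synchronized sample (field names are hypothetical)."""
    lyrics: str                # textual lyrics for the segment
    vocal_wav: np.ndarray      # mono vocal waveform, no background music
    sample_rate: int           # audio sampling rate in Hz
    smplx_params: np.ndarray   # (T, D) per-frame SMPL-X parameters: body pose,
                               # hand pose, jaw/expression, global orientation
    fps: float                 # motion frame rate
    singer_id: int             # one of the singer identities in the corpus

def duration_seconds(sample: RapVerseSample) -> float:
    """Audio duration; the motion track covers the same time span at `fps`."""
    return len(sample.vocal_wav) / sample.sample_rate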

With the RapVerse dataset, we explore how far we can push by simply scaling autoregressive multimodal transformers across language, audio, and motion for a coherent and realistic generation of vocals and whole-body human motions. To this end, we unify different modalities as token representations. Specifically, three VQ-VAEs [63] are utilized to compress whole-body motion sequences into three separate streams of discrete tokens for the head, body, and hands, respectively. For vocal generation, previous works [37, 7, 32] share a common paradigm: producing mel-spectrograms of audio signals from input textual features and additional music score information, followed by a vocoder [40, 62, 67] to reconstruct the phase. We draw inspiration from the speech resynthesis domain [45] and learn a self-supervised discrete representation to quantize raw audio signals into discrete tokens while preserving the vocal content and prosodic information. Then, with all the inputs in discrete representations, we leverage a transformer to predict the discrete codes of audio and motion in an autoregressive fashion. Extensive experiments demonstrate that this straightforward unified generation framework not only produces realistic singing vocals alongside human motions directly from textual inputs but also rivals the performance of specialized single-modality generation systems.
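
To illustrate the unified autoregressive modeling idea, the following is a minimal sketch in which lyrics tokens, quantized audio units, and motion codes share one vocabulary and a decoder-style transformer is trained with next-token prediction; the vocabulary layout, sizes, and module choices are illustrative assumptions rather than the paper's actual configuration.

import torch
import torch.nn as nn

class JointTokenLM(nn.Module):
    """Sketch of unified autoregressive modeling over text, audio, and motion tokens.
    Vocabulary sizes and layout are illustrative assumptions."""

    def __init__(self, text_vocab=512, audio_vocab=1024, motion_vocab=3 * 512,
                 d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        # One shared vocabulary: [text | audio units | head/body/hand motion codes]
        self.vocab_size = text_vocab + audio_vocab + motion_vocab
        self.audio_offset = text_vocab
        self.motion_offset = text_vocab + audio_vocab
        self.embed = nn.Embedding(self.vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, self.vocab_size)

    def forward(self, tokens: torch.LongTensor) -> torch.Tensor:
        # Causal mask so each position only attends to earlier tokens.
        T = tokens.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=tokens.device), 1)
        h = self.backbone(self.embed(tokens), mask=mask)
        return self.head(h)  # next-token logits over the joint vocabulary

# Training-step sketch: lyrics tokens form the prompt; audio units and motion
# codes (each already quantized by their tokenizers) follow in the sequence.
model = JointTokenLM()
lyric_ids = torch.randint(0, 512, (1, 32))
audio_ids = torch.randint(0, 1024, (1, 64)) + model.audio_offset
motion_ids = torch.randint(0, 3 * 512, (1, 48)) + model.motion_offset
seq = torch.cat([lyric_ids, audio_ids, motion_ids], dim=1)
logits = model(seq[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, model.vocab_size),
                                   seq[:, 1:].reshape(-1))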

To sum up, this paper makes the following contributions:

• We release RapVerse, a large dataset featuring synchronous singing vocals, lyrics, and high-quality 3D holistic SMPL-X parameters.

• We design a simple but effective unified framework for the joint generation of singing vocals and human motions from text with a multi-modal transformer in an autoregressive fashion.

• To unify representations of different modalities, we employ a vocal-to-unit model to obtain quantized audio tokens and utilize compositional VQVAEs to get discrete motion tokens.

• Experimental results show that our framework rivals the performance of specialized single-modality generation systems, setting new benchmarks for joint generation of vocals and motion.
