HM.AI: Outperforming Suno v5, Tencent and Tsinghua jointly release SongGeneration2: tackling unclear pronunciation and off-key singing, with support for local deployment

On March 9, 2026, the music foundation model SongGeneration2, jointly developed by Tencent and Tsinghua University's Human Computer Voice Interaction Laboratory, was officially released. The news landed like a bombshell, making waves across the technology and music industries.

Technological Innovation: Targeting the Three Pain Points of AI Music

In the past, AI music often had a "plastic" feel, and several pressing problems remained unsolved. SongGeneration2 arrives like a sharp blade aimed squarely at these pain points.

High musicality: easy to handle complex arrangements
Traditional AI music often consists of simple melodic overlays, lacking depth and richness. SongGeneration2 is different: it can handle complex multi-track arrangements and create a strong sense of spatial layering. Whether the piece is passionate rock or melodious classical music, it handles both with ease, making listeners feel as if they are at a professional performance.
High lyric accuracy: say goodbye to unclear pronunciation and off-key lyrics
In AI music, unclear pronunciation and hallucinated lyrics that drift away from the intended text are common problems that seriously affect quality. SongGeneration2 has made a significant breakthrough here, with a phoneme error rate (PER) of only 8.55%. That is significantly better than the top commercial model Suno v5 (12.4%) and only slightly behind MiniMax 2.5, so every line of the lyrics reaches the listener clearly.
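To make the PER figure concrete, here is a minimal sketch of how a phoneme error rate is conventionally computed: the edit distance between the reference phoneme sequence and the recognized one, divided by the reference length. The article does not describe the evaluation pipeline actually used for SongGeneration2, so the function names and the sample phoneme sequences below are illustrative assumptions, not the project's tooling.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phoneme symbols."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference phoneme missing in hyp
                curr[j - 1] + 1,     # insertion: extra phoneme in hyp
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """PER = (substitutions + deletions + insertions) / reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: the sung vowel at the end is recognized incorrectly.
reference = ["HH", "AH", "L", "OW"]
recognized = ["HH", "AH", "L", "UW"]
print(f"PER = {phoneme_error_rate(reference, recognized):.2%}")
```

Under this definition, a PER of 8.55% would mean roughly 8 or 9 phoneme-level errors per 100 reference phonemes, assuming the reported number follows the standard formula.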
Extremely controllable: precise customization of style and emotions
Whether guided by text descriptions or audio prompts, SongGeneration2 follows instructions accurately and allows deep customization of a piece's style and emotion, so creators can shape distinctive works to fit their own needs.

Architecture innovation: a "dual-core" design in pursuit of excellence

SongGeneration2 owes this performance to its innovative hybrid LLM-diffusion architecture.

Composition Brain (LeLM): Global Planning and Detail Control

LeLM is like an experienced composer, responsible for planning the overall structure of the music and the details of the singing. It keeps a firm grip on rhythm, melody, and harmony, solving the key problem of "how to sing" and laying a solid foundation for the piece.

High fidelity renderer (Diffusion): Synthesizing complex acoustic details

Guided by the language model, the Diffusion renderer synthesizes extremely complex acoustic details. It is like a skilled tuner, polishing every note and giving the music very high sound quality and realism.

Layered representation: Balancing melody and sound quality

SongGeneration2 pioneered a parallel modeling approach that combines a mixed representation with a multi-track representation, balancing the stability of the melody against the fineness of the sound quality. This design gives the music smooth melodies, rich timbres, and delicate emotional expression at the same time.

Open source benefits: lowering the threshold for creativity and bringing composition to everyone

For developers, the open-sourcing of SongGeneration2 is a major boon. The SongGeneration-v2 large model, with 4B parameters, has been officially open-sourced and supports multilingual generation in Chinese, English, and other languages.

Even more surprising, it runs smoothly on consumer-grade hardware with 22GB of video memory, making local, private creation possible. Ordinary users can take part in music creation without expensive professional equipment.
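As a rough sanity check on these figures, the sketch below estimates how much memory the weights of a 4B-parameter model occupy at common numeric precisions. The article does not say which precision the released model uses or how the 22GB requirement breaks down, so the precisions listed here are assumptions and the helper name is hypothetical. Weights are only part of the footprint; activations, caches, and any additional components would account for the rest.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Memory taken by model weights alone, in GiB (1 GiB = 1024**3 bytes)."""
    return num_params * bytes_per_param / 1024**3


NUM_PARAMS = 4_000_000_000  # "4B parameters", as stated in the article

# Assumed precisions; the article does not specify which one is used.
for name, nbytes in [("float32", 4), ("float16/bfloat16", 2), ("int8", 1)]:
    print(f"{name:>18}: {weight_memory_gib(NUM_PARAMS, nbytes):.2f} GiB")
```

At 16-bit precision the weights alone come to about 7.5 GiB, which leaves headroom within a 22GB budget for everything this estimate leaves out.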

To let users try SongGeneration-v2 sooner, the project team has also launched the SongGeneration-v2 Fast version on HuggingFace. This version trades a very small amount of sound quality for much faster generation: a complete song can be produced in just one minute, greatly improving creative efficiency.

Summary: The era of "everyone a composer" may be coming

The performance of SongGeneration2 shows that AI music has moved from "geek toy" to the doorstep of commercial application. With a Medium model that supports 12GB of video memory and an automated evaluation framework both due to be open-sourced in the future, the threshold for AI music creation will drop further, and more people will have the chance to become "composers".