Analogy-making vs. prediction: a debate on the philosophy of automated music generation

This blog post introduces the “hidden methodology” behind many core works on deep music generation in Music X Lab. We hope it can help you better distill the main idea from technical details.
– Gus, Dec 2023

Background of automated music generation

For a long time, music generation has been modeled as a “sequence prediction” problem. Much like estimating a stock price, the task is to predict a music sequence from some context, say, to predict the melody given the underlying chords, or to predict the upcoming notes based on the existing ones. However, we observed an intrinsic defect in this “generation via prediction” approach: in any given context, there are many ways/directions in which the music could develop, so which one should the model follow?

Unfortunately, most data-driven approaches would not commit to any particular direction, but rather to an averaged version of all possibilities. This often leads to mediocre music without a clear structure, which is why most AI-made music still lacks the exploration and dynamics found in genuinely creative works. Even worse, there is little room to interact with these black-box models except for sampling from the learned distribution repeatedly. From a musician’s perspective, composing a piece is certainly very different from predicting a stock price. What we need is not a “correct” estimate, but a creative choice. Moreover, we wish to interpret and control such choices as a way to extend our own musical expression.

A different philosophy – generation via analogy-making

A solution to the problems above lies in analogy-making, or, in modern terms, music style transfer [1]. The underlying idea is that most creation is not entirely new but a recombination of existing features (representations). A simple example (in the visual domain) is a “red rabbit”: a rare thing in nature, yet almost everyone can create an image of one in their mind by applying “red” (one concept) to “rabbit” (another concept). Similarly, if a model can learn useful music concepts, we can generate new music by making an analogy, e.g., what if piece A were re-composed/re-arranged using a different feature (chord, texture, form, etc.)?

Assume we have M music features and N pieces, each piece taking a unique value for each feature. Through analogy-making, we can in theory create (N^M − N) new pieces by recombining the M features of the N pieces. That is a huge gain! Through the lens of causality, such “generation via analogy-making” belongs to “counterfactual reasoning”, i.e., imagining something non-existent based on observations.
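To make the counting concrete, here is a tiny toy sketch (the pieces and feature names are purely hypothetical placeholders) that enumerates all recombinations of M = 3 features across N = 4 pieces and confirms the N^M − N count:

```python
# Toy illustration of the combinatorial gain from analogy-making.
# "Pieces" and "features" are hypothetical placeholders, not real data.
from itertools import product

N_PIECES = 4                                  # N observed pieces
FEATURES = ["chord", "texture", "form"]       # M = 3 features

# Each piece contributes one observed value per feature.
feature_values = {
    f: [f"{f}_of_piece_{i}" for i in range(N_PIECES)] for f in FEATURES
}

# Every combination of one value per feature is a possible piece: N^M in total.
all_combinations = list(product(*feature_values.values()))

# The N originals are the combinations whose values all come from the same piece.
originals = {tuple(feature_values[f][i] for f in FEATURES) for i in range(N_PIECES)}

new_pieces = [c for c in all_combinations if c not in originals]
print(len(all_combinations), len(new_pieces))  # 4^3 = 64 total, 64 - 4 = 60 new
```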

Coming back to deep learning

The good news is that deep representation learning models are good feature learners, and we have already seen pioneering work on neural image style transfer. Now, it is our mission to work out solutions to: 1) learn abstract representations of music, 2) disentangle the representations into human-interpretable parts (concepts) using various inductive biases, and 3) control the generation by manipulating (sampling, interpolating, recombining, etc.) different music concepts.


Figure 1. The generation-via-analogy-making methodology: first learn interpretable concepts via representation disentanglement, and then perform controllable music generation.

The graph above is referred to as the “trinity of interpretable representation learning” in Music X Lab. Many of our core works follow this methodology: disentangling the “melody contour” and “rhythmic pattern” of monophonic melody using EC2VAE [2], disentangling the “chords” and “texture” of piano scores using Poly-dis [3], learning “piano texture” from both score and audio in A2S [4], learning the “orchestration function” of multi-track polyphonic pieces in Q&A [5], and disentangling “pitch” and “timbre” using a unified model for zero-shot source separation, transcription and synthesis [6], etc. Several more recent works, e.g., whole-song generation [7] and AccoMontage-3 [8], even apply interpretable representations in a hierarchical setup.
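As a concrete illustration of the latent-swap idea behind such models, here is a minimal, hypothetical PyTorch sketch (it is not the actual EC2VAE architecture or code; the dimensions and module names are placeholders, and the variational/KL machinery is omitted for brevity):

```python
# A minimal sketch of analogy-making by swapping disentangled latent codes.
import torch
import torch.nn as nn

class DisentangledEncoder(nn.Module):
    """Maps a melody (a piano-roll-like tensor) to two latent codes:
    z_pitch for pitch contour and z_rhythm for rhythmic pattern."""
    def __init__(self, input_dim=130, hidden_dim=256, z_dim=64):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.to_pitch = nn.Linear(hidden_dim, z_dim)
        self.to_rhythm = nn.Linear(hidden_dim, z_dim)

    def forward(self, x):
        _, h = self.gru(x)            # h: (1, batch, hidden_dim)
        h = h.squeeze(0)
        return self.to_pitch(h), self.to_rhythm(h)

class Decoder(nn.Module):
    """Reconstructs a melody from a concatenated (z_pitch, z_rhythm) pair."""
    def __init__(self, z_dim=64, hidden_dim=256, input_dim=130, seq_len=32):
        super().__init__()
        self.seq_len = seq_len
        self.expand = nn.Linear(2 * z_dim, hidden_dim)
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, input_dim)

    def forward(self, z_pitch, z_rhythm):
        z = torch.cat([z_pitch, z_rhythm], dim=-1)
        h = self.expand(z).unsqueeze(1).repeat(1, self.seq_len, 1)
        y, _ = self.gru(h)
        return self.out(y)

# Analogy-making by latent swap: keep A's pitch contour, borrow B's rhythm.
encoder, decoder = DisentangledEncoder(), Decoder()
melody_a = torch.randn(1, 32, 130)    # placeholder piano-roll tensors
melody_b = torch.randn(1, 32, 130)
zp_a, _ = encoder(melody_a)
_, zr_b = encoder(melody_b)
new_melody_logits = decoder(zp_a, zr_b)   # "what if A had B's rhythm?"
```

Training such a model requires inductive biases (e.g., supervision or data augmentation on each factor) so that the two codes actually capture contour and rhythm separately; the sketch only shows the inference-time analogy step.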

On unifying sequence generation and representation disentanglement

Even better news is that “generation via prediction” and “generation via analogy-making” do not necessarily conflict; rather, we can unify the two methodologies. A straightforward approach we often use is to feed the learned disentangled representations into whatever generative model we like (autoregressive, diffusion, masked language model, etc.). In other words, we let the learned disentangled representations be the “language” of the generative model. The underlying idea is that interpretable concepts should be useful features for (downstream) generation tasks.

We can either simply use these representations as “controls”, or, even better, also ask the model to predict the disentangled representations themselves. Such “representation-enhanced generation” has two major benefits. First, the (entire) generation process becomes more interpretable and controllable. Second, the generation results are usually much more coherent, as the model now produces music feature-by-feature, measure-by-measure, and sometimes even phrase-by-phrase, rather than naively note-by-note or MIDI-event-by-MIDI-event.
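As a rough sketch of what “letting the representations be the language of the generative model” can look like in practice, here is a hypothetical PyTorch example of an autoregressive note decoder conditioned on per-bar disentangled codes (the module names, token vocabulary, and per-bar chord/texture codes are illustrative assumptions, not any specific Music X Lab model):

```python
# Sketch of representation-enhanced generation: note tokens are decoded
# autoregressively while attending to bar-level disentangled codes.
import torch
import torch.nn as nn

class RepresentationConditionedDecoder(nn.Module):
    def __init__(self, vocab_size=400, d_model=256, z_dim=64, n_layers=4):
        super().__init__()
        self.note_emb = nn.Embedding(vocab_size, d_model)
        self.rep_proj = nn.Linear(2 * z_dim, d_model)   # chord code + texture code
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, note_tokens, bar_reps):
        # note_tokens: (batch, T) token ids; bar_reps: (batch, n_bars, 2*z_dim)
        tgt = self.note_emb(note_tokens)
        memory = self.rep_proj(bar_reps)                # the bar-level "language"
        T = note_tokens.size(1)
        causal = torch.triu(                            # standard causal mask
            torch.full((T, T), float("-inf"), device=note_tokens.device), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.head(out)                           # next-token logits

model = RepresentationConditionedDecoder()
notes = torch.randint(0, 400, (1, 64))   # placeholder event tokens
reps = torch.randn(1, 8, 128)            # 8 bars of (chord, texture) codes
logits = model(notes, reps)
```

Swapping or editing any row of `reps` before decoding is exactly the analogy-making control described above.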

Here are some examples of controllable generation using disentangled representations: 1) piano arrangement by predicting texture from a lead sheet using Poly-dis [3], 2) a more flexible control of chords and texture using Polyffusion [9], 3) flexible music inpainting using long-term (4-bar to 8-bar) melody and rhythm representations [10][11], 4) automatic orchestration based on the “orchestration function representation” and chords using AccoMontage-3 [8], and 5) whole-song generation [7], a model that applies different feature controls at different levels of the compositional hierarchy.

In the end, feature extraction and sequence modeling should go hand in hand. Concretely, we recently showed that predictive modeling can “return the favor” and help representation disentanglement, i.e., sequence prediction can serve as an inductive bias for more interpretable disentanglement. The underlying idea is that a good representation should help us better predict the future. We will leave it as a rough idea in this blog, but you can check out SPS [12] and also a line of research on self-supervised learning in the last suggested follow-up reading.
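To give that rough idea a shape, here is a generic, hypothetical illustration of prediction as an inductive bias, i.e., adding a latent prediction loss alongside reconstruction (this is not the SPS method itself; the architectures and loss weight are arbitrary placeholders):

```python
# Sketch: an encoder is trained so that the latent of bar t also helps
# predict the latent of bar t+1, encouraging predictable, structured codes.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(1), nn.Linear(32 * 130, 64))   # bar -> latent
decoder = nn.Sequential(nn.Linear(64, 32 * 130))                  # latent -> bar
predictor = nn.Linear(64, 64)                                     # z_t -> z_{t+1}

bars = torch.randn(8, 16, 32, 130)        # 8 songs x 16 bars (placeholder data)
z = encoder(bars.flatten(0, 1)).view(8, 16, 64)

recon = decoder(z.flatten(0, 1)).view_as(bars.flatten(2))
recon_loss = nn.functional.mse_loss(recon, bars.flatten(2))

# Predictive inductive bias: z at bar t should (here, linearly) predict z at bar t+1.
pred_loss = nn.functional.mse_loss(predictor(z[:, :-1]), z[:, 1:].detach())

loss = recon_loss + 0.1 * pred_loss       # the weight is an arbitrary choice
loss.backward()
```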

Suggested follow-up reading

  • Symbolic music representation learning and disentanglement: from monophonic to polyphonic (coming soon)
  • Multimodal music representation learning: on connecting MIR with music generation (coming soon)
  • Hierarchical disentanglement: towards effective and efficient sequence modeling (coming soon)
  • What is a proper inductive bias for self-supervised learning? (coming soon)

References

  1. G. Xia, S. Dai, “Music Style Transfer: A Position Paper,” in Proc. 6th International Workshop on Musical Metacreation, Spain, June 2018. ↩︎
  2. R. Yang, D. Wang, Z. Wang, T. Chen, J. Jiang and G. Xia. “Deep Music Analogy Via Latent Representation Disentanglement,” in Proc. 20th International Society for Music Information Retrieval Conference, Delft, Nov 2019. ↩︎
  3. Z. Wang, D. Wang, Y. Zhang, G. Xia, “Learning Interpretable Representation for Controllable Polyphonic Music Generation,” in Proc. 21st International Society for Music Information Retrieval Conference, Montréal, Oct 2020. ↩︎
  4. Z. Wang, D. Xu, G. Xia, Y. Shan, “Audio-to-symbolic Arrangement via Cross-modal Music Representation Learning,” in Proc. 47th International Conference on Acoustics, Speech and Signal Processing, Singapore & Online, May 2022. ↩︎
  5. J. Zhao, G. Xia, Y. Wang, “Q&A: Query-Based Representation Learning for Multi-Track Symbolic Music Re-Arrangement,” in Proc. 32nd International Joint Conference on Artificial Intelligence, Macao, August 2023. ↩︎
  6. L. Lin, Q. Kong, J. Jiang, G. Xia. “A unified model for zero-shot music source separation, transcription and synthesis,” in Proc. 22nd International Society for Music Information Retrieval Conference, Online, Nov 2021. ↩︎
  7. Under review. ↩︎
  8. J. Zhao, G. Xia, Y. Wang. “AccoMontage-3: Full-Band Accompaniment Arrangement via Sequential Style Transfer and Multi-Track Function Prior.” arXiv preprint arXiv:2310.16334, 2023. ↩︎
  9. L. Min, J. Jiang, G. Xia, J. Zhao, “Polyffusion: A Diffusion Model for Polyphonic Score Generation With Internal and External Controls,” in Proc. 24th International Society for Music Information Retrieval Conference, Italy, Nov 2023. ↩︎
  10. S. Wei, G. Xia, W. Gao, L. Lin, Y. Zhang, “Music Phrase Inpainting Using Long-term Representation and Contrastive Loss,” in Proc. 47th International Conference on Acoustics, Speech and Signal Processing, Singapore & Online, May 2022. ↩︎
  11. S. Wei, Z. Wang, W. Gao, G. Xia, “Controllable Music Inpainting With Mixed-level and Disentangled Representation,” in Proc. 48th International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, June 2023. ↩︎
  12. X. Liu, D. Chin, Y. Huang, G. Xia, “Learning Interpretable Low-dimensional Representation via Physical Symmetry,” in Advances in Neural Information Processing Systems, New Orleans, US, Dec 2023. ↩︎

The big picture of Music X Lab’s research

“Music is the one incorporeal entrance into the higher world of knowledge, 
which comprehends mankind but which mankind cannot comprehend.” 
Ludwig van Beethoven

Our Vision

Music manifests the most complex and subtle forms of creation of the human mind. The composition process of music is almost free from the limitations of the physical world, fully leveraging imagination and creative intelligence. For this reason, Beethoven refers to music as “incorporeal” and to creative intelligence, the magnificent faculty accessible to humans, as the “higher world of knowledge.” From an AI perspective, the best way to uncover the mystery of creative intelligence is to realize it via computational efforts: to conceive being from void, to develop many from one, to construct the whole from parts, to make analogies among seemingly distant scenarios; and music is a perfect subject for this endeavor.

On the other hand, the appreciation of music involves profound subjective experiences, especially aesthetic perception, which goes beyond the utilities and cost functions that can be easily measured by static equations. The inner feelings, the dynamic notions of beauty, taste, and goodness, and the self, the “I”, are what make us “mankind” and what machines have yet to encompass. Hence, teaching machines to perceive the structures, expressions, and representations of music, and to appreciate music with taste, is essentially to incorporate humanity into intelligent agents.

Our Teams and Projects

On the one hand, we are musicians, and we are curious about how exactly gifted musicians understand and create music. On the other hand, we are computer scientists, and we believe that the best way to uncover the mystery of musicianship is to re-create it via computational efforts in a human-centered way. That is why we have been developing various intelligent systems that help people better create, perform, and learn music.

Three of our most representative projects are: 1) deep music representation learning and style transfer, 2) human-computer interactive performance, and 3) computer-aided multimodal music tutoring. The first is a new field (as well as a hot topic since 2018) that lies at the core of deep learning and relates to many other domains such as NLP and CV, and we were lucky to be one of the teams that laid the groundwork. The other two projects both have great practical value and, at the same time, call for truly interdisciplinary efforts (music practice, educational psychology, hardware and interface design, real-time systems, etc.). We were proud to help promote them as the host of NIME 2021 via the conference theme “learning to play, playing to learn.”

In the big picture, these three projects aim to seamlessly “close the loop” for the next-generation AI-extended music experience, in which I envision a workflow as follows: i) a user first sketches a short melody segment or a motif, ii) a music-generation agent extends it into a full song with accompaniment, while the user is free to transfer the style of any part of the piece, iii) a tutoring agent helps the user learn to play the piece via interactive visual and haptic feedback, and finally, iv) the user and an accompaniment agent perform the piece together on a (virtual) stage.

Suggested follow-up reading

Analogy-making vs. prediction: a debate on the philosophy of automated music generation