Analogy-making vs. prediction: a debate on the philosophy of automated music generation

This blog post introduces the “hidden methodology” behind many core works on deep music generation in Music X Lab. We hope it helps you distill the main ideas from the technical details.
– Gus, Dec 2023

Background of automated music generation

For a long time, music generation has been modeled as a "sequence prediction" problem. Much like estimating a stock price, the task is to predict a music sequence from some context, say, to predict the melody given the underlying chords, or to predict the upcoming notes based on existing ones. However, we observed an intrinsic defect of such a generation-via-prediction approach: in any given context there are many ways/directions to develop the music, so which one should the model follow?

Unfortunately, most data-driven approaches do not commit to any particular direction, but rather produce an averaged version of all possibilities. This often leads to mediocre music without a clear structure, which is why most AI-made music still lacks the exploration and dynamics of genuinely creative work. Even worse, there is little room to interact with these black-box models beyond sampling from the learned distribution repeatedly. From a musician's perspective, composing a piece is certainly very different from predicting a stock price. What we need is not a "correct" estimate, but a creative choice. Moreover, we wish to interpret and control such choices as a way to extend our own musical expression.

A different philosophy – generation via analogy-making

A solution to the problems above lies in analogy-making, or, in modern terms, music style transfer [1]. The underlying idea is that most creation is not entirely new but a recombination of existing features (representations). A simple example (in the visual domain) is a "red rabbit": a rare thing in nature, yet almost everyone can form one in their mind by applying "red" (one concept) to "rabbit" (another concept). Similarly, if a model can learn useful music concepts, we can generate new music by making an analogy, e.g., what if piece A were re-composed/re-arranged using a different feature (chord, texture, form, etc.)?

Assume we have M music features and N pieces, with each piece taking a unique value for each feature. Through analogy-making, we can in theory create (N^M − N) new pieces by recombining the M features of the N pieces. That is a huge gain! Through the lens of causality, such "generation via analogy-making" belongs to "counterfactual reasoning", i.e., imagining something non-existent based on observations.
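To make the idea concrete, here is a minimal sketch of analogy-making over disentangled features. The class and attribute names (ToyDisentangledModel, chord_enc, texture_enc) and all dimensions are hypothetical placeholders for illustration, not the interface of any particular model; the last two lines also spell out the N^M − N count.

```python
# A minimal sketch of analogy-making over disentangled features.
# All names and dimensions here are hypothetical placeholders,
# not the interface of any specific Music X Lab model.
import torch
import torch.nn as nn

class ToyDisentangledModel(nn.Module):
    def __init__(self, input_dim=128, z_dim=64):
        super().__init__()
        self.chord_enc = nn.Linear(input_dim, z_dim)    # z_chord: the harmony concept
        self.texture_enc = nn.Linear(input_dim, z_dim)  # z_texture: the "how it is played" concept
        self.decoder = nn.Linear(2 * z_dim, input_dim)  # maps the two concepts back to music

    def recombine(self, piece_a, piece_b):
        """Re-arrange piece A using the texture of piece B (a musical analogy)."""
        z_chord = self.chord_enc(piece_a)
        z_texture = self.texture_enc(piece_b)
        return self.decoder(torch.cat([z_chord, z_texture], dim=-1))

model = ToyDisentangledModel()
piece_a, piece_b = torch.randn(1, 128), torch.randn(1, 128)  # toy stand-ins for two pieces
new_piece = model.recombine(piece_a, piece_b)                # "A's chords played with B's texture"

# With N pieces and M disentangled features, recombination yields N**M - N new pieces:
N, M = 1000, 2
print(N**M - N)  # 999000 pieces that never existed in the dataset
```

The untrained linear layers are only stand-ins; the point is the recombination pattern, not the architecture.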

Coming back to deep learning

The good news is that deep representation learning models are good feature learners, and we have already seen pioneering work on neural image style transfer. Now, it is our mission to work out solutions to: 1) learn abstract representations of music, 2) disentangle the representations into human-interpretable parts (concepts) using various inductive biases, and 3) control the generation by manipulating (sampling, interpolating, recombining, etc.) different music concepts.


Figure 1. The generation-via-analogy-making methodology: first learn interpretable concepts via representation disentanglement, and then perform controllable music generation.

The graph above is referred to as the "trinity of interpretable representation learning" in Music X Lab. Many of our core works follow this methodology: disentangling "melody contour" and "rhythmic pattern" of monophonic melodies using EC2-VAE [2], disentangling "chords" and "texture" of piano scores using Poly-dis [3], learning "piano texture" from both score and audio in A2S [4], learning the "orchestration function" of multi-track polyphonic pieces in Q&A [5], disentangling "pitch" and "timbre" with a unified model for zero-shot source separation, transcription, and synthesis [6], and so on. Several more recent works, e.g., whole-song generation [7] and AccoMontage-3 [8], even apply interpretable representations in a hierarchical setup.
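As a rough illustration of step 2 of the trinity, here is a hedged sketch of disentanglement via an inductive bias: an auxiliary decoder must recover chord labels from one half of the latent code alone, nudging that half toward harmony and the other half toward everything else. The shapes, losses, and layers below are assumptions chosen for brevity, not the actual architecture of EC2-VAE or Poly-dis.

```python
# A hedged sketch of disentanglement via an inductive bias (not the actual
# EC2-VAE / Poly-dis architecture): an auxiliary chord decoder reads only
# z_chord, so harmony information is pushed into that half of the latent code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangleSketch(nn.Module):
    def __init__(self, music_dim=256, num_chords=36, z_dim=128):
        super().__init__()
        self.encoder = nn.Linear(music_dim, 2 * z_dim)    # -> [z_chord | z_texture]
        self.music_dec = nn.Linear(2 * z_dim, music_dim)  # reconstruct the whole segment
        self.chord_dec = nn.Linear(z_dim, num_chords)     # inductive bias: chords from z_chord only
        self.z_dim = z_dim

    def forward(self, music, chord_labels):
        z = self.encoder(music)
        z_chord, z_texture = z[:, :self.z_dim], z[:, self.z_dim:]
        recon = self.music_dec(torch.cat([z_chord, z_texture], dim=-1))
        chord_logits = self.chord_dec(z_chord)
        return F.mse_loss(recon, music) + F.cross_entropy(chord_logits, chord_labels)

model = DisentangleSketch()
music = torch.randn(8, 256)           # a toy batch of piano-roll-like segments
chords = torch.randint(0, 36, (8,))   # toy chord labels for the same segments
loss = model(music, chords)
loss.backward()                       # trains the encoder and both decoders jointly
```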

On unifying sequence generation and representation disentanglement

Even better news is that "generation via prediction" and "generation via analogy-making" do not necessarily conflict; rather, we can unify the two methodologies. A straightforward approach we often use is to feed the learned disentangled representations into whatever generative model we like (autoregressive, diffusion, masked language model, etc.). In other words, let the learned disentangled representations be the "language" of the generative model. The underlying idea is that interpretable concepts should be useful features for (downstream) generation tasks.

We can either just use these representations as "controls", or, even better, also ask the model to predict these disentangled representations. Such "representation-enhanced generation" has two major benefits. First, the (entire) generation process becomes more interpretable and controllable. Second, the generation results are usually much more coherent, as the model now produces music feature-by-feature, measure-by-measure, and sometimes even phrase-by-phrase, rather than naively note-by-note or MIDI-event-by-MIDI-event.
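The pattern can be sketched as follows: an autoregressive prior predicts the next bar's disentangled codes, and a (pretrained) decoder then renders them into notes. The names, dimensions, and GRU backbone below are assumptions for illustration; the same pattern applies to diffusion or masked-language-model backbones.

```python
# A sketch of representation-enhanced generation: predict the next bar's
# disentangled codes (z_chord, z_texture), then decode them into notes.
# All names, shapes, and the GRU backbone are illustrative assumptions.
import torch
import torch.nn as nn

class BarLevelPrior(nn.Module):
    """Generation via prediction, but over interpretable features instead of raw notes."""
    def __init__(self, z_dim=128, hidden=512):
        super().__init__()
        self.rnn = nn.GRU(2 * z_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2 * z_dim)  # -> next bar's [z_chord | z_texture]

    def forward(self, z_seq):                     # z_seq: (batch, bars, 2 * z_dim)
        h, _ = self.rnn(z_seq)
        return self.head(h[:, -1])                # codes predicted for the next bar

prior = BarLevelPrior()
z_history = torch.randn(1, 4, 256)                # codes of the first 4 bars
z_next = prior(z_history)                         # predicted codes for bar 5
# A pretrained decoder (omitted here) would map z_next back to notes, and a user
# can still overwrite, say, the chord half of z_next to steer the generation.
```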

Here are some examples of controllable generation using disentangled representations: 1) piano arrangement by predicting texture from a lead sheet using Poly-dis [3], 2) more flexible control of chords and texture using Polyffusion [9], 3) flexible music inpainting using long-term (4- to 8-bar) melody and rhythm representations [10] [11], 4) automatic orchestration based on an "orchestration function representation" and chords using AccoMontage-3 [8], and 5) whole-song generation [7], a model that applies different feature controls at different levels of the compositional hierarchy.

In the end, feature extraction and sequence modeling should go hand in hand. Concretely, we recently showed that predictive modeling can "return the favor" and help representation disentanglement: sequence prediction itself can serve as an inductive bias for more interpretable disentanglement. The underlying idea is that a good representation should help us better predict the future. We will leave it as a rough idea in this blog, but you can check out SPS [12] as well as the line of research on self-supervised learning in the last suggested follow-up reading below.
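As a very rough sketch of that idea (not the actual SPS formulation), one can add a term asking a simple predictor to forecast the next frame's latent from the current one, so the encoder is pushed toward representations whose temporal dynamics are easy to predict. All modules and sizes below are illustrative assumptions.

```python
# A rough sketch of "prediction as an inductive bias" (not the actual SPS [12]
# formulation): besides reconstruction, a simple predictor must forecast the
# next frame's latent, pushing the encoder toward predictable, structured codes.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(128, 16)          # frame -> low-dimensional latent
predictor = nn.Linear(16, 16)         # latent_t -> predicted latent_{t+1}
decoder = nn.Linear(16, 128)          # latent -> reconstructed frame

frames = torch.randn(32, 10, 128)     # toy batch: 32 sequences of 10 frames
z = encoder(frames)                   # (32, 10, 16)
recon_loss = F.mse_loss(decoder(z), frames)
pred_loss = F.mse_loss(predictor(z[:, :-1]), z[:, 1:].detach())
(recon_loss + pred_loss).backward()   # both terms jointly shape the latent space
```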

Suggested follow-up reading

  • Symbolic music representation learning and disentanglement: from monophonic to polyphonic (coming soon)
  • Multimodal music representation learning: on connecting MIR with music generation (coming soon)
  • Hierarchical disentanglement: towards effective and efficient sequence modeling (coming soon)
  • What is a proper inductive bias for self-supervised learning? (coming soon)

References

  1. G. Xia, S. Dai, "Music Style Transfer: A Position Paper," in Proc. 6th International Workshop on Musical Metacreation, Spain, June 2018.
  2. R. Yang, D. Wang, Z. Wang, T. Chen, J. Jiang, and G. Xia, "Deep Music Analogy Via Latent Representation Disentanglement," in Proc. 20th International Society for Music Information Retrieval Conference, Delft, Nov 2019.
  3. Z. Wang, D. Wang, Y. Zhang, and G. Xia, "Learning Interpretable Representation for Controllable Polyphonic Music Generation," in Proc. 21st International Society for Music Information Retrieval Conference, Montréal, Oct 2020.
  4. Z. Wang, D. Xu, G. Xia, and Y. Shan, "Audio-to-symbolic Arrangement via Cross-modal Music Representation Learning," in Proc. 47th International Conference on Acoustics, Speech and Signal Processing, Singapore & Online, May 2022.
  5. J. Zhao, G. Xia, and Y. Wang, "Q&A: Query-Based Representation Learning for Multi-Track Symbolic Music Re-Arrangement," in Proc. 32nd International Joint Conference on Artificial Intelligence, Macao, August 2023.
  6. L. Lin, Q. Kong, J. Jiang, and G. Xia, "A Unified Model for Zero-shot Music Source Separation, Transcription and Synthesis," in Proc. 22nd International Society for Music Information Retrieval Conference, Online, Nov 2021.
  7. Under review.
  8. J. Zhao, G. Xia, and Y. Wang, "AccoMontage-3: Full-Band Accompaniment Arrangement via Sequential Style Transfer and Multi-Track Function Prior," arXiv preprint arXiv:2310.16334, 2023.
  9. L. Min, J. Jiang, G. Xia, and J. Zhao, "Polyffusion: A Diffusion Model for Polyphonic Score Generation With Internal and External Controls," in Proc. 24th International Society for Music Information Retrieval Conference, Italy, Nov 2023.
  10. S. Wei, G. Xia, W. Gao, L. Lin, and Y. Zhang, "Music Phrase Inpainting Using Long-term Representation and Contrastive Loss," in Proc. 47th International Conference on Acoustics, Speech and Signal Processing, Singapore & Online, May 2022.
  11. S. Wei, Z. Wang, W. Gao, and G. Xia, "Controllable Music Inpainting With Mixed-level and Disentangled Representation," in Proc. 48th International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, June 2023.
  12. X. Liu, D. Chin, Y. Huang, and G. Xia, "Learning Interpretable Low-dimensional Representation via Physical Symmetry," in Advances in Neural Information Processing Systems, New Orleans, US, Dec 2023.
