(Hypebot) — Artificial intelligence (AI) is transforming the art world. But will it do the same for music creation and consumption, and what could the consequences be?
I’ve been getting this question a LOT recently. With news of DALL-E and Midjourney churning out mind-blowing visual art from a short text prompt, the obvious question is: why not music?
There is a lot to get excited about as media and art forms continue to get the “AI treatment.” My team and I have been working in music automation since 2019, building an engine that scales song creation through AI production to build a better royalty-free music service for video creators.
As the conversation around generative art heats up, I’d like to share how we’re seeing the industry evolve from our vantage point, where the challenges lie, and what we’re most excited about.
Defining “AI music”
I put it in quotation marks because, just like the term “metaverse,” “AI music” means different things to different people and has become a catchall phrase. However, there are only a few real applications of AI as it relates to music today:
- AI to aid music and audio production. Examples include Google’s Magenta plug-ins, which use AI to help producers make music; iZotope’s RX 9, which restores audio using machine learning; and AudioShake, which uses AI for audio stem separation.
- AI music algorithms based on neural networks (traditional machine learning), which presumably learn from open-source datasets of music in MIDI format. This is broadly the same learn-from-examples methodology that DALL-E and Midjourney use to train their algorithms, applied to music data instead of images. The music data is then “synthesized” into audio files you can hear in near real time. Examples include Jukebox by OpenAI, AIVA and Endel.
- AI-driven hybrid models where certain functions are automated while other functions or inputs remain human-made. Examples include Splice’s CoSo app and Tuney.
Challenges to Overcome
Looking at the “AI music” categories above, each one has its limitations when it comes to spitting out great-sounding results with consistency.
AI used to aid music production requires an experienced producer or audio engineer to deliver the finished song. The tools help to kick-start the creative process with generated ideas and help finish the process with AI mixing, mastering, and audio processing. But the core part of writing, recording, producing, and finishing a great song is still very much on the artist’s shoulders. Most of the AI projects that generated early buzz around AI music with a great-sounding finished product, like the Lost Tapes of the 27 Club, required a LOT of human effort to finish, with AI playing a smaller role in the overall creation.
AI music algorithms based on neural networks are generating music from nothing, but with very limited quality so far. While there are countless subjective quality measurements in music (e.g. “it’s got a vibe”), the two objective ones that AI struggles with are:
- Sonic quality – while the technology that generates the music ideas is quite sophisticated, synthesizing real instruments using software is still a ways off from a perfect copy of the real thing. Furthermore, the best software instruments are under lock and key, protected by the $3B music software industry. Most AI music projects are using open-source or self-developed instrument sounds, and they don’t come close to even the best software instruments, let alone live instruments recorded in a studio. Sonic quality has an effect on how music sounds and feels, which is what separates a professionally-produced song from the rest.
- Compositional quality – music connects with humans emotionally, and the music business is largely monetizing the way music makes you feel. Artists communicate their emotions through their instruments, picking the melody and chord combinations that represent the way they’re feeling during the songwriting process, which in turn resonates with listeners. Since AI can’t do this, it generates random melodies and chords based on music theory rules but can’t match a desired emotion or mood without a significant amount of human intervention. Music is also less forgiving. A pixel here or there in the visual world won’t be scrutinized as much, nor will it take away from the whole piece the way that a sour note can. If you’re off by one note anywhere within a song, though, the listener will know it instantly.
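To make the “random melodies based on music theory rules” point concrete, here is a toy sketch in Python. It is an illustration only, not how any product named in this article works: the scale, the step rule, and the function name `random_melody` are all assumptions made for the example. The output always stays in key, but nothing in the code knows or cares what emotion the melody should carry.

```python
import random

# MIDI note numbers for one octave of C major, starting at middle C (60).
C_MAJOR = [60, 62, 64, 65, 67, 69, 71, 72]


def random_melody(length, seed=None):
    """Return `length` (>= 1) MIDI notes from C major, moving in small steps.

    Hypothetical helper for illustration: the only "rules" are "stay in the
    scale" and "don't jump more than two scale degrees at a time".
    """
    rng = random.Random(seed)
    index = rng.randrange(len(C_MAJOR))
    melody = [C_MAJOR[index]]
    for _ in range(length - 1):
        # Pick a small step up or down, then clamp it to the one-octave range.
        step = rng.choice([-2, -1, 0, 1, 2])
        index = max(0, min(len(C_MAJOR) - 1, index + step))
        melody.append(C_MAJOR[index])
    return melody


melody = random_melody(8, seed=1)
print(melody)
```

Every note this produces is “correct” in the theory sense, which is exactly the limitation: being in key is not the same as sounding hopeful, tense, or nostalgic.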
AI-driven hybrid models are the most exciting for me right now because they offer the best of both worlds: a practical way to use AI to scale creativity in music. Hybrid models allow us to experiment with automation where it makes sense, while maintaining the human touch in other areas to maximize quality. The limitation here is scale, because every step that depends on a human caps how much can be produced. But by being purposeful about where the human touch comes in, we can use these models to scale human creativity instead of replacing it. This is Tuney’s mission.
How can we make great music that has commercial viability and interactivity in the digital world, with minimal human effort and maximum return? A hybrid model that automates the song-finishing process allows artists to make more money while doing less work, and it enables non-musicians to interact with music in a whole new way.
Is anyone close to a “Midjourney for music?”
Yes and no. Yes, in that there are great music automation projects that allow us to generate songs in some capacity using a set of text and audio inputs.
But also no, because the specific format of true “text to art” is a bit trickier with music for 3 reasons:
Firstly, music inputs are more subjective than visual inputs. A hat is a hat. Red is red. It even has a numerical value! But a happy song or a “vibey groove” can mean a lot of different things depending on who the listener is.
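Here is a small Python sketch of that difference. The color values are the standard CSS definitions of “red” and “blue”; everything on the music side (the `HAPPY_CANDIDATES` list and its tempo and mode values) is made up for illustration and is not taken from any real system.

```python
# Standard CSS color names resolve to exactly one RGB triple each.
CSS_COLORS = {
    "red": (255, 0, 0),
    "blue": (0, 0, 255),
}


def to_hex(rgb):
    """Format an (r, g, b) tuple as an uppercase hex color string."""
    return "#{:02X}{:02X}{:02X}".format(*rgb)


def resolve_color(word):
    """A visual prompt word maps to a single, objective value."""
    return to_hex(CSS_COLORS[word])


# Hypothetical readings of the word "happy" (illustrative values only).
# Each one is defensible, and different listeners would pick different ones.
HAPPY_CANDIDATES = [
    {"tempo_bpm": 120, "mode": "major"},
    {"tempo_bpm": 95, "mode": "major"},
    {"tempo_bpm": 128, "mode": "minor"},
]

print(resolve_color("red"))   # prints #FF0000
print(len(HAPPY_CANDIDATES))  # more than one plausible answer for "happy"
```

The color lookup has one right answer. The mood lookup has a list of candidates, and choosing among them is where the subjectivity comes in.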
Secondly, audio synthesis is far more complex than visual synthesis. Pixels are objective in terms of color and placement when defining different objects, subjects and even painting styles. But synthesizing a decent-sounding cello from thin air is something the synthesizer industry has been working on for decades, and “fake strings” still sound quite fake (and those are actually using real string recordings as a basis!). Not to mention synthesizing melodic voices (AI voices do quite well with spoken word but way less so with melodic singing).
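For a sense of what synthesizing “from thin air” means at its most basic, here is a minimal additive-synthesis sketch in Python using only the standard library. It is a simplification under stated assumptions: the harmonic amplitudes are arbitrary illustrative numbers, not measurements of a real cello, and the function name `additive_tone` is hypothetical.

```python
import math

SAMPLE_RATE = 44100  # samples per second (CD-quality rate)


def additive_tone(freq_hz, duration_s, harmonic_amps):
    """Sum sine waves at integer multiples of `freq_hz`.

    `harmonic_amps[k]` is the positive amplitude of harmonic k + 1.
    The result is scaled by the total amplitude so it stays within [-1, 1].
    """
    num_samples = round(SAMPLE_RATE * duration_s)
    total = sum(harmonic_amps)
    samples = []
    for i in range(num_samples):
        t = i / SAMPLE_RATE
        value = sum(
            amp * math.sin(2 * math.pi * freq_hz * (k + 1) * t)
            for k, amp in enumerate(harmonic_amps)
        )
        samples.append(value / total)
    return samples


# 0.1 seconds of a 220 Hz tone with four harmonics (illustrative amplitudes).
tone = additive_tone(220.0, 0.1, [1.0, 0.5, 0.33, 0.25])
print(len(tone))
```

A static sum of harmonics like this sounds closer to an organ than to a bowed string. A real instrument’s tone changes from moment to moment with the attack, the bow pressure and the vibrato, and none of that is captured here, which is part of why “fake strings” still sound fake.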