Recently, Kanye West released a device dubbed the "Stem Player," the premise being that one could record a song, store it as its stems, and then mix it. "Stems" are the contributions to a song from different sources, such as vocals, percussion, and bass. "Mixing" a song means manipulating each of these stems, adjusting aspects such as equalization, compression, and reverb, so that the final song, or "mix," is more balanced and sounds better overall. We decided this would be an interesting project for applying Digital Signal Processing (DSP) techniques, since stem splitting, stem combination, and mixing all fall within the realm of DSP.
 Image courtesy of www.trustedreviews.com
An In-Depth Explanation
For our EECS 351 project, our team is creating a stem player. In essence, a stem player takes digital music files and manipulates them to isolate the various parts of their makeup: drums, bass, vocals, and more. This is useful for music enthusiasts who like remixing their favorite songs, and for testing how well a sound system performs in specific frequency ranges.
The separation itself is done entirely in the frequency domain. There, specific frequency ranges roughly correspond to different instruments, so a song can be crudely split into categories like those in the image below.
 Image courtesy of www.imagemaker.com
Relying solely on this method to isolate the instruments, or “stems”, of a song produces low-quality separation because of the heavy overlap between instrument ranges visible in the image above. Instead, our team plans to use machine learning to train software to separate the sections of a song more cleanly. By using musdb, a dataset containing already separated stems of music tracks, we can train software to perform the frequency manipulations needed to isolate each section.
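As a rough illustration of the crude band-splitting approach described above (not the machine-learning method we plan to use), the sketch below splits a signal with hard frequency masks. The band names and edges in `BANDS` are hypothetical values chosen for the example, and the two sine tones stand in for real instruments.

```python
import numpy as np

# Hypothetical band edges in Hz; real instruments overlap far more than this.
BANDS = {
    "bass": (0.0, 250.0),
    "mids": (250.0, 4000.0),
    "highs": (4000.0, float("inf")),
}

def split_into_bands(signal, sample_rate, bands=BANDS):
    """Split a mono signal into frequency bands using hard FFT masks."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    stems = {}
    for name, (low, high) in bands.items():
        mask = (freqs >= low) & (freqs < high)  # keep only bins inside the band
        stems[name] = np.fft.irfft(spectrum * mask, n=len(signal))
    return stems

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate          # one second of audio
bass_tone = np.sin(2 * np.pi * 100 * t)           # falls in the "bass" band
mid_tone = 0.5 * np.sin(2 * np.pi * 1000 * t)     # falls in the "mids" band
stems = split_into_bands(bass_tone + mid_tone, sample_rate)
```

Two pure tones separate cleanly here only because they sit in different bands. A vocal and a guitar that share the same frequency range would both end up in "mids", which is the overlap problem that motivates a learned approach.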
In practice, the process will involve three main steps: reading, separating, and reconstructing.
After reading in an audio file (.wav, .mp3, .mp4), the time-domain audio will be converted into the frequency domain using the Short-Time Fourier Transform (STFT). Using the frequency-domain information, our software will separate the music into its stems, including vocals, accompaniment, bass, and drums. A stretch goal is finer-grained separation, since telling apart, for example, a flute and a trumpet from frequency content alone can be tricky. Our plan for accomplishing this is primarily to use machine learning (a basic algorithm or deep learning with TensorFlow), or to implement a more experimental idea of fitting wavelets to different instruments to better represent their likely frequencies.
With the separation complete, the code will output an audio file for each individual stem. Ideally, each stem closely matches the original vocal or instrument component, but stems overlap where they share similar frequency components, and if that overlap is not handled properly it will diminish the quality of the individual stems. Summing the separated stems should ideally reproduce the full song.
Once the user has the individual stems, a few tools will be available to manipulate them and improve the overall sound of the final song. Bass boosting, equalization, saturation, delays, and faders are good examples.
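To make the time-to-frequency step concrete, here is a minimal NumPy sketch of an STFT and its inverse. The function names, the frame size `n_fft=1024`, and the hop of 512 samples are assumptions for illustration, and a synthetic 440 Hz tone stands in for a decoded audio file.

```python
import numpy as np

def stft(signal, n_fft=1024, hop=512):
    """Short-Time Fourier Transform with a periodic Hann window."""
    window = np.hanning(n_fft + 1)[:-1]            # periodic Hann window
    padded = np.pad(signal, n_fft // 2)            # zero-pad so edge samples are covered
    n_frames = 1 + (len(padded) - n_fft) // hop
    frames = np.stack([padded[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)             # shape: (frames, frequency bins)

def istft(spec, length, n_fft=1024, hop=512):
    """Inverse STFT by overlap-add, normalised by the summed window."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + n_fft] += frame
        norm[i * hop : i * hop + n_fft] += window
    out /= np.maximum(norm, 1e-8)                  # undo the window weighting
    return out[n_fft // 2 : n_fft // 2 + length]   # drop the padding

sample_rate = 8000
t = np.arange(2 * sample_rate) / sample_rate       # two seconds of audio
audio = np.sin(2 * np.pi * 440 * t)
spec = stft(audio)
rebuilt = istft(spec, length=len(audio))
```

Any separation method would operate on `spec` between these two calls. The round trip matters because whatever is done to the spectrum has to be converted back into audio afterwards.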
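One common way to keep the stems consistent with the full song is to share each time-frequency bin between stems using ratio ("soft") masks that sum to one. The sketch below is an illustration under that assumption, not our final method: random arrays stand in for the mixture spectrum and for the per-stem magnitude estimates a trained model might produce, and the helper names are hypothetical.

```python
import numpy as np

def soft_masks(stem_magnitudes, eps=1e-10):
    """Turn per-stem magnitude estimates into ratio masks that sum to one."""
    total = np.maximum(sum(stem_magnitudes.values()), eps)  # avoid division by zero
    return {name: mag / total for name, mag in stem_magnitudes.items()}

def apply_masks(mix_spec, masks):
    """Give each stem its share of every time-frequency bin of the mixture."""
    return {name: mask * mix_spec for name, mask in masks.items()}

rng = np.random.default_rng(0)
shape = (32, 513)                                  # (frames, frequency bins), illustrative
mix_spec = rng.normal(size=shape) + 1j * rng.normal(size=shape)
estimates = {name: rng.random(shape)
             for name in ("vocals", "drums", "bass", "other")}

masks = soft_masks(estimates)
stem_specs = apply_masks(mix_spec, masks)
recombined = sum(stem_specs.values())              # should equal the mixture spectrum
```

Because the transform back to the time domain is linear, stems built this way also sum to the original song as audio. Overlapping frequency content is split in proportion to the estimates rather than being assigned entirely to one stem.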
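Two of these tools, faders and delays, are simple enough to sketch directly. The function names, parameter values, and test tones below are assumptions for illustration; a real delay effect would usually add feedback and filtering.

```python
import numpy as np

def db_to_gain(db):
    """Convert a fader level in decibels to a linear amplitude gain."""
    return 10.0 ** (db / 20.0)

def apply_delay(signal, sample_rate, delay_seconds, mix=0.5):
    """Add one delayed, attenuated copy of the signal (a single echo)."""
    shift = int(round(delay_seconds * sample_rate))
    delayed = np.zeros_like(signal)
    if shift < len(signal):
        delayed[shift:] = signal[: len(signal) - shift]
    return signal + mix * delayed

def mix_stems(stems, fader_db):
    """Sum stems after applying a per-stem fader level in decibels."""
    return sum(db_to_gain(fader_db.get(name, 0.0)) * audio
               for name, audio in stems.items())

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
stems = {
    "vocals": np.sin(2 * np.pi * 440 * t),
    "bass": np.sin(2 * np.pi * 80 * t),
}
stems["vocals"] = apply_delay(stems["vocals"], sample_rate,
                              delay_seconds=0.25, mix=0.3)
final_mix = mix_stems(stems, {"bass": 6.0, "vocals": -3.0})  # boost bass, pull vocals back
```

Working in decibels mirrors how faders are labelled on a mixing desk: +6 dB roughly doubles a stem's amplitude, and 0 dB leaves it unchanged.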
Work so far has focused on mixing stems from the musdb dataset. The dataset consists of over 150 seven-second tracks, each providing the full song along with tracks for the separated sections (isolated bass, drums, vocals, and other instrumentation). For one song, the separated vocal track and the remaining accompaniment track were transformed into the frequency domain using the Short-Time Fourier Transform. Spectrograms were then taken of the vocal and instrumental tracks.
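A spectrogram of this kind can be computed as the log-magnitude of the STFT. The sketch below is a simplified stand-in for that step: the function name, frame size, and hop are assumptions, and two synthetic tones replace the actual musdb vocal and accompaniment tracks.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def spectrogram_db(signal, n_fft=1024, hop=256, eps=1e-10):
    """Log-magnitude spectrogram in dB, shape (frames, frequency bins)."""
    window = np.hanning(n_fft)
    frames = sliding_window_view(signal, n_fft)[::hop] * window
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return 20.0 * np.log10(magnitude + eps)        # eps avoids log of zero

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
vocal_like = np.sin(2 * np.pi * 440 * t)           # stand-in for a vocal stem
accomp_like = np.sin(2 * np.pi * 110 * t)          # stand-in for accompaniment

vocal_db = spectrogram_db(vocal_like)
accomp_db = spectrogram_db(accomp_like)
bin_hz = sample_rate / 1024                        # frequency spacing between bins
vocal_peak_hz = vocal_db.mean(axis=0).argmax() * bin_hz
accomp_peak_hz = accomp_db.mean(axis=0).argmax() * bin_hz
```

Plotting `vocal_db` and `accomp_db` as images, with time on one axis and frequency on the other, gives the familiar spectrogram view. The peak estimates are only as fine as the bin spacing, so they land near the true tone frequencies rather than exactly on them.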
Images not otherwise credited are Wix template images.