
CURRENT PROGRESS

Work so far has focused on mixing stems from the musdb dataset. The dataset is made up of 150 seven-second tracks, each containing the full song along with isolated stems for the bass, drums, vocals, and remaining instrumentation. For one song, the separated vocal and accompaniment tracks were transformed into the frequency domain using Short-Time Fourier Transforms, and spectrograms were then generated for each.
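For reference, a rough sketch of this step is shown below, assuming the musdb Python package for loading stems and scipy for the Short-Time Fourier Transform; the preview download and the 2048-sample window length are illustrative choices, not necessarily the exact settings used for the plots that follow.

import musdb
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import stft

mus = musdb.DB(download=True)          # 7-second preview tracks
track = mus.tracks[0]                  # pick one song

# Mix each stereo stem down to mono so each stem gets one spectrogram
vocals = track.targets['vocals'].audio.mean(axis=1)
accomp = track.targets['accompaniment'].audio.mean(axis=1)

# Short-Time Fourier Transform of each stem
f, t, V = stft(vocals, fs=track.rate, nperseg=2048)
_, _, A = stft(accomp, fs=track.rate, nperseg=2048)

# A spectrogram is the STFT magnitude, here plotted on a dB scale
plt.pcolormesh(t, f, 20 * np.log10(np.abs(V) + 1e-8), shading='auto')
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
plt.title('Isolated vocals')
plt.show()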

vocals.png

The isolated vocals are concentrated at frequencies below 2 kHz. Three large vertical lines mark breaks in the audio where the singer is not holding a note.

accompaniment.png

The isolated accompaniment is much denser and more defined. The large vertical lines trace the rhythm of the music. Drums, bass, and the rest of the instrumental are all collected here, with significant activity below 8.5 kHz.

total.png

When the isolated vocal and instrumental data are combined in the frequency domain, we see characteristics of each within the plot: the three distinct vocal marks between 6 and 10 seconds as well as the rhythmic frequency spikes from the instrumental are all present. When this combined data is transformed back into the time domain, playing the resulting audio reproduces the original, complete music track.
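Because the STFT is linear, recombining the stems in the frequency domain is simply an element-wise sum of their complex spectra, and the inverse STFT turns that sum back into the full mix. A minimal sketch of this step, assuming scipy's stft/istft with matching parameters:

import musdb
from scipy.signal import stft, istft

track = musdb.DB(download=True).tracks[0]
vocals = track.targets['vocals'].audio.mean(axis=1)
accomp = track.targets['accompaniment'].audio.mean(axis=1)

_, _, V = stft(vocals, fs=track.rate, nperseg=2048)
_, _, A = stft(accomp, fs=track.rate, nperseg=2048)

# The STFT is linear, so summing the stem spectra gives the mix spectrum
mix_spectrum = V + A

# Inverting the combined spectrum recovers the complete time-domain track
_, mix_audio = istft(mix_spectrum, fs=track.rate, nperseg=2048)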


Difficulties

It is important to note some difficulties inherent to our work so far. At the moment, we have no reliable way of beginning the machine learning process. Though we understand how to mix stems back into the full track, it is another issue entirely to train software to read and separate parts of the instrumentation. We believe getting an early start on this is the best way to ensure our code works well when the project is due.

WHAT WE HAVE LEARNED

We have learned that separating stems goes quite far beyond parsing the frequencies common to certain instruments with various types of Fourier Transform. As demonstrated in Figure 1 above, many instruments produce sounds across a wide range of frequencies, many of which overlap with one another. We will therefore need to use machine learning as a digital signal processing tool, in conjunction with our frequency domain analysis, to more cleanly separate songs into stems. A perfect separation is likely incredibly difficult or potentially impossible; even the commercial stem player produced by West leaves undertones of some stems in the others. To improve performance, more and larger datasets will be needed to build a more robust machine learning algorithm.

PLAN MOVING FORWARD

As we near the end of the semester, the coming weeks will be dedicated to completing the following tasks:


For one, we will need to get the machine learning algorithm started. By using TensorFlow to build a convolutional neural network, we can begin training the software on the musdb dataset. Training will be repeated until the network can properly separate elements of the instrumentation from the original song. We expect this to be underway starting April 7th.
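As a starting point, a rough sketch of the kind of network we have in mind is shown below: a small Keras convolutional model that takes a magnitude-spectrogram patch of the full mix and predicts a soft mask for one stem. The patch size, layer counts, and loss are placeholder choices, not a finalized architecture.

import tensorflow as tf

FREQ_BINS, TIME_FRAMES = 512, 128      # assumed spectrogram patch size

def build_mask_model():
    inp = tf.keras.Input(shape=(FREQ_BINS, TIME_FRAMES, 1))
    x = tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu')(inp)
    x = tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu')(x)
    x = tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu')(x)
    # Sigmoid output in [0, 1]: a per-bin mask applied to the mix spectrogram
    mask = tf.keras.layers.Conv2D(1, 3, padding='same', activation='sigmoid')(x)
    return tf.keras.Model(inp, mask)

model = build_mask_model()
model.compile(optimizer='adam', loss='mae')
# Training pairs would come from musdb: mix-spectrogram patches as input,
# (stem magnitude / mix magnitude) as the target mask.

The predicted mask would then be multiplied with the mix spectrogram and inverted with the inverse STFT to produce the estimated stem.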


Next, we will need to experiment with various forms of mixing and effects on the music. This allows us to change the character of the audio by manipulating it in the frequency domain. One such example is included below, where the dataset track shown above is bass boosted.

BASSBOOSTED1.jpg
BASSBOOSTED2.jpg

Various mixes such as these will be created to achieve different effects. This has already started, but will be expanded further after the software is trained, around April 14th.
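As a rough illustration of how a bass boost like this can be applied in the frequency domain, the sketch below scales the low-frequency STFT bins of the track and inverts the result; the 150 Hz cutoff and 6 dB gain are assumed values, not the settings used for the figures above.

import musdb
import numpy as np
from scipy.signal import stft, istft

track = musdb.DB(download=True).tracks[0]
audio = track.audio.mean(axis=1)       # mono mix of the full track

f, t, Z = stft(audio, fs=track.rate, nperseg=2048)

# Boost everything below ~150 Hz by about 6 dB (a factor of ~2 in amplitude)
gain = np.where(f < 150, 2.0, 1.0)
Z_boosted = Z * gain[:, np.newaxis]

# Back to the time domain: the result is the bass-boosted audio
_, boosted_audio = istft(Z_boosted, fs=track.rate, nperseg=2048)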


Finally, we will need to combine all of our documented progress for the 351 report and website by the end of the class. This will require the full running list of documentation and tests used to train the convolutional neural network and separate audio files, as well as a general explanation of the digital signal processing occurring in our project. This will be started following April 12th.
