Stem Separation: Non-Negative Matrix Factorization (NMF)

Another Story to Tell

When researching methods for stem separation, we found that many were so well documented that implementing them would have been nearly trivial, such as anything built on a convolutional neural network with TensorFlow, PyTorch, or scikit-learn. To further our learning, we wanted a lesser-known algorithm, and settled on Non-Negative Matrix Factorization (NMF) using Librosa. The general idea is simple: factor a non-negative matrix (such as an image) into two smaller non-negative matrices, a weights (components) matrix that holds the recurring patterns found in the data and a hidden-variables (activations) matrix that records how prominent each component is at each location or time interval, depending on the problem. Multiplied together, the two matrices approximate the original. The method is similar to Principal Component Analysis, and we have seen the two compared in the literature. The best part about NMF, however, is how naturally it applies to spectrograms and the STFT: the magnitudes of the different frequencies are captured in a frequency x time-frame matrix that we can easily manipulate to suit our needs.

Matrix Decomposition in NMF Diagram by Anupama Garla, found at:

https://towardsdatascience.com/nmf-a-visual-explainer-and-python-implementation-7ecdd73491f8
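
In symbols, the factorization described above can be written as follows. The names V, W, H, F, T, and K are notation assumed here for illustration; they do not come from Librosa or from the diagram.

```latex
V \approx W H, \qquad
V \in \mathbb{R}_{\ge 0}^{F \times T}, \quad
W \in \mathbb{R}_{\ge 0}^{F \times K}, \quad
H \in \mathbb{R}_{\ge 0}^{K \times T},
\qquad
V_{ft} \approx \sum_{k=1}^{K} W_{fk} \, H_{kt}
```

Here V is the magnitude spectrogram with F frequency bins and T time frames, each column of W is the frequency profile of one component, and each row of H is that component's activation over time. In our example K = 4.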


To the left are the stem magnitude spectrograms for the 7 s clip of Fergessen's "The Wind" provided by the MusDB dataset. We use magnitude spectrograms because the raw STFT is complex-valued, and NMF requires every entry of the input matrix to be non-negative. As you can see, each stem has distinct components: bass has its highest-magnitude components at lower frequencies, vocals show spikes at intervals corresponding to the singer, drums are played at a specific tempo with notable harmonics along each vertical, and the "other" category has a mix of longer-lasting lower frequencies and faster higher frequencies where appropriate.
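
To illustrate why the magnitude step matters, here is a minimal NumPy-only sketch of a short-time Fourier transform followed by taking magnitudes. It is not the code used for the plots: the function name `stft`, the parameters `n_fft` and `hop`, and the synthetic 440 Hz tone are all assumptions made for illustration. With Librosa, the same quantity would typically be obtained as `np.abs(librosa.stft(y))`.

```python
import numpy as np

def stft(signal, n_fft=1024, hop=256):
    """Hypothetical minimal STFT: returns a complex (frequency bins x time frames) array."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping, windowed frames (one frame per column).
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)],
        axis=1,
    )
    # Real-input FFT along the frame axis keeps only the non-negative frequencies.
    return np.fft.rfft(frames, axis=0)

# Synthetic stand-in for an audio clip: one second of a 440 Hz tone at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)

S = stft(tone)   # complex-valued, so not a valid NMF input
V = np.abs(S)    # magnitude spectrogram: every entry is >= 0

print("complex STFT:", np.iscomplexobj(S), "| min magnitude:", V.min())
```

The complex matrix `S` has real and imaginary parts of either sign, whereas `V` is non-negative by construction, which is what NMF needs.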

Next, we have the first output of the Non-Negative Matrix Factorization: the 4 components, which we hope match up to vocals, bass, drums, and other. NMF is unsupervised in our case, so nothing labels the components for us and we have to match them to stems ourselves, though by observation the separation seems to have worked. Going from left to right, the 1st component has a broad frequency range but thinner bands, which may correspond to "other." The 2nd component has high magnitudes at the lower frequencies, which is likely bass, but it also covers frequencies that are too high for bass, a likely issue. Between the final two, it is tough to tell which is vocals and which is drums without the activations, but given that drums have a smaller very-low-frequency component, we can match drums to the 3rd component and vocals to the 4th.

The components need their activations to go with them. The activations tell us not only when each of the 4 components is played in the song, but also how loud or soft it is at that moment, which is very useful when comparing against spikes in the known stems. Note: each row corresponds to one component, so read the rows from top to bottom.
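
To show how components and activations come out of a factorization, here is a minimal NumPy sketch using the classic multiplicative-update rule for NMF. This is a stand-in technique chosen because it fits in a few lines; it is not necessarily what Librosa runs internally, and the post does not show the exact call we used (`librosa.decompose.decompose(S, n_components=4)` is one Librosa function that returns components and activations). The function name `nmf`, its parameters, and the toy two-source matrix are assumptions for illustration.

```python
import numpy as np

def nmf(V, n_components, n_iter=500, seed=0, eps=1e-10):
    """Hypothetical helper: factor V (F x T) into W (F x K) and H (K x T), all non-negative."""
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, n_components)) + eps   # components (frequency profiles)
    H = rng.random((n_components, n_frames)) + eps  # activations (loudness over time)
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "spectrogram": a low source active early and a high source active late.
freq_profiles = np.zeros((20, 2))
freq_profiles[:10, 0] = np.linspace(1.0, 0.1, 10)   # low-frequency source
freq_profiles[10:, 1] = np.linspace(0.1, 1.0, 10)   # high-frequency source
activity = np.zeros((2, 30))
activity[0, :20] = 1.0   # first source plays during frames 0-19
activity[1, 10:] = 1.0   # second source plays during frames 10-29
V = freq_profiles @ activity

W, H = nmf(V, n_components=2)
error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print("W:", W.shape, "H:", H.shape, "relative error:", error)
```

Each column of `W` plays the role of one component plot, and each row of `H` plays the role of one activation row. Because the algorithm is unsupervised, the order of the components is arbitrary and still has to be matched to sources by inspection.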

Note: All plots above generated by Jedidiah Pienkny

Wrapping up, we get to the magic of NMF! Given how the two matrices were split, they can be multiplied together with ordinary matrix multiplication to get a reasonably good reconstruction of the spectrogram in the new basis of the four components. The matrix produced by NMF generally has lower rank than the original, which accounts for the reduced detail in the reconstruction. Even so, comparing the input and reconstructed spectrograms side by side, the similar patterns and structures are immediately visible. Increasing or decreasing the number of components in the NMF matrices would change this resolution.
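
The reconstruction step and the rank limit can be sketched with a tiny hand-written example. The matrices `W` and `H` below are made-up values for illustration, not the ones from our song.

```python
import numpy as np

# Made-up components: 3 frequency bins x 2 components.
W = np.array([[1.0, 0.0],
              [0.5, 0.0],
              [0.0, 2.0]])

# Made-up activations: 2 components x 4 time frames.
H = np.array([[1.0, 0.0, 2.0, 0.0],
              [0.0, 3.0, 0.0, 1.0]])

# Reconstruction is a plain matrix product: (3 x 2) @ (2 x 4) -> (3 x 4).
V_hat = W @ H

# The rank of the product can never exceed the number of components (2 here).
rank = np.linalg.matrix_rank(V_hat)
print(V_hat)
print("rank:", rank)
```

With only K components, the reconstruction can contain at most K independent patterns, so any detail in the original spectrogram beyond that is lost. Raising K allows more detail at the cost of components that are harder to interpret.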

Finally, we look at the spectrograms of the 4 components combined with their activations, which lets us verify some of our earlier findings. The top-left spectrogram captures the low-frequency patterns of bass, though it incorrectly has a small peak around 9s. The spectrogram with the highest and most uniform magnitudes could correspond to either other or drums, but other is more likely because of its higher-magnitude lower band. Comparing the bottom two spectrograms, we notice that while they are similar, the bottom-left plot has higher magnitudes earlier in the song but a smaller low-frequency band, and the bottom-right plot has higher magnitudes later in the song and a slightly larger low-frequency band. Drums and vocals appear to have been somewhat confounded (this is only a single song, after all), but it seems that the bottom left corresponds to drums and the bottom right to vocals, at least in some sense.
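
One common way to obtain a spectrogram per component is to multiply each component column by its own activation row, as an outer product. The sketch below assumes that approach; the helper name `component_spectrograms` and the small matrices are hypothetical and are not taken from our plots.

```python
import numpy as np

def component_spectrograms(W, H):
    """Hypothetical helper: return shape (K, F, T), one rank-1 spectrogram per component."""
    n_components = W.shape[1]
    return np.stack([np.outer(W[:, k], H[k, :]) for k in range(n_components)])

# Made-up components (3 bins x 2) and activations (2 x 4 frames).
W = np.array([[1.0, 0.0],
              [0.5, 0.0],
              [0.0, 2.0]])
H = np.array([[1.0, 0.0, 2.0, 0.0],
              [0.0, 3.0, 0.0, 1.0]])

parts = component_spectrograms(W, H)

# The per-component spectrograms add back up to the full reconstruction.
print(parts.shape, np.allclose(parts.sum(axis=0), W @ H))
```

Plotting each slice `parts[k]` gives one spectrogram per component, which is the kind of view used to compare a component against a known stem.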

All in all, NMF was a great choice for working with spectrograms. While we only showed a single example, feeding more examples into another classifier or fitting algorithm would likely improve the accuracy of the results further.