Meeting 10 (05.06)

Goal: discussion of the results.


  • Minimize Z first.
  • Initialize Z with zeros.
  • Run a dozen experiments on the cluster (200 and 500 songs). A new instance with 30GB of RAM can handle the full dataset (1000 songs).
  • Test hyper-parameters with fewer frames but more genres (to reduce cross-validation variance). This also makes the problem harder, as we are close to perfect accuracy with two classes.
  • Create the graph (audio_graph.ipynb).

Main results

  • Feature extraction (only Z and D) improves accuracy over raw CQT spectrograms.
    • 2 genres: from 96 (+/- 4.7) to 99 (+/- 3.2)
    • 5 genres: from 76 (+/- 3.1) to 79 (+/- 2.7)
    • 10 genres: from 56 (+/- 5) to 65 (+/- 3.6) (LeCun is at 79-83, best is 93)
  • Outside of extreme values, \(\lambda_d\) does not seem to have a large impact.
  • Around 10 minutes to run a KNN search (with FLANN) and create the graph for the complete dataset (around 1.2 million nodes with dimensionality 96); a sketch of this step follows the list.
  • Training time increases with \(\lambda_d\) and is linear in m.
  • OpenBLAS is 1.4 times faster than ATLAS.
  • Even though the objective function is not convex, we fall into similar local minima despite random initialization.
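As a reference for the graph-construction timing above, a minimal sketch of the k-NN graph step. The actual experiments use FLANN; this example substitutes scipy's cKDTree so it stays self-contained, and k and \(\sigma\) are placeholder values.

    import numpy as np
    import scipy.sparse
    from scipy.spatial import cKDTree

    def knn_graph(X, k=10, sigma=1.0):
        """Symmetric k-NN similarity graph from the rows of X.

        X: (n, d) feature matrix (here n ~ 1.2 million, d = 96).
        Returns a sparse (n, n) weight matrix.
        """
        n = X.shape[0]
        tree = cKDTree(X)
        dist, idx = tree.query(X, k=k + 1)   # k+1: the first neighbour is the point itself
        dist, idx = dist[:, 1:], idx[:, 1:]
        weights = np.exp(-dist**2 / sigma**2)
        rows = np.repeat(np.arange(n), k)
        W = scipy.sparse.csr_matrix((weights.ravel(), (rows, idx.ravel())), shape=(n, n))
        return W.maximum(W.T)                # symmetrize, as discussed further below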


  1. (a) Minimize for D first or Z first? Minimizing Z first is needed if we initialize Z with zeros or project the columns of D onto the ball (not int). Not much difference in runtime. Z first looks better on the convergence graph when Z is initialized with zeros (instead of random values). See the sketch after this list.
  2. (b) Cluster: preprocessing of the whole dataset and feature learning on 200 songs. The cluster is 4-5 times faster than my laptop. The matrix multiplications performed by ATLAS are however not multi-threaded.
  3. (c) Do we need to normalize Z? No, because Z has the same energy as X (due to the constraint on D), and X is normalized.
  4. (d) Train an encoder (with \(\lambda_e = \lambda_d = 10\)): it does not seem to alter the performance much (either accuracy or training speed). The value of \(\|Z-EX\|_F^2\) is indeed an order of magnitude smaller than the other objectives, which indicates that it has little impact. Z is however two times less sparse. The purpose of E is thus solely to speed up the inference of Z. We still need to experiment with other values of \(\lambda_e\).
  5. (e) Learn a dictionary over 1000 songs: not enough memory, even with the 20GB of RAM of my cluster instance.
  6. (f) Learn a dictionary and classify for 500 songs. Feature extraction (only Z and D) improves performance from 76 (+/- 3.1) (using spectrograms) to 79 (+/- 2.7) for 5 genres.
  7. (g) Influence of N_outer. Just make sure it has converged. With rtol=1e-5 it converges after 11 iterations for 200 songs and 7 for 500 songs. Set it to 15 to be safe (additional iterations are cheap).
  8. (h) Influence of \(\lambda_d\). Something around 10 sounds good.
    • \(\lambda_d = 10\): 11 outer iterations are enough, the data and prior terms are of the same order of magnitude, sparsity 3%.
    • \(\lambda_d = 100\): 15 outer iterations are not enough, the data term is an order of magnitude lower than the prior, all objectives are higher than with \(\lambda_d = 10\), sparsity 33%, the atoms look noisy. Training is 2x slower than with \(\lambda_d = 10\), for the same classification accuracy. Seems too high.
    • \(\lambda_d = 1\): 6 outer iterations are enough, the prior term is an order of magnitude lower than the data term, all objectives are lower than with \(\lambda_d = 10\), sparsity 0.1%, the peaks in the atoms are smaller. Training is 5x faster than with \(\lambda_d = 10\). The classification accuracy however drops severely to 58%, probably because the representation is too sparse. Definitely too low.
  9. (i) Influence of m. In the end we’ll use 512 with 1000 songs.
    • m=512: 8 outer iterations are enough, 4x slower (linear in m), higher prior and lower data objectives, sparsity 1.1%, classification accuracy climbs to 99 (+/- 3.2).
  10. (j) Learn a dictionary over 1000 songs (ok with 30GB of RAM, takes 10 hours). Accuracy of 65 (+/- 3.6).
  11. (k) ATLAS vs OpenBLAS. ATLAS (the version packaged by Ubuntu) is single-threaded while OpenBLAS distributes the computation over the 12 cores. Note that large matrix multiplications are however memory-bandwidth limited, so the cores do not run at 100%. OpenBLAS is 1.4 times faster. Should we use a common seed? See the snippet after this list.
    • OpenBLAS: 4379s
    • ATLAS: 6152s
  12. (l) Influence of Nvectors. Maybe there are too few features with one vector; we should test that on the full dataset. We will work with Nvectors=6, as Nvectors=1 shows no improvement over the baseline and thus does not let us interpret the effect of hyper-parameters.
    • Nvectors=6 (Nframes=149): 74 (+/- 4.3), baseline (Xs) 68 (+/- 6.1)
    • Nvectors=1 (Nframes=149): 65 (+/- 4.9), baseline (Xs) 65 (+/- 3.9)
  13. (m) Influence of convergence (rtol). Tightening rtol roughly doubles the training time for about a 1% accuracy gain. We'll decrease rtol for the final accuracy measurements.
    • rtol=1e-3: 718s, 3 outer iterations are enough, objective 1.138e5, sparsity 4.6%, 74 (+/- 4.5)
    • rtol=1e-5: 4379s, 9 outer iterations are enough, objective 1.016e5, sparsity 1.6%, 74 (+/- 4.3)
    • rtol=1e-7: 9817s, 15 outer iterations are not enough, objective 1.005e5, sparsity 1.4%, 75 (+/- 4.6)
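A minimal sketch of the outer loop discussed in (a), (g) and (m): Z is initialized with zeros and minimized first, the columns of D are projected onto the unit ball, and the loop stops on the relative change of the objective (rtol). The inner solvers solve_z and solve_d are hypothetical placeholders, not the actual implementation.

    import numpy as np

    def project_ball(D):
        """Project each column of D onto the unit l2 ball."""
        norms = np.maximum(np.linalg.norm(D, axis=0), 1.0)
        return D / norms

    def learn(X, m, solve_z, solve_d, n_outer=15, rtol=1e-5, seed=None):
        """Alternate between the two sub-problems until the objective stalls.

        solve_z(X, D, Z) and solve_d(X, D, Z) stand for the inner
        minimizations; each returns the updated variable and the
        current value of the global objective.
        """
        rng = np.random.default_rng(seed)
        Z = np.zeros((m, X.shape[1]))                 # Z initialized with zeros
        D = project_ball(rng.standard_normal((X.shape[0], m)))
        last = np.inf
        for _ in range(n_outer):
            Z, _ = solve_z(X, D, Z)                   # minimize Z first
            D, objective = solve_d(X, D, Z)
            D = project_ball(D)                       # enforce the constraint on D
            if abs(last - objective) <= rtol * abs(objective):
                break                                 # outer convergence reached
            last = objective
        return D, Z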
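To check which BLAS NumPy is actually linked against when comparing the two backends in (k), and to address the common-seed question, a quick snippet; the OpenBLAS thread count is capped outside Python with the OPENBLAS_NUM_THREADS environment variable.

    import numpy as np

    # Prints the BLAS/LAPACK libraries NumPy was built against
    # (ATLAS, OpenBLAS, MKL, ...).
    np.show_config()

    # Fix the seed so that both backends are timed on identical data.
    np.random.seed(42)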


  • No need to constrain E. All its values are smaller than 1 because X is normalized. As D is constrained (normalized in practice because of \(\|Z\|_1\)), it does not change the energy, thus Z has the same energy as X. That would leave the single objective \(\|Z-EX\|_F^2\) when minimizing for E. Use a second-order method like Newton instead of (slow) gradient descent? See the sketch after this list.
  • We spend more time training Z than D, although D is what we actually want to learn during training.
  • Cross-validation variance is high when the number of genres or songs is low.
  • Z is indeed much more sparse than X.
  • Outer convergence is reached when each inner minimization converges immediately (one step).
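On the Newton-vs-gradient-descent question: minimizing \(\|Z-EX\|_F^2\) for E alone is an unconstrained least-squares problem with a closed-form solution, so a single linear solve suffices. A sketch, assuming Z is \(m \times n\) and X is \(d \times n\) (the function name is mine, not from the code base):

    import numpy as np

    def solve_encoder(Z, X):
        """Closed-form minimizer of ||Z - E X||_F^2 over E.

        Setting the gradient to zero gives E (X X^T) = Z X^T,
        i.e. one linear solve instead of many gradient steps.
        """
        A = X @ X.T            # (d, d), symmetric
        B = Z @ X.T            # (m, d)
        # lstsq copes with a rank-deficient X X^T.
        return np.linalg.lstsq(A, B.T, rcond=None)[0].T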


  • Run the KNN search only once. We can then recompute the weights when \(\sigma\) changes (see the sketch after this list).
  • Normalize X in audio_preprocessing? Then no scaling is needed in audio_classification or audio_features.
  • Write a script that runs audio_features and several audio_classification instances with configurable parameters (ld, m, N) to ease testing and hyper-parameter tuning. audio_classification should classify with Xs, Z (and the approximate Z).
  • Tune \(\lambda_e\) such that all objectives are of the same order of magnitude?
  • Test the Laplacian by introducing noise into the signal used to construct the graph (moving it off the manifold), then optimize for smoothness using the constructed Laplacian: it should push the points back onto the manifold (also sketched after this list).
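A minimal sketch of the last two points, assuming the distances and indices from the single k-NN search are kept around; sigma, gamma and the function names are placeholders of mine, not the actual code.

    import numpy as np
    import scipy.sparse
    from scipy.sparse.linalg import spsolve

    def weights_from_knn(dist, idx, sigma):
        """Rebuild W from precomputed k-NN distances, so that changing
        sigma does not require a new k-NN search."""
        n, k = dist.shape
        w = np.exp(-dist**2 / sigma**2)
        rows = np.repeat(np.arange(n), k)
        W = scipy.sparse.csr_matrix((w.ravel(), (rows, idx.ravel())), shape=(n, n))
        return W.maximum(W.T)

    def smooth(L, y, gamma=1.0):
        """Tikhonov smoothing of a graph signal y:
        argmin_x ||x - y||^2 + gamma * x^T L x, i.e. solve (I + gamma L) x = y."""
        A = scipy.sparse.identity(L.shape[0]) + gamma * L
        return spsolve(A.tocsc(), y)

    # Hypothetical test: perturb one feature dimension of the data used to
    # build the graph, then check that smoothing with the Laplacian of the
    # clean graph pulls the signal back towards the manifold.
    # W = weights_from_knn(dist, idx, sigma=1.0)
    # degrees = np.asarray(W.sum(axis=1)).ravel()
    # L = scipy.sparse.diags(degrees) - W          # combinatorial Laplacian
    # noisy = X[:, 0] + 0.1 * np.random.randn(W.shape[0])
    # denoised = smooth(L, noisy, gamma=5.0)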


  • How do we obtain an undirected graph (i.e. a symmetric weight matrix)? Do we even want one? Yes. We will use \(\max(W, W^T)\).
  • Use TV regularization instead of the Dirichlet energy. Define the continuous and discrete TV, then see how to solve it (definitions sketched after this list).
  • Approximate TV by taking only the low frequencies into account, i.e. by filtering out the components associated with the highest eigenvalues of the Laplacian matrix.
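For reference, the standard discrete quantities this comparison rests on; the particular graph-TV discretization below is one common choice and an assumption on my part, not something fixed at the meeting.

\[
\tfrac{1}{2} \sum_{i,j} W_{ij} (x_i - x_j)^2 = x^\top L x \qquad \text{(Dirichlet energy, with } L = \operatorname{diag}(W\mathbf{1}) - W\text{)}
\]
\[
\|x\|_{TV} = \sum_i \Big( \sum_j W_{ij} (x_i - x_j)^2 \Big)^{1/2} \qquad \text{(discrete graph TV)}
\]

Replacing the per-node quadratic by its square root is what makes TV penalize sharp transitions less heavily than the Dirichlet energy does.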


