Meeting 5 (27.03)

Goal: final review of the optimization schemes. We now move on implementation (toward ISMIR paper) and will come back to optimization later.


  • Review of the proposed optimization scheme by Xavier which uses primal-dual (Chambolle-Pock) [5] for the three convex minimizations.
  • Complete batch training optimization scheme:
    1. \(\min_D\) : FISTA [3].
    2. \(\min_Z\) : FISTA with adaptive restart [4].
    3. \(\min_E\) : a forward-backward based primal-dual formulation which exploits the gradient (unlike Chambolle-Pock). It generalizes both Chambolle-Pock and ISTA [2] but, unlike these two, is not accelerated; variants exist for this purpose.
  • Characteristics:
    1. No tricks: each formulation fits its solver directly.
    2. Optimal convergence in \(O(1/k^2)\) for \(Z\) and \(D\).
    3. Linear convergence for \(Z\) after some iterations thanks to adaptive restart.
    4. All available gradients are exploited.
    5. Each minimization is distributable, i.e. the sub-problems are independent. The matrices only need to be reassembled in the outer loop.
    6. Easy to add the \(tr(ZLZ^T)\) regularization. We would however lose the parallelization ability, by definition of the problem (smooth gradient for each dimension of the sparse code).
  • The paper git repository is now on GitLab (was previously local).
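As a concrete reference for step 2 above, here is a minimal NumPy sketch of FISTA with function-value adaptive restart applied to \(\min_Z \frac{1}{2}\|X-DZ\|_F^2 + \lambda\|Z\|_1\). The function names and the restart test are my choices for illustration, not the project code:

```python
import numpy as np

def soft_threshold(X, t):
    """Element-wise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(X) * np.maximum(np.abs(X) - t, 0)

def fista_restart(D, X, lam, n_iter=200):
    """Minimize 0.5 * ||X - D Z||_F^2 + lam * ||Z||_1 over Z with FISTA
    and function-value adaptive restart (reset momentum when f increases)."""
    L = np.linalg.norm(D, 2) ** 2              # Lipschitz constant of the gradient
    Z = np.zeros((D.shape[1], X.shape[1]))
    Y, t, f_old = Z.copy(), 1.0, np.inf
    for _ in range(n_iter):
        grad = D.T @ (D @ Y - X)               # gradient of the smooth part at Y
        Z_new = soft_threshold(Y - grad / L, lam / L)
        f = 0.5 * np.linalg.norm(X - D @ Z_new) ** 2 + lam * np.abs(Z_new).sum()
        if f > f_old:                          # objective went up: restart
            Y, t = Z, 1.0
            continue
        t_new = (1 + np.sqrt(1 + 4 * t ** 2)) / 2
        Y = Z_new + (t - 1) / t_new * (Z_new - Z)
        Z, t, f_old = Z_new, t_new, f
    return Z
```

Resetting the momentum whenever the objective increases is what yields the eventual linear convergence mentioned above.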


  • Why not set \(E = D^T\)? Check mathematically or experimentally whether it is equivalent.
  • Why constraints on the rows of \(E\)? It was a mistake by Xavier; they should be on the columns.
  • Discussion of my optimization scheme.
  • How to clearly state the rows and columns of matrices?
  • pyGSP status: it can create KNN graphs and compute the graph Laplacian. There are some unit tests, but it is preferable to compare with MATLAB.
  • Literature reviews:
    1. Are other people working on clean optimization schemes for deep learning?
    2. Optimization schemes for sparse coding: does everybody use FISTA for \(Z\)? What are they using for \(D\)?
  • Agenda:
    1. Auto-encoder prototype for images (easier to build intuition).
    2. Setup: pre-processing and post-processing.
    3. Merge 1 and 2: results with genre recognition.
    4. Interpret results (classification accuracy, learned basis).
    5. Write paper (6+1 pages): abstract April 27th, deadline May 4th.
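To cross-check the pyGSP functionality mentioned above (KNN graph construction and Laplacian computation) against MATLAB, a small self-contained NumPy reference can help. This is a validation sketch of the standard constructions, not the pyGSP API:

```python
import numpy as np

def knn_laplacian(X, k=3, sigma=1.0):
    """Build a symmetrized k-NN graph with a Gaussian kernel on the rows of X
    and return both the un-normalized (D - W) and symmetric normalized
    (I - D^-1/2 W D^-1/2) Laplacians."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    W = np.zeros_like(d2)
    for i in range(len(X)):
        nn = np.argsort(d2[i])[1:k + 1]                  # k nearest neighbors, self excluded
        W[i, nn] = np.exp(-d2[i, nn] / (2 * sigma ** 2))
    W = np.maximum(W, W.T)                               # symmetrize the adjacency
    deg = W.sum(1)
    L = np.diag(deg) - W
    d_inv_sqrt = np.diag(1 / np.sqrt(np.maximum(deg, 1e-12)))
    L_norm = np.eye(len(X)) - d_inv_sqrt @ W @ d_inv_sqrt
    return L, L_norm
```

Useful sanity checks: the constant vector lies in the null space of \(L\), and the eigenvalues of the normalized Laplacian lie in \([0, 2]\).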

 Next week

  • Start the implementation of a prototype in Python for images.
  • Write down the graph definition.
  • Prepare a new revision of the paper for review.
  • Continue to read the amassed papers.


  • Another convex optimization algorithm by Nesterov [1].
  • Nice discussion of duality with algorithms in [2].
  • Other methods to investigate for \(\min_E\): SDMM (a parallel generalization of ADMM) [6], PPXA [7], generalized forward-backward [8]. They are slow, but optimal convergence is not achievable anyway. Another option: drop \(\|EX\|_1\) (and solve with FISTA), drop the constraint (and solve with accelerated Chambolle-Pock), or drop both (and solve by matrix inversion or gradient descent). Fast convergence in every case.
  • Xavier: the Lipschitz constant is less important for Chambolle-Pock than it is for FISTA.
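The last option above, dropping both the \(\|EX\|_1\) term and the constraint, leaves a plain least-squares problem with a closed-form solution. A hypothetical sketch (the small ridge term \(\mu\) is my addition to keep the normal equations well conditioned):

```python
import numpy as np

def encoder_least_squares(X, Z, mu=1e-6):
    """Closed-form solution of min_E ||Z - E X||_F^2 + mu * ||E||_F^2, i.e. the
    scheme with both the ||EX||_1 term and the constraint removed. The small
    Tikhonov term mu (my addition) keeps the normal equations well conditioned."""
    n = X.shape[0]
    return Z @ X.T @ np.linalg.inv(X @ X.T + mu * np.eye(n))
```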


[1] Nesterov, Y. (2007). Gradient methods for minimizing composite objective function.
[2] Komodakis, N., & Pesquet, J. C. (2014). Playing with duality: An overview of recent primal-dual approaches for solving large-scale optimization problems. arXiv preprint arXiv:1406.5429.
[3] Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183-202.
[4] O’Donoghue, B., & Candes, E. (2013). Adaptive restart for accelerated gradient schemes. Foundations of computational mathematics, 1-18.
[5] Chambolle, A., & Pock, T. (2011). A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1), 120-145.
[6] Pesquet, J. C., & Pustelnik, N. (2012). A parallel inertial proximal optimization method. Pacific Journal of Optimization, 8(2), 273-305.
[7] Combettes, P. L., & Pesquet, J. C. (2008). A proximal decomposition method for solving convex variational inverse problems. Inverse problems, 24(6), 065014.
[8] Raguet, H., Fadili, J., & Peyré, G. (2013). A generalized forward-backward splitting. SIAM Journal on Imaging Sciences, 6(3), 1199-1226.


Meeting 4 (20.03)

Goal: final discussion about the optimization schemes.


  • Prepared and installed new computer.
  • Improved the draft paper with feedback from Xavier and Pierre.
  • Refined the two optimization schemes. Complete version of the online training optimization scheme.
  • Study of proximal splitting algorithms for convex optimization:
    • Forward-backward splitting: FISTA (dual prox), Primal-Dual (Chambolle-Pock)
    • Douglas-Rachford splitting: ADMM (uses augmented Lagrangian)
  • Completed the manifold learning section of the draft paper: better motivation. Various small improvements.


  • Optimization
    • We should not search for the optimal encoder with \(\min\limits_E \|z-Ex\|_2^2 + \|Ex\|_1\). The system is largely underdetermined as we only have one example. We should instead work with \(X\) and \(Z\).
    • The batch scheme is well-posed and good for the math. Furthermore it allows a parallel with LeCun (through ADMM), but it may be slow due to the two nested loops. It can be accelerated by not solving each sub-problem exactly, stopping instead after some iterations with a good-enough approximation. The online scheme targets implementation (approximate and faster).
    • Xavier proposed a primal-dual algorithm to minimize for \(D\). We may use PD for everything, even \(Z\) (we would have used FISTA).
    • Three properties of non-convex optimization: monotonicity (of the energy function), convergence (local, global, critical points) and speed. Convergence is hard to prove.
    • The multivariate non-convex optimization is guaranteed to be monotone if each convex sub-problem is optimally solved.
    • We should validate our algorithm with ground-truth experiments. We may use the Fourier transform and its inverse for \(D\) and \(E\). Then we can choose \(X\) and directly compute \(Z\). Our algorithm should then be able to retrieve one of the unknowns knowing the two others.
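The ground-truth experiment suggested above can be set up with the DFT, e.g. as below. This is only a sketch: Fourier codes are not sparse, so it checks reconstruction consistency, not sparsity.

```python
import numpy as np

np.random.seed(42)
n = 8
F = np.fft.fft(np.eye(n)) / np.sqrt(n)  # unitary DFT matrix, playing the role of E
D = F.conj().T                          # its inverse, playing the role of D
X = np.random.randn(n, 5)               # chosen ground-truth signals
Z = F @ X                               # the codes follow directly: Z = E X
assert np.allclose(D @ Z, X)            # the algorithm should recover X = D Z
```

Knowing any two of \(X\), \(Z\), \(D\) (or \(E\)), the optimization should be able to retrieve the third.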

Next week

  • Write down properly the complete optimization scheme: outer-loop with 3 inner-loops (PD for D and E, FISTA/PD for Z).
  • Write down the graph definition.
  • Prepare a new revision of the paper for review.
  • Start to implement.
  • Continue to read the amassed papers.


Meeting 3 (13.03)

Goal: first review of the ISMIR paper, theoretical discussion about the optimization schemes and the graph.


  • ISMIR paper draft: structure, abstract, algorithm.
  • Proposition of two optimization schemes: the first relies on stochastic gradient descent and permits online learning while the second works in batch mode.
  • Reflection on the usefulness and added complexity of \(\|Ex\|_1\).


  1. Paper
    • General comments and notation.
    • Can we cite unpublished (arXiv) work? Yes, because arXiv is now huge and recognized.
    • Capital letters for acronyms? Yes.
  2. Optimization
    • Review of the two optimization schemes. Which one would be easier / faster? We leave the batch and online schemes aside for now and will experiment and compare later.
    • We will start with the online scheme without \(\|EX\|_1\) as it is simpler to solve and implement. We can then integrate this term and compare the performance. We may also compare with the batch scheme.
    • Discussion of the algorithm to minimize for \(E\). Nathanaël proposes to use FISTA and possibly integrate the unit-norm constraint.
    • Xavier proposed a primal-dual algorithm to minimize for \(E\).
    • Constraint on the dictionary columns \(d_i\): Xavier claims that their norm should be smaller than or equal to one (not strictly equal to one) to allow the algorithm not to populate the full dictionary if it is too big. I think that without a term penalizing the energy of \(D\) in the cost function, the algorithm will exploit the full dictionary as it helps to minimize the reconstruction error. So the relaxed constraint allows a not fully populated dictionary but does not enforce it.
  3. Graph
    • To determine: type (epsilon or KNN graph), kernel (Gaussian or polynomial) and distance function (Euclidean or cosine).
    • Which distance function should we use to create the graph? Distance learning: a student of the lab is working on finding the optimal distance.
    • Which graph Laplacian to use? Xavier said the un-normalized \(L=D-W\) is bad because it does not converge to the continuous Laplace-Beltrami operator (it is not consistent). We should therefore prefer the symmetric normalized Laplacian \(L=I-D^{-1/2} W D^{-1/2}\). Johan said that it is so easy to try both that we should compare.
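The relaxed constraint \(\|d_i\|_2 \le 1\) on the dictionary columns discussed in point 2 amounts to a simple projection onto the unit ball inside the solver. A minimal sketch of that projection:

```python
import numpy as np

def project_columns_ball(D):
    """Project each dictionary column d_i onto the unit l2 ball (||d_i||_2 <= 1).
    Columns already inside the ball are left untouched, so the dictionary is
    allowed, but not forced, to stay partially populated."""
    norms = np.linalg.norm(D, axis=0)
    return D / np.maximum(norms, 1.0)
```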

Next week

  • Improve the paper with the various feedback.
  • Study convex optimization [1,2,3,4].
  • Clarify the optimization scheme.
  • Start to implement.
  • Continue to read the amassed papers.


  • I now use \(D\) for the dictionary and \(E\) for the encoder while the energy functions are written using matrices. It greatly simplifies the notation.


[1] Combettes, P. L., & Pesquet, J. C. (2011). Proximal splitting methods in signal processing. In Fixed-point algorithms for inverse problems in science and engineering (pp. 185-212). Springer New York.
[2] Komodakis, N., & Pesquet, J. C. (2014). Playing with duality: An overview of recent primal-dual approaches for solving large-scale optimization problems. arXiv preprint arXiv:1406.5429.
[3] Chambolle, A., & Pock, T. (2011). A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1), 120-145.
[4] Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.


Meeting 2 (06.03)

Goal: clarify the encoder architecture and discuss the complete scheme.


  • Documentation on frameworks used by groups working in deep learning: Torch7 (a MATLAB-like environment for machine learning in LuaJIT with a C/OpenMP/CUDA implementation) and Theano (a framework to define and optimize mathematical expressions in Python with efficient evaluation on both CPU and GPU). PyLearn2, Lasagne and Blocks are built on top of it.
  • Choice of an implementation language and framework: we will use Python as it is widespread in the ML community (e.g. scikit-learn) and possesses all the tools needed for audio pre-processing (e.g. Marsyas, PyMIR, Yaafe, Bregman, CQ-NSGT). As we start with a unique layer and a small dataset we don’t need the power of Torch7 or Theano for now. We will use our own pyUnLocBoX library to minimize the objective function.
  • Background documentation to learn where our project sits in current research: LeCun's group at Facebook mostly works on convolutional neural networks (CNN). Juergen's group at IDSIA works on recurrent neural networks (RNN) and long short-term memory (LSTM). Bengio's group at LISA works on deep belief networks (DBN) composed of sequentially trained (in an unsupervised way) layers of restricted Boltzmann machines (RBM) or auto-encoders (some RNN too). Mallat's group at École Polytechnique works on scattering transforms. Hierarchical temporal memory (HTM) is developed by the startup Vicarious. Google DeepMind (3 former PhD students of Juergen) works on reinforcement learning on deep architectures.
  • Study of sparse auto-encoders architectures:
    1. Sparse coding [6]: sparse code iteratively found by ISTA (minimization problem).
    2. Predictive sparse decomposition (PSD) [7]: sparse code approximation via an encoder (no iterations, faster).
    3. Learning ISTA (LISTA) [8]: use (F)ISTA algorithm architecture for the encoder.
    4. Learning coordinate descent (LcoD) [8]: faster than LISTA.
    5. Discriminate recurrent sparse auto-encoder (DrSAE) [9]: add a cross-entropy term (for classification loss).
    6. Convolutional sparse coding [10]: much cheaper to apply to a whole image than patch based learning.
  • Reflections on multiple layers of linear transformations: there is no point in stacking multiple layers of linear encoders \(z=A_e x\) as they can be reduced to a single one. Use non-linear encoders instead. While \(\|A_e x\|_1\) enforces the sparsity of the training data representations, the representation of unseen data may not be sparse without the “Sparsifying Logistic” non-linearity introduced in [1]. This may not be a problem for a classifier, but then we cannot say that we approximate a sparse representation.


  • Clarification of the encoder architecture as \(z = \arg\min\limits_z \|z-A_ex\|_2^2 + \|z\|_1 = h_\theta (A_ex)\), which is indeed non-linear. It is a computationally efficient approximation of the sparse coding problem which does not require an iterative algorithm.
  • Statement of the sparse coding problem \(\min_z \|x-A_dz\|_2^2 + \|z\|_0\) as an NP-hard problem which can be approximated by \(\min_z \|x-A_dz\|_2^2 + \|z\|_1\). The approximation is equivalent if the restricted isometry property (RIP) [12] is met (which does not hold in our case). We hope that adding structure to the problem will help to find a better solution. This is the idea behind the graph regularization term \(tr(ZLZ^T)\) [13] and a reason not to drop the dictionary term \(\|x-A_dz\|_2^2\), even though it is not used after training.
  • Discussion of [5] as an approximation to our more general model.
  • Clarification of the pre- and post-processing steps.
  • Introduction of a baseline classifier to compare with.
  • Xavier found a faster ISTA variant [11].
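The shrinkage function \(h_\theta\) is indeed the closed-form, component-wise minimizer of the encoder objective (with \(\theta\) tied to the \(\ell_1\) weight). A small numerical check against brute-force grid search:

```python
import numpy as np

def h_theta(a, theta):
    """Shrinkage non-linearity sgn(a_i) * (|a_i| - theta)_+ ."""
    return np.sign(a) * np.maximum(np.abs(a) - theta, 0)

# Component-wise check that h_theta solves min_z ||z - a||_2^2 + lam * ||z||_1
# when theta = lam / 2, by comparing against a brute-force grid search.
a, lam = np.array([1.3, -0.2, 0.8]), 0.5
grid = np.linspace(-2, 2, 100001)
for ai, zi in zip(a, h_theta(a, lam / 2)):
    best = grid[np.argmin((grid - ai) ** 2 + lam * np.abs(grid))]
    assert abs(best - zi) < 1e-3
```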

Next week

  • Clearly state the optimization problem:
    1. State and explain the global energy function.
    2. Per variable optimization (justify solver choice).
    3. Show the parallel with [5].
  • Setup a git repository to work on the paper.
  • Start the implementation:
    1. Setup the pre-processing steps: frames, CQT and LCN.
    2. Setup the classification scheme: pooling, linear SVM and voting.
    3. Show the incremental performance of the baseline classifier (without feature learning) as we add processing steps, to prove their usefulness.


  • Advantage of a decoder / dictionary: it is possible to analyze the learned filters. How would we know what features the layers extract with the encoder only? Note that with ConvNets it is possible to observe the convolution kernels. The decoder can also be exploited to explore whether the auto-encoder has found a good representation of the manifold where the data resides. By slightly modifying the representation \(z\) we should generate samples \(x = A_d z\) on the manifold. Such generated data points should be probable. Our unsupervised feature learning problem can be seen as a kind of manifold learning.
  • From [1] : “It seems that a non-linearity that directly sparsifies the code is considerably simpler to control than adding a sparsity term in the loss function, which generally requires ad-hoc normalization procedures.” By citing [2] he means that we should avoid the trivial solution of the minimization where the basis functions grow without bound and the weights decrease toward 0 (satisfying their L1 normalization).
  • Make use of Python notebooks.
  • A deep learning book from Bengio is in preparation (draft manuscript).
  • The first layer learns time-frequency-like features directly from the signal [3]. The second layer learns higher-level features like chords and harmonics. If time-frequency is a good representation, then unsupervised feature learning should retrieve it [4].
  • Future work: convolutional sparse coding [10] with raw audio should efficiently uncover a time-frequency representation. Convolution on whole song is computationally more efficient than learning on patches.
  • Deep auto-encoders: encoders and decoders are composed of layers of neurons. This is different from stacks of sparse auto-encoders.


[1] Poultney, C., Chopra, S., & Cun, Y. L. (2006). Efficient learning of sparse representations with an energy-based model. In Advances in neural information processing systems (pp. 1137-1144).
[2] Olshausen, B. A. (2002). Sparse codes and spikes. Probabilistic models of the brain: Perception and neural function, 257-272.
[3] Dieleman, S., & Schrauwen, B. (2014, May). End-to-end learning for music audio. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on (pp. 6964-6968). IEEE.
[4] Humphrey, E. J., Bello, J. P., & LeCun, Y. (2013). Feature learning and deep architectures: new directions for music informatics. Journal of Intelligent Information Systems, 41(3), 461-481.
[5] Henaff, M., Jarrett, K., Kavukcuoglu, K., & LeCun, Y. (2011, October). Unsupervised learning of sparse features for scalable audio classification. In ISMIR (Vol. 11, p. 276).
[6] Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1?. Vision research, 37(23), 3311-3325.
[7] Kavukcuoglu, K., Ranzato, M. A., & LeCun, Y. (2010). Fast inference in sparse coding algorithms with applications to object recognition. arXiv preprint arXiv:1010.3467.
[8] Gregor, K., & LeCun, Y. (2010). Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on Machine Learning (ICML-10) (pp. 399-406).
[9] Rolfe, J. T., & LeCun, Y. (2013). Discriminative recurrent sparse auto-encoders. arXiv preprint arXiv:1301.3775.
[10] Zeiler, M. D., Krishnan, D., Taylor, G. W., & Fergus, R. (2010, June). Deconvolutional networks. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 2528-2535). IEEE.
[11] O’Donoghue, B., & Candes, E. (2013). Adaptive restart for accelerated gradient schemes. Foundations of computational mathematics, 1-18.
[12] Candes, E. J., & Tao, T. (2005). Decoding by linear programming. Information Theory, IEEE Transactions on, 51(12), 4203-4215.
[13] Zheng, M., Bu, J., Chen, C., Wang, C., Zhang, L., Qiu, G., & Cai, D. (2011). Graph regularized sparse coding for image representation. Image Processing, IEEE Transactions on, 20(5), 1327-1336.

Meeting 1 (27.02)

Goal: Formulate the unsupervised feature learning problem as the minimization of an energy function.


  • Read and understand the method of [1].
  • Learn about the Constant-Q transform [2], a pre-processing step to transform the audio signal in a time-frequency representation.
  • Research tools for audio pre-processing. Most of them are available in MATLAB or Python.


  • Xavier models a linear encoder \(z = A_e x\) while [1] uses a non-linear function \(z = h_\theta (Wx+b)\) where \(h_\theta (x)_i = sgn(x_i)(|x_i|-\theta_i)_+\). This shrinkage function sets to zero any code component whose magnitude is below the threshold \(\theta_i\), which enforces sparsity.
  • Instead of enforcing the sparsity by a soft-thresholding \(h_\theta\) we introduce a sparsifying term \(\|A_e x\|_1\) in the loss function.
  • An open question is whether the term \(\|z-A_e x\|_2^2\) enforces the sparsity of \(A_e x\) if \(z\) is sparse (enforced by \(\|z\|_1\)).
  • Once our objective function is defined, we can iteratively minimize for \(z\), \(A_e\) and \(A_d\) while keeping the two others fixed. The sparse code \(z\) can be minimized by FISTA [3] while \(A_e\) and \(A_d\) can be minimized by stochastic gradient descent.

Next week

  • Clearly state the optimization problem:
    1. global energy function
    2. per variable optimization (justify solver choice)
    3. show a parallel with [1]
  • Set up a git repository to work on the paper.


  • Our approach is driven by the minimization of an energy function whereas [1] trains a non-linear encoder to find an approximation of the optimal sparse representation.
  • Stochastic gradient descent combined with the backpropagation algorithm is the de facto standard to train deep architectures.


[1] Henaff, M., Jarrett, K., Kavukcuoglu, K., & LeCun, Y. (2011, October). Unsupervised learning of sparse features for scalable audio classification. In ISMIR (Vol. 11, p. 276).
[2] Brown, J. C. (1991). Calculation of a constant Q spectral transform. The Journal of the Acoustical Society of America, 89(1), 425-434.
[3] Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183-202.

Paper: Unsupervised learning of sparse features for scalable audio classification


  • Unsupervised feature extraction instead of hand-crafted basis functions (Gabor, Wavelet etc.).
  • Train an encoder (Predictive Sparse Decomposition, PSD) to quickly find an approximation of the sparse code instead of the iterative procedure (matching pursuit, basis pursuit) required to infer the optimal sparse code.


  1. Pre-processing:
    1. Frames of 1024 samples.
    2. Constant-Q transform (CQT): 96 filters from C2 to C6 (quarter-tone resolution).
    3. Subtractive and divisive local contrast normalization (LCN) applied to spectrogram.
    4. Optionally train one feature extractor per octave.
  2. Unsupervised feature extraction:
    1. Iteratively learn the sparse code \(z\) and dictionary \(B\) (size 512):
      1. Get a sample signal \(x\).
      2. Find the optimal sparse code \(z^*\) by minimizing \(L_d = \frac{1}{2}\|x-Bz\|_2^2 + \lambda\|z\|_1\) with respect to \(z\) using FISTA (gradient descent can also be used).
      3. Update the dictionary with one step of stochastic gradient descent: \(B \leftarrow B-\nu \frac{\partial L_d}{\partial B}\). Columns of \(B\) are then rescaled to unit norm.
    2. Iteratively train the non-linear encoder \(U=(W,b,\theta)\):
      1. Get a sample signal \(x\) and compute its optimal sparse code \(z^*\).
      2. Update the encoder parameters with one step of stochastic gradient descent: \(U \leftarrow U-\nu \frac{\partial L_c}{\partial U}\) where \(L_c = \frac{1}{2}\|z^*-h_\theta (Wx+b)\|_2^2\) and \(h_\theta (x)_i = sgn(x_i)(|x_i|-\theta_i)_+\).
    3. After training, the (approximate) sparse code of a new sample is given by \(z = h_\theta (Wx + b)\).
  3. Classification:
    1. Aggregate features over 5 seconds. This results in a histogram of the occurrences of each dictionary element in the window.
    2. Linear Support Vector Machine (SVM, implementation of [1]) with aggregate feature vectors.
    3. Genre determination by voting over all aggregate feature vectors in the song.
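The dictionary update in step 2.1.3 above can be sketched as follows. This is an illustration of the described procedure, not the authors' code; `nu` stands for the learning rate \(\nu\):

```python
import numpy as np

def dictionary_sgd_step(B, x, z, nu=0.01):
    """One stochastic gradient step on L_d = 0.5 * ||x - B z||_2^2 with respect
    to B (the gradient is -(x - B z) z^T), followed by rescaling the columns
    of B to unit norm as described in the paper."""
    B = B + nu * np.outer(x - B @ z, z)
    return B / np.maximum(np.linalg.norm(B, axis=0), 1e-12)
```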


  • Short frames because chords do not last.
  • The CQT refers to a time-frequency representation where the frequency bins are geometrically spaced and the Q-factors (ratios of the center frequencies to bandwidths) of all bins are equal. The transform mirrors the human auditory system, whereby at lower frequencies spectral resolution is better, whereas temporal resolution improves at higher frequencies. This is ideal to represent auditory signals. A MATLAB implementation is proposed by [2].
  • LCN can be seen as a form of automatic gain control, introduced in [3].
  • Three learnings: sparse coding, dictionary learning, encoder training.

Critical analysis

  • Unsupervised feature extraction: the features may not be the best for genre classification. But they might be general enough for other tasks.
  • No explanation why they define the non-linear encoders by \(z = h_\theta (Wx + b)\). Only that soft thresholding enforces sparsity.


  • End-to-end learning: feature extraction and classification using a deep architecture.
  • Recurrent network: inject back the genre decision of the temporally preceding frame. Jürgen Schmidhuber (from IDSIA, Switzerland) works on recurrent networks.
  • Concatenate octaves: one dictionary to represent the four octaves. At least the harmonics and chords have the same structure among octaves. Each instrument however has its own tessitura (pitch range).


[1] Chang, Chih-Chung, and Chih-Jen Lin. “LIBSVM: a library for support vector machines.” ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.
[2] Schörkhuber, Christian, and Anssi Klapuri. “Constant-Q transform toolbox for music processing.” 7th Sound and Music Computing Conference, Barcelona, Spain. 2010.
[3] Pinto, Nicolas, David D. Cox, and James J. DiCarlo. “Why is real-world visual object recognition hard?.” PLoS computational biology 4.1 (2008): e27.

Master project: Audio Classification with Structured Deep Learning


The goal of the project is to learn discriminative audio features in an unsupervised and hierarchical way. The method leverages recent advances in deep learning, using an efficient auto-encoder strategy that simultaneously learns a sparse representation of the inputs and a dictionary adapted to those inputs. On top of this layer, we add a new term that enhances the auto-encoder technique. This term is essentially the Dirichlet energy on the graph of audio inputs. It integrates complementary audio structures into the auto-encoder scheme. We will apply this audio classification technique to benchmark audio datasets including GTZAN.


We want to enhance the work of [1] (which uses one layer of auto-encoder) and publish in ISMIR. We will mainly work on feature extraction and will reuse the pre-processing and classification ideas.

General idea

We want to solve three problems

  1. sparse coding
  2. dictionary learning
  3. encoder training

as the minimization of the energy function
\(\min\limits_{z,A_e,A_d} \sum\limits_{i=1}^{N} \frac{\lambda_e}{2} \|z_i-A_ex_i\|_2^2 + \frac{\lambda_d}{2} \|x_i-A_dz_i\|_2^2 + \|z_i\|_1\)
where \(x_i\) are the data points, \(z_i\) the sparse codes, \(A_e\) defines the encoder and \(A_d\) the decoder (dictionary).

This problem can be solved by iteratively optimizing for each unknown while keeping the others fixed. The following algorithms can be used:

  • \(z\) : FISTA, primal-dual, ADMM
  • \(A_e\) and \(A_d\) : stochastic gradient descent, augmented Lagrangian


Once we obtain good results with the presented scheme, we will

  1. use similarity information between inputs (structured deep learning). We want similar signals \(x_i\), in terms of distance on the manifold of possible input signals \(\{x_i\}\), to have similar sparse codes \(z_i\). As this manifold is unknown, we construct a weighted graph from the samples \(x_i \in M\) and compute our regularity measure, the Dirichlet energy of the sparse codes \(\|Z\|_{D_G} = \sum_k \|\nabla Z_k\|_2^2 = tr(ZLZ^T)\), where \(L=D-W\) is the graph Laplacian (in its un-normalized form), the \(Z_k\) are the rows of \(Z\) and the columns of \(Z\) are the sparse codes \(z_i\). The weights \(W\) will be given by a similarity function yet to be specified. [2] showed that explicitly taking into account the geometrical structure of the data space, a submanifold, is indeed effective.
  2. stack multiple layers of such auto-encoders to form a deep architecture.
  3. integrate a multi-modal aspect using song annotations.
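A quick numerical sanity check of the graph regularizer: with the columns of \(Z\) holding the sparse codes, \(tr(ZLZ^T)\) equals the Dirichlet energy \(\frac{1}{2}\sum_{ij} W_{ij}\|z_i - z_j\|_2^2\):

```python
import numpy as np

# Numerical check of the graph regularizer: with the columns of Z holding the
# sparse codes z_i and L = D - W the un-normalized Laplacian, tr(Z L Z^T)
# equals the Dirichlet energy 0.5 * sum_ij W_ij * ||z_i - z_j||_2^2.
np.random.seed(0)
n = 6
W = np.random.rand(n, n)
W = (W + W.T) / 2                          # symmetric weights
np.fill_diagonal(W, 0)                     # no self-loops
L = np.diag(W.sum(1)) - W
Z = np.random.randn(4, n)                  # 4-dimensional codes, one per column
dirichlet = 0.5 * sum(W[i, j] * np.sum((Z[:, i] - Z[:, j]) ** 2)
                      for i in range(n) for j in range(n))
assert np.isclose(np.trace(Z @ L @ Z.T), dirichlet)
```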


[1] Henaff, M., Jarrett, K., Kavukcuoglu, K., & LeCun, Y. (2011, October). Unsupervised learning of sparse features for scalable audio classification. In ISMIR (Vol. 11, p. 276).
[2] Zheng, M., Bu, J., Chen, C., Wang, C., Zhang, L., Qiu, G., & Cai, D. (2011). Graph regularized sparse coding for image representation. Image Processing, IEEE Transactions on, 20(5), 1327-1336.

Some inpainting results

Yesterday I cleaned the code of GIIN and ran inpaintings of larger pictures on our cluster. The results are below. While some images were realistically reconstructed, others look weird. I will run simulations with different parameters. Our hope is that some set of parameters will work on most of our test images.


Master project

The EPFL spring semester officially started Monday, as did my Master project at LTS2. While the project was not yet defined, I spent the first week learning (mostly about graph analysis and deep learning), doing some administrative work, discussing with people, elaborating our paper about GIIN. I also took some time to think about what I wanted to explore during a following PhD, leading me to consider which host laboratory and program I wanted to attend.

We ended the week with two axes of research:

  1. A proposal from Xavier is to apply deep learning to musical data. The idea is to train a deep architecture to recognize the style of musical compositions (pop, rock, jazz, etc.). The algorithm would then be tested on our musical datasets (Montreux Festival archive, Kirell collection of features).
  2. A proposal from Pierre is to analyze the structure of a graph by transforming it into a collection of signals. We would then study graph structures using signal processing tools and map certain structures to observed signal properties. The idea is similar to [1], although we would not have any constraint of a back transformation (from signals to graph).

While (2) is an interesting problem with open perspectives, we have no dataset of graphs to test the method against. It may also overlap with Johan's thesis, which is concerned with the same problem. Pierre finally concluded that the idea is not defined enough and might be a better fit for a thesis. We thus chose to pursue (1), which is more defined and will allow me to acquire useful skills in data science.

This blog will report my progress. After a semester project working with images and spectral graph theory, let's play with sound and deep learning! This work will be supervised by Xavier Bresson, a post-doctoral researcher in data science. We will meet next Monday to discuss the idea in greater detail. I look forward to it! 🙂

[1] Hamon, R., Borgnat, P., Flandrin, P., & Robardet, C. (2015). From graphs to signals and back: Identification of graph structures using spectral analysis. arXiv preprint arXiv:1502.04697.

GIIN: end of the project

My work on GIIN at the EPFL LTS2 laboratory came to an end last December. The GIIN project aimed to explore applications of spectral graph theory to the problem of inpainting large missing chunks. We used a non-local patch graph representation of the image and proposed a structure detector which leverages the graph representation and influences the fill order of our exemplar-based algorithm. Our method achieved state-of-the-art performance.

A report of this work is available here. We are now working on a paper to present our results in IEEE Transactions on Image Processing.