Links

Repositories

  1. Paper (LaTeX)
  2. Code (IPython notebooks)
  3. Results (IPython notebooks)

Notebooks

  1. Algorithm derivation: from sparse coding to encoder learning through dictionary learning
  2. Implementation of our auto-encoder model
  3. Learning Gabor filters from image data
  4. Solver comparison (with Xavier's MATLAB primal-dual formulation)
  5. Genre recognition
    1. GTZAN
    2. pre-processing
    3. graph construction
    4. feature extraction
    5. classification
  6. An experiment launcher to test the influence of a hyper-parameter on various metrics.

Results

See the experiments.

Meeting 12 (19.06)

Goal: report and results review.

Achieved

  • Exploit the idea of the structuring auto-encoder as a hybrid between a sparse and a denoising auto-encoder.
  • Disable majority voting: smaller variance due to the increased number of samples to classify (i.e. 12 times more). The accuracy increases: it is easier to classify feature vectors than whole clips (see the sketch after this list).
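
A minimal sketch of the (now disabled) voting step, assuming hypothetical arrays of per-frame predictions and clip identifiers; this is not the notebook's actual code:

```python
import numpy as np

def majority_vote(frame_predictions, clip_ids):
    """Return one predicted label per clip, by majority over its frames."""
    clip_labels = {}
    for clip in np.unique(clip_ids):
        votes = frame_predictions[clip_ids == clip]
        clip_labels[clip] = np.bincount(votes).argmax()  # most frequent genre wins
    return clip_labels

# Disabling the voting means scoring `frame_predictions` directly, i.e. 12x
# more samples per clip, hence the smaller cross-validation variance.
```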

Main results

  • Introducing the graph Laplacian makes the model very robust against noise.
  • An improvement of 7% in both the noisy and noiseless environments. In the noiseless environment, however, the addition of the graph is not significant.
  • The kNN approximation influences the constructed graph and changes the accuracy by at least 1.3%. There is room for better performance with a better constructed graph. The quality of the FLANN approximation should be investigated.
  • The hyper-parameters do not seem to be very sensitive, i.e. we may not have to fine-tune them.
  • Speed:
    • Noiseless: graph generation is fast (~200s), feature extraction is slow (~2700s)
    • Noisy 10%: graph generation is slow (~1200s), feature extraction is fast (~900s)
    • Classification: ~400s with graph, ~700s without graph.
  • The encoder is adding structure.

Idea

  • The model works best when all the sub-objectives have comparable values –> auto-tuning of the hyper-parameters! This can already be approximately assessed after one outer iteration (see the sketch after this list).
  • The higher the $\lambda$, the higher the importance of the corresponding sub-objective. Try to give them equal weights.
  • Train the auto-encoder (graph, dictionary, encoder) on clean data and test on noisy data.
  • Go further and classify individual frames? That would mean discarding feature aggregation.
  • Use NLDR to assess whether there really is an underlying structure / manifold in the data?
  • It would be interesting to launch multiple simulations with the same hyper-parameters without seeding the RNG, to see whether the results are stable.
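
A hypothetical heuristic for the auto-tuning idea above (not implemented in the project): after one outer iteration, rescale each \(\lambda\) so that every weighted sub-objective contributes roughly the same value to the total energy. All names and numbers are illustrative.

```python
import numpy as np

def balance_weights(objectives):
    """objectives: dict of the *unweighted* sub-objective values measured
    after one outer iteration. Returns one lambda per sub-objective such
    that all weighted terms equal their mean."""
    target = np.mean(list(objectives.values()))
    return {name: target / max(value, 1e-12) for name, value in objectives.items()}

# Example with made-up values: the Dirichlet term is two orders of magnitude
# smaller than the data term, so its weight ends up about 100x larger.
print(balance_weights({'data': 1e5, 'sparsity': 5e4, 'dirichlet': 1e3}))
```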

Experiments

  • Keep smallest eigenvalues.
  • Hyper-parameters fine-tuning.
  • Increase sparsity in the presence of noise.
  • Standardization instead of scaling.
  • Accuracy results to be published (report, presentation).
    • (k) Without noise.
    • (k) With 20% noise.
  • (j) Encoder.
  • Performance stability (over various sources of randomness).
    • (i) Distance metric.
    • Zero mean for euclidean.
  • (h) Graph weight lg, noiseless setting.
  • Training / testing ratio.
    • (g) Without graph, without noise.
    • (g) With graph, without noise.
  • Graph vs no graph vs spectrograms.
    • (f) With noise.
    • (f) Without noise.
  • Better graph.
    • Scaling (features, samples, dataset): minmax vs standardization (see the sketch after this list).
    • Disable feature scaling. Then scale the noise! And verify the tolerance.
    • (e) Distance metric.
  • (d) Noise level.
  • (c) Classify individual feature vectors, i.e. no majority voting.
  • (b) Understand the influence of the graph in a noisy data setting. Warning: the graph was created with noiseless data.
  • (a) Noisy signal hypothesis.
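
For reference, a minimal sketch of the two scaling options mentioned above (min-max scaling vs standardization), using scikit-learn; X is assumed to be stored as (samples x features) as in the classification notebooks:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.random.rand(100, 96)                  # hypothetical feature matrix

X_minmax = MinMaxScaler().fit_transform(X)   # each feature rescaled to [0, 1]
X_std = StandardScaler().fit_transform(X)    # each feature: zero mean, unit variance
```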

Discussion

  • In the end, the result will be a plot of baseline, no graph and graph w.r.t. the noise level. Conclusion: preserving the structure (1) extracts better features and (2) is robust to noise.
  • What if each feature is normalized independently? That removes the bias towards features with a large variation, but is that what we want? Xavier: “yes, in the general case.”
  • Xavier: “we can apply the famous PageRank regularization, which is very robust to badly constructed graphs.”
  • Interesting fact: the encoder seems to add structure, since the Dirichlet energy decreases!
  • Xavier: “yes, but you must not forget that you are doing *transductive* learning, i.e. you learn the features using the training + TEST data. That is why you do not get a significant improvement. If we did *supervised* learning, i.e. using only the TRAINING data, then the problem would be much more challenging than the transductive one, and there I think our results with the graph would be much better! That is a comment I advise you to add to your results to put them in perspective. Also, the first thing to do after the PDM is to do *supervised* learning with the graph and compare it with no graph; I think we will have (good) surprises!”
  • Is there anything we can say about good-enough local minima? Xavier: “yes, there can be local minima which are excellent solutions to learning problems.”
  • Or that many local minima are actually similar and we do not care into which one we fall? Xavier: “NO! A bad solution and a good solution can have the same energy!”
  • Why are energy formulations good? Xavier: “Many reasons, such as good understanding, robustness, existence of solutions, design and analysis of optimization algorithms…”

Meeting 11 (15.06)

Goal: results discussion and review of the report.

Achieved

  • Fix various issues with the graph creation (audio_graph.ipynb). See git commits for further details.
  • Integrate the Dirichlet energy (via the graph Laplacian) in the objective function (auto_encoder.ipynb).
  • More efficient implementation of trace(Z.dot(L.dot(Z.T))) (see the sketch after this list).
  • Run various experiments using the graph.
  • Implement a relative stopping criterion for the outer loop.
  • Make a script (audio_experiment.ipynb) to perform automated experiments, i.e. to test the effect of a hyper-parameter on various metrics (e.g. accuracy, speed, sparsity, objective, etc.) by automatically generating plots. It also compares to the baseline.
  • Take the mean of multiple 10-fold cross-validation runs as the main metric (audio_classification.ipynb).
  • Add an ls (redundant) parameter to control the weight of the sparse codes (see experiment k).
  • Complete reports of the newest results are now published online.
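
The three equivalent evaluations of the Dirichlet energy \(tr(ZLZ^T)\) that were benchmarked (see experiment (f) below); a self-contained check, with dense random matrices standing in for the real data:

```python
import numpy as np

m, N = 128, 1000
Z = np.random.rand(m, N)                        # sparse codes (dense here)
L = np.random.rand(N, N)                        # graph Laplacian (dense here)

t1 = np.trace(Z.dot(L.dot(Z.T)))                # original: builds the full m x m product
t2 = np.multiply(L.dot(Z.T), Z.T).sum()         # Hadamard product: no m x m intermediate
t3 = np.einsum('ij,ji->', L.dot(Z.T), Z)        # Einstein summation: fastest variant

assert np.allclose(t1, t2) and np.allclose(t1, t3)
```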

Main results

  • Introducing the Dirichlet energy (via the graph Laplacian) helps to extract more robust features in the setting of noisy signals.
  • Introducing the Dirichlet energy (via the graph Laplacian) helps to extract better features:
    • 5 genres: from 79 (+/- 2.7) to 81 (+/- 3.4)

Idea

  • With our graph, we should be more robust to perturbed data, right? (Since we impose regularity on the manifold.) Have we created a kind of hybrid between a sparse and a denoising auto-encoder? All of that without having to perturb the data before training.
  • We may not increase the accuracy much, but we become much more robust to perturbations. This is the idea behind NL-means.

Discussion

  • Center the data before computing the cosine similarity?
  • Tune hyper-parameters in the context of noisy data.
  • Number of good predictions in majority voting.
  • Do not necessarily show the baseline. 😉

Experiments

  • (m) Influence of graph parameters.
  • (l) Influence of ld (small dataset).
  • (k) Influence of ls (small dataset).
  • (j) Influence of lg (small dataset, rtol=1e-5).
  • (i) Influence of convergence (small dataset, lg=100).
  • (h) Influence of lg, half dataset (500 songs, 644 frames, m=256), rtol=1e-3. Ran for 3h30.
    • Observations:
      • lg=0.1: 2208s, 5 outer, sparsity 6.6%, 78 (+/- 3.8)
      • lg=0.5: 1860s, 4 outer, sparsity 8.5%, 77 (+/- 4.9)
      • lg=1: 2118s, 5 outer, sparsity 8.2%, 78 (+/- 6.7)
      • lg=5: 1024s, 10 outer ?, sparsity 40%, 77 (+/- 6.1) (bad convergence)
      • lg=10: 1010s, 10 outer ?, 49.8%, 76 (+/- 6.0) (bad convergence)
      • lg=20: 973s, 10 outer ?, 64.3%, 75 (+/- 5.5) (bad convergence)
      • lg=50: 989s, 10 outer ?, 70%, 73 (+/- 4.5) (bad convergence)
    • Conclusions:
      • Atoms are all similar as we use the same initialization (via the RNG seed), except for the runs that did not converge.
      • lg seems to have some impact, larger than ld.
      • Problem: the Dirichlet energy does not count for much as it is two orders of magnitude smaller.
  • (g) Influence of lg, small dataset (500 songs, 149 frames, m=128), rtol=1e-3.
    • lg=1: 292s, 3 outer, sparsity 10.3%, 72 (+/- 4.6)
    • lg=10: 330s, 3 outer, sparsity 17.1%, 69 (+/- 5.4)
    • lg=100: 120s, 1 outer, sparsity 78.5%, 55 (+/-  5.9)
  • (f) More efficient computation of \(tr(ZLZ^T)\).
    • Original np.trace(Z.dot(L.dot(Z.T))): 286s, 72 (+/- 5.2), single eval 49.6 µs
    • Hadamard product np.multiply(L.dot(Z.T), Z.T).sum(): 282s, 72 (+/- 6.9), single eval 34.6 µs
    • Einstein summation np.einsum('ij,ji->', L.dot(Z.T), Z): 279s, 73 (+/- 5.5), single eval 30.6 µs
    • Speed increase: 2%
  • (e) Graph Laplacian as float32.
    • HDF5: 50MB to 40MB (500 songs, 149 frames)
    • float64: 296s, 72 (+/- 6.4)
    • float32: 283s, 72 (+/- 8.1)
    • Speed increase: 4%
  • (d) Runtime test.
    • 500 songs, 149 frames, m=512, rtol=1e-5: 7902s, 74 (+/- 3.8)
    • 500 songs, 149 frames, m=128, rtol=1e-3: 279s, 73 (+/- 5.5)
  • (c) Test on 500 songs with \(\lambda_g=1\).
    • Converge after 7 outer too.
    • Same objective values.
    • The Dirichlet penalty is an order of magnitude lower.
    • A bit less sparse (because of increased smoothing): 1.5% –> 1.8%.
    • Atoms look similar.
    • Accuracy increase of 1-2%. 80 (+/- 4.0) without normalization, 81 (+/- 3.4) with minmax.
    • Slower: from 5h30 to 8h15 to extract features.
  • (b) Introducing the Dirichlet energy in the objective, small dataset.
  • (a) Fix the graph creation.

Meeting 10 (05.06)

Goal: discussion of the results.

Achieved

  • Minimize Z first.
  • Initialize Z with zeros.
  • Run dozens of experiments on the cluster (200 and 500 songs). A new instance is capable of working with the full dataset (1000 songs) with 30GB of RAM.
  • Test hyper-parameters with fewer frames but more genres (to reduce cross-validation variance). It also makes the problem harder, as we are near perfect accuracy with two classes.
  • Create the graph (audio_graph.ipynb).

Main results

  • Feature extraction (only Z and D) improves accuracy over raw CQT spectrograms.
    • 2 genres: from 96 (+/- 4.7) to 99 (+/- 3.2)
    • 5 genres: from 76 (+/- 3.1) to 79 (+/- 2.7)
    • 10 genres: from 56 (+/- 5) to 65 (+/- 3.6) (LeCun is at 79-83, best is 93)
  • Outside of extreme values, ld does not seem to have a large impact.
  • Around 10 minutes to run a KNN search (with FLANN) and create the graph for the complete dataset (around 1.2 million nodes with dimensionality 96).
  • Training time is proportional to \(\lambda_d\) and linear in m.
  • OpenBLAS is 1.4 times faster than ATLAS.
  • Even if the objective function is not convex, we fall into similar local minima despite random initialization.

Experiments

  1. (a) Minimize for D first or Z first. Z first is needed if we initialize Z with zeros or project the columns of D into the ball (not onto it). Not much difference in runtime. Z first looks better on the convergence graph if initialized with zeros (instead of random values).
  2. (b) Cluster: preprocessing of the whole dataset and feature learning on 200 songs. It is 4-5 times faster than my laptop. The matrix multiplication performed by ATLAS is however not multi-threaded.
  3. (c) Is normalization needed on Z ? No, because it has the same energy as X (due to the constraint on D), which is normalized.
  4. (d) Train an encoder (with le=ld=10): it does not seem to alter the performance much (either accuracy or training speed). The value of \(\|Z-EX\|_F^2\) is indeed an order of magnitude smaller than the other objectives, which indicates that it has little impact. Z is however two times less sparse. The purpose of E is thus solely to speed up the inference of Z. We need, however, to experiment with other values of le.
  5. (e) Learn a dictionary over 1000 songs: not enough memory, even with the 20GB of memory of my cluster instance.
  6. (f) Learn a dictionary and classify for 500 songs. Feature extraction (only Z and D) improves performance from 76 (+/- 3.1) (using spectrograms) to 79 (+/- 2.7) for 5 genres.
  7. (g) Influence of N_outer. Just be sure it has converged. With rtol=1e-5 it converges after 11 iterations for 200 songs and 7 for 500 songs. Set to 15 to be sure (additional iterations are cheap).
  8. (h) Influence of ld. Something around 10 sounds good.
    • ld=10: 11 outer is enough, data and prior terms on the same order of magnitude, sparsity 3%.
    • ld=100: 15 outer is not enough, data term an order of magnitude lower than prior, all objectives higher than ld=10, sparsity 33%, atoms seem noisy. Training 2x slower than ld=10. But same classification accuracy. It seems too high, however.
    • ld=1: 6 outer is enough, prior term an order of magnitude lower than data, all objectives lower than ld=10, sparsity 0.1%, peaks in atoms are smaller. Training 5x faster than ld=10. The classification accuracy however severely drops to 58%, probably due to a too sparse representation. Definitely too low.
  9. (i) Influence of m. In the end we’ll use 512 with 1000 songs.
    • m=512: 8 outer is enough, 4x slower (linear to m), higher prior and lower data objectives, sparsity 1.1%, classification accuracy climbs to 99 (+/- 3.2).
  10. (j) Learn a dictionary over 1000 songs (ok with 30GB of RAM, takes 10 hours). Accuracy of 65 (+/- 3.6).
  11. (k) ATLAS vs OpenBLAS. ATLAS (the version packaged by Ubuntu) is mono-threaded while OpenBLAS distributes the computation over the 12 cores. Note that large matrix multiplications are however memory bandwidth limited, so the cores do not run at 100%. OpenBLAS is 1.4 times faster. Use a common seed?
    • OpenBLAS: 4379s
    • ATLAS: 6152s
  12. (k) Influence of Nvectors. Maybe there are too few features with 1 vector. We should test that on the full dataset. We will work with Nvectors=6, as we see no improvement over the baseline with Nvectors=1 and thus cannot interpret the effect of the hyper-parameters.
    • Nvectors=6 (Nframes=149): 74 (+/- 4.3), baseline (Xs) 68 (+/- 6.1)
    • Nvectors=1 (Nframes=149): 65 (+/- 4.9), baseline (Xs) 65 (+/- 3.9)
  13. (l) Influence of convergence (rtol). Double training time for 1%. We’ll decrease rtol for final accuracy measurements.
    • rtol=1e-3: 718s, 3 outer is enough, objective 1.138e5, sparsity 4.6%, 74 (+/- 4.5)
    • rtol=1e-5: 4379s, 9 outer is enough, objective 1.016e5, sparsity 1.6%, 74 (+/- 4.3)
    • rtol=1e-7: 9817s, 15 outer is not enough, objective 1.005e5, sparsity 1.4%, 75 (+/- 4.6)

Observations

  • No need to constrain E. All its values are smaller than 1 because X is normalized. As D is constrained (normalized in practice because of \(\|Z\|_1\)), it does not change the energy, thus Z has the same energy as X. That would leave the single objective \(\|Z-EX\|_F^2\) when minimizing for E. Use a second-order method like Newton instead of (slow) gradient descent? (See the sketch after this list.)
  • We spend more time optimizing Z than D, while D is what we actually want to learn during training.
  • The variance in cross-validation is high if the number of genres / songs is low.
  • Z is indeed much more sparse than X.
  • Outer convergence is reached when each inner minimization converges immediately (one step).
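
To illustrate the observation above: if the constraint and the \(\|EX\|_1\) term are dropped, minimizing \(\|Z-EX\|_F^2\) for \(E\) is an ordinary least-squares problem with the closed-form solution \(E = ZX^T(XX^T)^{-1}\), so neither gradient descent nor Newton is strictly needed. A sketch with random data and the math-notation shapes (X is n x N, Z is m x N, E is m x n):

```python
import numpy as np

n, m, N = 96, 128, 1000
X = np.random.randn(n, N)
Z = np.random.randn(m, N)

# Solve X^T E^T = Z^T in the least-squares sense, then transpose back.
E = np.linalg.lstsq(X.T, Z.T, rcond=None)[0].T

# At the optimum the gradient (Z - EX) X^T vanishes (up to round-off).
print(np.abs((Z - E.dot(X)).dot(X.T)).max())
```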

Ideas

  • Run KNN only once. Then we can change the weights (in response to changes of \(\sigma\)).
  • Normalize X in audio_preprocessing? Then there is no scaling in audio_classification or audio_features.
  • Write a script that runs audio_features and several audio_classification runs with configurable parameters (ld, m, N) to ease testing and hyper-parameter tuning. audio_classification should classify with Xs, Z (and the approximate Z).
  • Tune \(\lambda_e\) such that all objectives are in the same order of magnitude ?
  • Test the Laplacian by introducing noise in the signal used to construct the graph (moving it off the manifold) and optimizing for smoothness using the constructed Laplacian. It should push the points back to the manifold (see the sketch after this list).
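
A possible sketch of that test (hypothetical, not project code): smooth the noisy signal with the Laplacian by minimizing \(\|z-x\|_2^2 + \gamma\, z^TLz\), whose closed-form solution is \(z = (I+\gamma L)^{-1}x\). If the graph captures the manifold, the solution should move the noisy points back towards it.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def denoise_on_graph(L, x_noisy, gamma=1.0):
    """L: sparse N x N combinatorial Laplacian, x_noisy: graph signal of
    length N (apply dimension by dimension for multi-dimensional features)."""
    N = L.shape[0]
    A = sp.identity(N, format='csc') + gamma * L.tocsc()
    return spsolve(A, x_noisy)   # minimizer of ||z - x||^2 + gamma * z^T L z
```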

Discussion

  • How to obtain an undirected graph (i.e. a symmetric weight matrix)? Do we even want one? Yes. We will use \(\max(W, W^T)\) (see the sketch after this list).
  • Use a TV regularization instead of the Dirichlet energy. Define the continuous and discrete TV, then how to solve it.
  • Approximate TV by only taking low frequencies into account. We would do it by filtering out the highest eigenvalues of the Laplacian matrix.
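
A sketch of that construction, using scikit-learn's exact kNN search for illustration (the project uses FLANN for the approximate search): Gaussian edge weights with bandwidth \(\sigma\), the \(\max(W, W^T)\) symmetrization, and the combinatorial Laplacian \(L = D - W\).

```python
import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import kneighbors_graph

def build_laplacian(X, k=10, sigma=1.0):
    """X: (N samples x n features). Returns the sparse combinatorial Laplacian."""
    W = kneighbors_graph(X, n_neighbors=k, mode='distance')
    W.data = np.exp(-W.data**2 / sigma**2)    # Gaussian kernel weights
    W = W.maximum(W.T)                        # symmetrize: undirected graph
    degrees = np.asarray(W.sum(axis=1)).squeeze()
    return sp.diags(degrees) - W              # L = D - W

L = build_laplacian(np.random.rand(500, 96), k=10, sigma=0.5)
```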

11m_discussion

Meeting 9 (29.05)

Goal: discussion of classification accuracy using the features extracted by our auto-encoder.

Related notebooks: audio_features, audio_classification

Achieved

  • Resolved memory exhaustion.
    1. Prevent data copy in the PyUNLocBoX toolbox. The toolbox was internally storing multiple copies of the data when instantiating functions and solvers. Store a reference instead. See git commit for further details. The memory consumption of our application was then divided by 5. No significant impact on speed.
    2. Use 32-bit floating point instead of 64-bit. It saves half the memory and increases the speed by a factor of 1.6. We are no longer memory bandwidth limited. Computations almost always fully load two CPU cores.
  • Projection into the L2 ball, not onto its surface. It is a convex constraint. No significant improvement in speed.
  • Insert the energy auto-encoder in the pipeline.

Discussion

  • Data normalization: features or samples ?
  • Decrease dimensionality for faster experiments.
  • Approaches for faster convergence.

Next

  1. Experiment with hyper-parameters.
  2. Create the graph.
  3. Insert the Dirichlet energy in the objective.

10d_discussion

Meeting 8 (04.05)

Goal: current implementation state (auto-encoder, classification) and preliminary classification results. Discuss music conferences and schedule.

Related notebooks: audio_classification, auto_encoder, audio_features.

Achieved

  • Complete the description of the pre-processing steps in the paper.
  • Store data in sklearn format, i.e. \(R^{N \times n}\), which is the transpose of our math notation. Redo the pre-processing on the entire dataset.
  • Implementation of post-processing, i.e. classification: feature aggregation, feature vectors visualization, features rescaling, labels generation, linear SVM, majority voting, random train and test sets splitting, cross-validation.
  • Classification results: see the audio_classification notebook for results on accuracy and speed for 2, 4 or 10 genres using spectrograms or raw audio, various classifier implementations and diverse scaling methods.
  • See also the exact methodology and various observations and ideas.
  • Implementation of the auto-encoder as a sklearn Estimator class. It eases the integration with scikit-learn and enables its use in a sklearn Pipeline to avoid transductive learning. See the auto_encoders notebook for the implementation, comparison_xavier for a usage example, and the interface sketch after this list.
  • Performance boost of a factor 10 by using ATLAS or OpenBLAS instead of numpy’s own BLAS implementation. Half of the execution time is now used for matrix multiplications by numpy.core._dotblas.dot. The limiting factor for speed is now the memory bandwidth. ATLAS performs multi-threaded matrix multiplication.
  • Unsupervised feature extraction on a reduced dataset (2 genres, 10 clips) using the new implementation of the auto-encoder. See the audio_features notebook for details.
  • Memory exhaustion when working on the full dataset. Three ways to circumvent it:
    1. Work on a reduced dataset. For now only 2 genres of 10 clips pass. It is not sufficient for significant accuracy measurements.
    2. Store \(X\) and \(Z\) on disk via HDF5. As we are already bandwidth limited, introducing SATA and the SSD in the way may further decrease performance.
    3. As each column is independent when minimizing for \(Z\) and each row is independent when minimizing for \(D\), we can independently work on a subset of the problem in RAM and keep the whole data \(X\) and \(Z\) on disk.
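
An interface sketch of that wrapping (not the actual auto_encoders implementation; all names and hyper-parameters are illustrative). The point is that fit() only sees the training folds when the estimator is placed in a Pipeline, which avoids transductive learning.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class SparseAutoEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, m=128, ld=10.0, rtol=1e-3):
        self.m = m          # number of atoms
        self.ld = ld        # weight of the sparse coding term
        self.rtol = rtol    # relative stopping tolerance of the outer loop

    def fit(self, X, y=None):
        # The real fit() alternates the convex minimizations over Z, D (and E).
        n_features = X.shape[1]
        self.D_ = np.random.randn(n_features, self.m)   # placeholder dictionary
        return self

    def transform(self, X):
        # The real transform() infers Z by sparse coding (or applies the encoder E).
        return X.dot(self.D_)                            # placeholder features

# e.g. Pipeline([('features', SparseAutoEncoder()), ('svm', LinearSVC())])
```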

Discussion

  • We were not ready in time for ISMIR 2015 (h5-index of 31). We will submit to ICASSP 2016 (h5-index of 47). The deadline is September 25th.
  • Mitigation of memory usage: reduce the number of frames per clip.
  • Schedule: report deadline June 19th, oral defense mid-July, conference paper September 25th.
  • Can Xavier be the expert? We could then do the defense at the end of June.

Next

  • Classify data after unsupervised feature extraction.
  • Observe what the dictionary learned.
  • Look for an increase in accuracy.

9i_discussion

Meeting 7 (17.04)

Goal: comparison with Xavier’s implementation of primal-dual, audio pre-processing.

Related notebooks: comparison_xavier, gtzan, audio_preprocessing.

Achieved

  • Prototype on images:
    • The dictionary is indeed full. The constraint \(\|d_j\| \leq 1\) instead of \(\|d_j\| = 1\) is however more efficient as it is convex.
  • Comparison with Xavier’s implementation and dataset (each patch has a zero mean, one patch per pixel).
    • Obtained dictionary is very similar despite very different implementations (algorithms and platform).
    • My implementation is slower because of numpy’s implementation of BLAS operations. We will switch to ATLAS or OpenBLAS.
    • My Z is sparser despite a larger \(\|Z\|_1\) objective.
    • Why use an under-complete dictionary, i.e. \(m < n\)? For speed only.
    • More iterations give better convergence, i.e. sharper atoms.
  • Improvements to the description of the optimization schemes and the graph definition.
  • Implementation of the audio pre-processing steps: frames and CQT. See the audio_preprocessing notebook for details, and the sketch after this list.
  • Convert the GTZAN dataset to an HDF5 store. Load and save all subsequent data in HDF5. See the gtzan notebook for details.
  • Many improvements to the PyUNLocBoX. See the repository for details.
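
A hedged sketch of those steps using librosa (the actual notebook may rely on a different CQT implementation and parameters; the 96 bins match the feature dimensionality reported later, while the file name, sample rate and hop length are purely illustrative):

```python
import numpy as np
import librosa

# Load one GTZAN clip and compute the magnitude of its constant-Q transform.
y, sr = librosa.load('blues.00000.au', sr=22050)
C = np.abs(librosa.cqt(y, sr=sr, hop_length=512,
                       n_bins=96, bins_per_octave=12))   # shape (96, n_frames)

# Each column is one CQT frame; the frames are the data samples / graph nodes.
X = C.T                                                   # shape (n_frames, 96)
```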

Discussion

  • LCN can be viewed as one step of inverting the heat equation. Similar to shock filters [1].

References

[1] Osher, S., & Rudin, L. I. (1990). Feature-oriented image enhancement using shock filters. SIAM Journal on Numerical Analysis, 27(4), 919-940.

Meeting 6 (02.04)

Goal: discussion of the first prototype implementation of our auto-encoder and the learned atoms on image patches.

Related notebooks: formulation, image.

Achieved

  • Implementation of a prototype in Python.
    • Optimizing for Z last leads to a lower global objective function (two Z terms, only one D).
    • The solution does not depend too much on the initialization, even if the global problem is not convex.
  • See the formulation notebook for a construction of the algorithm from the base (linear least-squares fitting) to the version with an encoder (no graph yet). Tests on synthetic examples are included.
  • See the image notebook for an application to image data along with observations. We try here to recover Gabor filters.
  • Results from Xavier: I tested the two constraints on the dictionary: (1) ||d_j|| = 1 and (2) ||d_j|| <= 1. The algorithm is 2x faster with (2)! I think this simply comes from the fact that constraint (2) is convex, unlike (1).

Discussion

  • \(E\) seems to approach \(D^T\) when it is less constrained, i.e. \(m \gg n\). This was due to an implementation error which constrained the L1 norm of the whole matrix instead of independently constraining each column.
  • The norms of the used atoms are oddly not 1 (which should minimize the objective function). This was caused by the aforementioned bug.
  • Centric atoms seem to appear when the dictionary is more constrained.
  • Measure how similar \(E\) is to \(D^T\).

Notes

  • FISTA is not guaranteed to be monotone.

Meeting 5 (27.03)

Goal: final review of the optimization schemes. We now move on to implementation (toward the ISMIR paper) and will come back to optimization later.

Achieved

  • Review of the proposed optimization scheme by Xavier which uses primal-dual (Chambolle-Pock) [5] for the three convex minimizations.
  • Complete batch training optimization scheme:
    1. \(\min_D\) : FISTA [3].
    2. \(\min_Z\) : FISTA with adaptive restart [4] (see the sketch after this list).
    3. \(\min_E\) : A forward-backward based primal-dual formulation which exploits the gradient (unlike Chambolle-Pock). It is a generalization of Chambolle-Pock and ISTA [2]. Unlike these two, the method is however not accelerated; variants exist for this purpose.
  • Characteristics:
    1. No trick, the formulation fits each solver.
    2. Optimal convergence in \(O(1/k^2)\) for \(Z\) and \(D\).
    3. Linear convergence for \(Z\) after some iterations thanks to adaptive restart.
    4. All available gradients are exploited.
    5. Each minimization is distributable, i.e. the sub-problems are independent. The matrices only need to be reunified in the outer loop.
    6. Easy to add the \(tr(ZLZ^T)\) regularization. We would however lose the parallelization ability by definition of the problem (smooth gradient for each dimension of the sparse code).
  • The paper git repository is now on GitLab (was previously local).
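
For reference, a minimal FISTA sketch for the \(Z\) sub-problem \(\min_Z \|X-DZ\|_F^2 + \lambda\|Z\|_1\) (the actual implementation relies on the PyUNLocBoX solvers and adds the adaptive restart of [4], omitted here for brevity):

```python
import numpy as np

def soft_threshold(V, tau):
    """Proximal operator of tau * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(V) * np.maximum(np.abs(V) - tau, 0)

def fista_z(X, D, lam=1.0, n_iter=100):
    """Minimize ||X - D Z||_F^2 + lam * ||Z||_1 over Z."""
    step = 1.0 / (2 * np.linalg.norm(D, 2)**2)   # 1/L, L = Lipschitz constant of the gradient
    Z = Y = np.zeros((D.shape[1], X.shape[1]))
    t = 1.0
    for _ in range(n_iter):
        grad = 2 * D.T.dot(D.dot(Y) - X)                     # gradient of the data term at Y
        Z_new = soft_threshold(Y - step * grad, step * lam)  # forward-backward step
        t_new = 0.5 * (1 + np.sqrt(1 + 4 * t**2))
        Y = Z_new + (t - 1) / t_new * (Z_new - Z)            # Nesterov extrapolation
        Z, t = Z_new, t_new
    return Z
```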

Discussion

  • Why not set \(E = D^T\)? Check mathematically or with experiments whether it is equivalent.
  • Why constraints on rows of \(E\) ? It was a mistake by Xavier, they should be on columns.
  • Discussion of my optimization scheme.
  • How to clearly state the rows and columns of matrices?
  • pyGSP status: it can create KNN graphs and compute the graph Laplacian. There are some unit tests, but it is preferable to compare with MATLAB.
  • Literature reviews:
    1. Are other people working on clean optimization schemes for deep learning?
    2. Optimization schemes for sparse coding. Does everybody use FISTA for Z? What are they using for D?
  • Agenda:
    1. Auto-encoder prototype for images (easier to build intuition).
    2. Setup: pre-processing and post-processing.
    3. Merge 1 and 2: results with genre recognition.
    4. Interpret results (classification accuracy, learned basis).
    5. Write paper (6+1 pages): abstract April 27th, deadline May 4th.

Next week

  • Start the implementation of a prototype in Python for images.
  • Write down the graph definition.
  • Prepare a new revision of the paper for review.
  • Continue to read the amassed papers.

Notes

  • Another convex optimization algorithm by Nesterov [1].
  • Nice discussion of duality with algorithms in [2].
  • Other methods to investigate for \(\min_E\): SDMM (a parallel generalization of ADMM) [6], ppxa [7], generalized forward-backward [8]. They are slow, but no optimal convergence is possible anyway. Another solution: drop \(\|EX\|_1\) (and solve with FISTA), drop the constraint (and solve with accelerated Chambolle-Pock), or drop both (and solve by matrix inversion or gradient descent). Fast convergence in all cases.
  • Xavier: the Lipschitz constant is less important for Chambolle-Pock than it is for FISTA.

References

[1] Nesterov, Y. (2007). Gradient methods for minimizing composite objective function.
[2] Komodakis, N., & Pesquet, J. C. (2014). Playing with duality: An overview of recent primal-dual approaches for solving large-scale optimization problems. arXiv preprint arXiv:1406.5429.
[3] Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183-202.
[4] O’Donoghue, B., & Candes, E. (2013). Adaptive restart for accelerated gradient schemes. Foundations of computational mathematics, 1-18.
[5] Chambolle, A., & Pock, T. (2011). A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1), 120-145.
[6] Pesquet, J. C., & Pustelnik, N. (2012). A parallel inertial proximal optimization method. Pacific Journal of Optimization, 8(2), 273-305.
[7] Combettes, P. L., & Pesquet, J. C. (2008). A proximal decomposition method for solving convex variational inverse problems. Inverse problems, 24(6), 065014.
[8] Raguet, H., Fadili, J., & Peyré, G. (2013). A generalized forward-backward splitting. SIAM Journal on Imaging Sciences, 6(3), 1199-1226.

6e_agenda_optim

Meeting 4 (20.03)

Goal: final discussion about the optimization schemes.

Achieved

  • Prepared and installed new computer.
  • Improved the draft paper with feedback from Xavier and Pierre.
  • Refined the two optimization schemes. Complete version of the online training optimization scheme.
  • Study of proximal splitting algorithms for convex optimization:
    • Forward-backward splitting: FISTA (dual prox), Primal-Dual (Chambolle-Pock)
    • Douglas-Rachford splitting: ADMM (uses augmented Lagrangian)
  • Completed the manifold learning section of the draft paper: better motivation. Various small improvements.

Discussion

  • Optimization
    • We should not search the optimal encoder with \(\min\limits_E \|z-Ex\|_2^2 + \|Ex\|_1\). The system is largely underdetermined as we only have one example. We should instead work with \(X\) and \(Z\).
    • The batch scheme is well-posed, good for math. Furthermore it allows a parallel with LeCun (through ADMM), but it may be slow due to the two nested loops. It can be accelerated by not finding the exact solution of each sub-problem, but stopping after some iterations with a good enough approximation. The online scheme targets implementation (approximate and faster).
    • Xavier proposed a primal-dual algorithm to minimize for \(D\). We may use PD for everything, even \(Z\) (we would have used FISTA).
    • Three properties of non-convex optimization: monotony (of the energy function), convergence (local, global, critical points) and speed. Convergence is hard to prove.
    • The multivariate non-convex optimization is guaranteed to be monotone if each convex sub-problem is optimally solved.
    • We should validate our algorithm with ground-truth experiments. We may use the Fourier transform and its inverse for \(D\) and \(E\). Then we can choose \(X\) and directly compute \(Z\). Our algorithm should then be able to retrieve one of the unknowns knowing the two others (see the sketch below).
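
A sketch of that sanity check, using the unitary DFT (illustrative only): set \(E\) to the DFT matrix and \(D\) to its inverse, pick \(X\) freely, compute \(Z = EX\), and verify that \(DZ = X\); the alternating algorithm should then recover any one of \(X\), \(Z\), \(D\) given the other two.

```python
import numpy as np
from scipy.linalg import dft

n, N = 64, 100
F = dft(n, scale='sqrtn')        # unitary DFT matrix
E = F                            # analysis operator
D = F.conj().T                   # synthesis operator, D E = I

X = np.random.randn(n, N)        # freely chosen data
Z = E.dot(X)                     # ground-truth codes
print(np.allclose(D.dot(Z), X))  # True: perfect reconstruction
```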

Next week

  • Properly write down the complete optimization scheme: an outer loop with 3 inner loops (PD for D and E, FISTA/PD for Z).
  • Write down the graph definition.
  • Prepare a new revision of the paper for review.
  • Start to implement.
  • Continue to read the amassed papers.

5_optim