## DeepWalk: Online Learning of Social Representations

Discussed April 15th, presented by Nauman, notes by Michaël

### Summary

Goal: embed a graph in an Euclidean domain to use the structure as additional features (as the eigenvectors would give, but computationally efficient, parallel / local and allowing online update)

Basic idea: apply word2vec [2] to non-NLP data.
Insights:
* Language modeling: a sentence can be seen as a random walk on an unobserved vocabulary graph.
* Graph embedding: a random walk is like a sentence of an imaginary language.

Notation: node v in V, embedding phi(v) in R^d

Traditional language model: Pr[v_i | phi(v_1), …, phi(v_i-1)]
i.e. predict the next word v_i given some preceding words

word2vec / SkipGram relaxation [2]: mult_j Pr[v_j | phi(v_i)] (1)
i.e. predict the context (all j in a window) given the embedding of v_i

Random walks are used to extract those contexts from the graph structure.

Goal of Hierarchical Softmax [3]: reduce the O(|V|) cost of (1), because you need to compare phi(v_i) to all the other embeddings phi(v), i.e. a |V|-class classifier, by approximating it as a serie of O(log |V|) binary classifiers. It is constructed as a binary tree where each node b is also embedded as psi(b) in R^d. Contrary to language models, this tree is easy to build and its structure does not change during learning because we know the graph. It is simply a hierarchical coarsening of the graph: pairs of nodes are grouped hierarchically. So first layer has |V| nodes, second as |V|/2, third has |V|/4 etc. There are log |V| such layers.

Learning:
1) a vertex v is selected at random
2) its context is extracted by a windowed random walk
3) SGD with back-propagation in the tree is used to update the graph embedding phi(v) and the representations psi(b) of the traversed nodes b.

In the end you learned a multi-scale embedding of the graph ! 🙂

### References

[1] Perozzi, B., Al-Rfou, R., and Skiena, S. (2014). DeepWalk: Online Learning of Social Representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (New York, NY, USA: ACM), pp. 701–710.

[2] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., and Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, eds. (Curran Associates, Inc.), pp. 3111–3119.

[3] Mnih, A., and Hinton, G.E. (2009). A Scalable Hierarchical Distributed Language Model. In Advances in Neural Information Processing Systems 21, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, eds. (Curran Associates, Inc.), pp. 1081–1088.

## Revisiting Semi-Supervised Learning with Graph Embeddings

Discussed April 1st, presented by Vassilis, notes by Vassilis

### Summary

The paper we reviewed today is not as close to our stuff as we thought initially… It is actually quite
complicated and it builds a lot on other people’s works that are already complex. Some interesting
points:
1. the graph is given from outside, not created from the standard data matrix. It encodes what they call “context”, that depends on the application. I.e. the context of a node is its neighboring nodes.
2. it is a kind of simple neural network that passes both the input data and embeddings (to be learned) through a non-linear function (max(0, x)), and finally from a softmax to predict the labels. They only used one hidden layer.
3. The graph is only used in learning the embeddings, but the embeddings are used both for graph context prediction AND for class label prediction (therefore both loss functions are used for learning the embeddings).
4. For input queries that don’t reside on nodes of the graph, the embeddings can be computed as the outputs of a hidden layer fed with the input query.
5. to train using the graph they uniformly sample a short random walk from the graph, then from this they pick two nodes with a maximum distance of d hops. This is done because getting all possible pairs is too expensive and this sampling seems more efficient for (block) stochastic gradient descent.

### Notes

Explanations of word2vec:

### References

[1] Yang, Z., Cohen, W., and Salakhutdinov, R. (2016). Revisiting Semi-Supervised Learning with Graph Embeddings. arXiv:1603.08861 [Cs].

### Repositories

1. Paper (LaTex)
2. Code (IPython notebooks)
3. Results (IPython notebooks)

### Notebooks

1. Algorithm derivation: from sparse coding to encoder learning through dictionary learning
2. Implementation of our auto-encoder model
3. Learning Gabor filters from image data
4. Solver comparison (with Xavier matlab primal-dual formulation)
5. Genre recognition
6. An experiment launcher which allows to test the influence of an hyper-parameter on various metrics.

### Results

See the experiments.

## Meeting 12 (19.06)

Goal: report and results review.

### Achieved

• Exploit the idea of the structuring auto-encoder as an hybrid between a sparse and a denoising auto-encoder.
• Disable majority voting: smaller variance due to increased number of samples to classify (i.e. 12 times more). The accuracy increases: it is easier to classify feature vectors than whole clips.

### Main results

• Introducing the graph Laplacian makes the model very robust against noise.
• Improvement of 7% in a noisy or noiseless environment. However in the noiseless environment the addition of the graph is not significant.
• The kNN approximation has an influence on the constructed graph and the accuracy of at least 1.3%. There is room for better performance with a better constructed graph. The quality of the FLANN approximation should be investigated.
• The hyper-parameters do not seem to be very sensible, i.e. we may not have to fine-tune them.
• Speed:
• Noiseless: graph generation is fast (~200s), feature extraction is slow (~2700s)
• Noisy 10%: graph generation is slow (~1200s), feature extraction is fast (~900s)
• Classification: ~400s with graph, ~700s without graph.
• The encoder is adding structure.

### Idea

• It works best when all the objectives have an equivalent value –> auto-tuning of the hyper-parameters ! Can be approximately assessed after one outer iteration already.
• Higher the $\lambda$, higher the importance of the sub-objective. Try to give equal weights.
• Train the auto-encoder (graph, dictionary, encoder) on clean data and test on noisy data.
• Go further and classify individual frames ? That would mean to discard feature aggregation.
• Use NLDR to assess if there is really an underlying structure / manifold in the data ?
• Would be interesting to launch multiple simulations with the same hyper-parameters without seeding the RNG to see if the results are stable.

### Experiments

• Keep smallest eigenvalues.
• Hyper-parameters fine-tuning.
• Increase sparsity in the presence of noise.
• Accuracy results to be published (report, presentation).
• (k) Without noise.
• (k) With 20% noise.
• (j) Encoder.
• Performance stability (over various sources of randomness).
• (i) Distance metric.
• Zero mean for euclidean.
• (h) Graph weight lg, noiseless setting.
• Training / testing ratio.
• (g) Without graph, without noise.
• (g) With graph, without noise.
• Graph vs no graph vs spectrograms.
• (f) With noise.
• (f) Without noise.
• Better graph
• Scaling (features, samples, dataset). minmax vs std.
• Disable feature scaling. Then scale the noise ! And verify tolerance.
• (e) Distance metric.
• (d) Noise level.
• (c) Classify individual feature vectors, i.e. no majority voting.
• (b) Understand the influence of the graph in a noisy data setting. Warning: the graph was created with noiseless data.
• (a) Noisy signal hypothesis.

### Discussion

• Au final le résultat sera un plot de baseline, no graph et graph w.r.t. noise level. Conclusion: la préservation de la structure (1) extrait de meilleure features et (2) est robuste au bruit
• Et si chaque feature est normalisée indépendamment. Ça enlève le biais envers les features à forte variation, mais est-ce que c’est ce qu’on veut ? Xavier: “oui dans le cas general.”.
• Xavier: “on peut appliquer le fameux pagerank regularization, qui est tres robust a des graphs mal construits”.
• Fait intéressant: il semble que l’encodeur ajoute de la structure ! Car l’énergie de Dirichlet diminue.
• Xavier: “oui, mais tu ne dois pas oublier que tu fais du *transductive* learning, c’est a dire tu apprends les features en utilisant training + TEST data. C’est pour cela que tu n’as pas une amelioration significative. Si on faisait du *supervised* learning, c’est a dire on utilise seulement les TRAINING data, alors ce probleme est bcp plus challenging que le transductive probleme, et la je pense que nos resultats avec graph seraient bien meilleures! C’est un commentaire que je te conseille d’ajouter a ton resultat pour le mettre en perspective. Aussi la premiere chose a faire apres le PDM est de faire du *supervised* learning avec graph et le comparer avec no graph, je pense que l’on aura des (bonnes) surprises!”
• Is there anything we can say about good enough local minima ? Xavier: “oui, il peut y avoir des local minima qui sont d’excellentes solutions a des learning problems.”.
• Or that many local minima are actually similar and we don’t care in which we fall ? Xavier: “NO! a bad solution and a good solution can have the same energy!”.
• Why energy formulations are good ? Xavier: “Many reasons s.a. good understanding, robustness, existence of solutions, design and analysis of optimization algorithms…”.

## Meeting 11 (15.06)

Goal: results discussion and review of the report

### Achieved

• Fix various issues with the graph creation (audio_graph.ipynb). See git commits for further details.
• Integrate the Dirichlet energy (via the graph Laplacian) in the objective function (auto_encoder.ipynb).
• More efficient implementation of trace(Z.dot(L.dot(Z.T))).
• Run various experiments using the graph.
• Implement a relative stopping criterion for the outer loop.
• Make a script (audio_experiment.ipynb) to perform automated experiments, i.e. to test the effect of an hyper-parameter on various metrics (e.g. accuracy, speed, sparsity, objective etc.) by automatically generating plots. It also compares to the baseline.
• Take the mean of multiple 10-fold cross-validation runs as the main metric (audio_classification.ipynb).
• Add an ls (redundant) parameter to control the weight of the sparse codes (see experiment k).
• Complete reports of the newest results are now published online.

### Main results

• Introducing the Dirichlet energy (via the graph Laplacian) helps to extract more robust features in the setting of noisy signals.
• Introducing the Dirichlet energy (via the graph Laplacian) helps to extract better features:
• 5 genres: from 79 (+/- 2.7) to 81 (+/- 3.4)

### Idea

• Avec notre graphe, on devrait être plus robuste aux données perturbées n’est-ce pas ? (Puisqu’on impose de la régularité sur le manifold.) Est-ce qu’on aurait créé une sorte d’hybride entre un sparse et un denoising auto-encoder ? Le tout sans devoir perturber les données avant training.
• We may not increase much the accuracy, but become much more robust to perturbations. It is the idea behind NL-means.

### Discussion

• Center the data before measuring the cosine similarity measure ?
• Tune hyper-parameters in the context of noisy data.
• Number of good predictions in majority voting.
• Do not necessarily show the baseline. 😉

### Experiments

• (m) Influence of graph parameters.
• (l) Influence of ld (small dataset).
• (k) Influence of ls (small dataset).
• (j) Influence of lg (small dataset, rtol=1e-5).
• (i) Influence of convergence (small dataset, lg=100).
• (h) Influence of lg, half dataset (500 songs, 644 frames, m=256), rtol=1e-3. Ran for 3h30.
• Observations:
• lg=0.1: 2208s, 5 outer, sparsity 6.6%, 78 (+/- 3.8)
• lg=0.5: 1860s, 4 outer, sparsity 8.5%, 77 (+/- 4.9)
• lg=1: 2118s, 5 outer, sparsity 8.2%, 78 (+/- 6.7)
• lg=5: 1024s, 10 outer ?, sparsity 40%, 77 (+/- 6.1) (bad convergence)
• lg=10: 1010s, 10 outer ?, 49.8%, 76 (+/- 6.0) (bad convergence)
• lg=20: 973s, 10 outer ?, 64.3%, 75 (+/- 5.5) (bad convergence)
• lg=50: 989s, 10 outer ?, 70%, 73 (+/- 4.5) (bad convergence)
• Conclusions:
• Atoms are all similar as we use the same initialization (via RNG seed). Except the not converged ones.
• lg seems to have some impact, larger than ld.
• Problem: Dirichlet energy does not count for much as it is 2 order of magnitudes smaller.
• (g) Influence of lg, small dataset (500 songs, 149 frames, m=128), rtol=1e-3.
• lg=1: 292s, 3 outer, sparsity 10.3%, 72 (+/- 4.6)
• lg=10: 330s, 3 outer, sparsity 17.1%, 69 (+/- 5.4)
• lg=100: 120s, 1 outer, sparsity 78.5%, 55 (+/-  5.9)
• (f) More efficient computation of $$tr(Z^TLZ)$$.
• Original np.trace(Z.dot(L.dot(Z.T))): 286s, 72 (+/- 5.2), single eval 49.6 µs
• Hadamard product np.multiply(L.dot(Z.T), Z.T).sum(): 282s, 72 (+/- 6.9), single eval 34.6 µs
• Einstein summation np.einsum(‘ij,ji->’, L.dot(Z.T), Z): 279s, 73 (+/- 5.5), single eval 30.6 µs
• Speed increase: 2%
• (e) Graph Laplacian as float32.
• HDF5: 50MB to 40MB (500 songs, 149 frames)
• float64: 296s, 72 (+/- 6.4)
• float32: 283s, 72 (+/- 8.1)
• Speed increase: 4%
• (d) Runtime test.
• 500 songs, 149 frames, m=512, rtol=1e-5: 7902s, 74 (+/- 3.8)
• 500 songs, 149 frames, m=128, rtol=1e-3: 279s, 73 (+/- 5.5)
• (c) Test on 500 songs with $$\lambda_g=1$$.
• Converge after 7 outer too.
• Same objective values.
• The Dirichlet penalty is an order of magnitude lower.
• A bit less sparse (because of increased smoothing): 1.5% –> 1.8%.
• Atoms look similar.
• Accuracy increase of 1-2%. 80 (+/- 4.0) without normalization, 81 (+/- 3.4) with minmax.
• Slower: from 5h30 to 8h15 to extract features.
• (b) Introducing the Dirichlet energy in the objective, small dataset.
• (a) Fix the graph creation.

## Meeting 10 (05.06)

Goal: discussion of the results.

### Achieved

• Minimize Z first.
• Initialize Z with zeros.
• Run dozen of experiments on the cluster (200 and 500 songs). A new instance is capable of working with the full dataset (1000 songs) with 30GB of RAM.
• Test hyper-parameters with less frames but more genres (to diminish cross-validation variance). It also makes the problem harder, as we are near perfection with two classes.
• Create the graph (audio_graph.ipynb).

### Main results

• Feature extraction (only Z and D) improves accuracy over raw CQT spectrograms.
• 2 genres: from 96 (+/- 4.7) to 99 (+/- 3.2)
• 5 genres: from 76 (+/- 3.1) to 79 (+/- 2.7)
• 10 genres: from 56 (+/- 5) to 65 (+/- 3.6) (LeCun is at 79-83, best is 93)
• Outside extreme value, ld does not seem to have a large impact.
• Around 10 minutes to run a KNN search (with FLANN) and create the graph for the complete dataset (around 1.2 million nodes with dimensionality 96).
• Training time is proportional to $$\lambda_d$$, linear to m.
• OpenBLAS is 1.4 times faster than ATLAS.
• Even if the objective function is not convex, we fall into similar local minimums despite random initialization.

### Experiments

1. (a) Minimize for D first or Z first. Z first needed if we initialize Z with zeros or project columns of D on the ball (not int). Not much difference on runtime. Z first looks better on convergence graph if initialized with zeros (instead of random values).
2. (b) Cluster: preprocessing of the whole dataset and feature learning on 200 songs. It is 4-5 times faster than my laptop. The matrix multiplication performed by ATLAS is however not multi-threaded.
3. (c) Need normalization on Z ? No because they have the same energy as X (due to the constraint on D) which is normalized.
4. (d) Train an encoder (with le=ld=10): it does not seem to alter much the performance (either accuracy or training speed). The value of $$\|Z-EX\|_F^2$$ is indeed an order of magnitude smaller than the other objectives, which indicate that it has little impact. Z is however two times less sparse. The purpose of E is thus solely to speed up the inference of Z. Need however to experiment with other values of le.
5. (e) Learn a dictionary over 1000 songs: not enough memory, even with the 20GB of memory of my cluster instance.
6. (f) Learn a dictionary and classify for 500 songs. Feature extraction (only Z and D) improves performance from 76 (+/- 3.1) (using spectrograms) to 79 (+/- 2.7) for 5 genres.
7. (g) Influence of N_outer. Just be sure it has converged. With rtol=1e-5 it converges after 11 iterations for 200 songs and 7 for 500 songs. Set to 15 to be sure (additional iterations are cheap).
8. (h) Influence of ld. Something around 10 sounds good.
• ld=10: 11 outer is enough, data and prior terms on the same order of magnitude, sparsity 3%.
• ld=100: 15 outer is not enough, data term an order of magnitude lower than prior, all objectives higher than ld=10, sparsity 33%, atoms seem noisy. Training 2x slower than ld=10. But same classification accuracy. Seems however too much.
• ld=1: 6 outer is enough, prior term an order of magnitude lower than data, all objectives lower than ld=10, sparsity 0.1%, peaks in atoms are smaller. Training 5x faster than ld=10. The classification accuracy however severely drops to 58%, probably due to a too sparse representation. Definitely too low.
9. (i) Influence of m. In the end we’ll use 512 with 1000 songs.
• m=512: 8 outer is enough, 4x slower (linear to m), higher prior and lower data objectives, sparsity 1.1%, classification accuracy climbs to 99 (+/- 3.2).
10. (j) Learn a dictionary over 1000 songs (ok with 30GB of RAM, takes 10 hours). Accuracy of 65 (+/- 3.6).
11. (k) ATLAS vs OpenBLAS. ATLAS (the version packaged by Ubuntu) is mono-threaded while OpenBLAS distributes the computation on the 12 cores. Note that large matrix multiplication is however memory bandwidth limited so that the cores do not run at 100%. OpenBLAS is 1.4 times faster. Use a common seed ?
• OpenBLAS: 4379s
• ATLAS: 6152s
12. (k) Influence of Nvectors. Maybe there is too little features with 1 vector. We should test that on the full dataset. We will work with Nvectors=6 as we have no improvement over baseline with Nvectors=1 and thus cannot interpret the effect of hyper-parameters.
• Nvectors=6 (Nframes=149): 74 (+/- 4.3), baseline (Xs) 68 (+/- 6.1)
• Nvectors=1 (Nframes=149): 65 (+/- 4.9), baseline (Xs) 65 (+/- 3.9)
13. (l) Influence of convergence (rtol). Double training time for 1%. We’ll decrease rtol for final accuracy measurements.
• rtol=1e-3: 718s, 3 outer is enough, objective 1.138e5, sparsity 4.6%, 74 (+/- 4.5)
• rtol=1e-5: 4379s, 9 outer is enough, objective 1.016e5, sparsity 1.6%, 74 (+/- 4.3)
• rtol=1e-7: 9817s, 15 outer is not enough, objective 1.005e5, sparsity 1.4%, 75 (+/- 4.6)

### Observations

• No need to constrain E. All its values are smaller than 1 because X is normalized. As D is constrained (normalized in practice because of $$\ |Z\|_1$$), it does not change the energy thus Z has the same energy as X. That would lead to the single objective $$\|Z-EX\|_F^2$$ when minimizing for E. Use a second order method like Newton instead of (slow) gradient descent ?
• We take more time to train Z than D while D is what we want to learn in training.
• Variance in cross-validation is high if the number of genres / songs are low.
• Z is indeed much more sparse than X.
• Outer convergence is reached when each inner minimization converges immediately (one step).

### Ideas

• Run KNN only once. Then we can change the weights (in response to changes of $$\sigma$$).
• Normalize X in audio_preprocessing ? Then no scaling in audio_classification nor audio_features.
• Write a script who runs audio_features and several audio_classification with configurable parameters (ld, m, N) to ease testing and hyper-parameter tuning. audio_classification should classify with Xs, Z (and approximate Z).
• Tune $$\lambda_e$$ such that all objectives are in the same order of magnitude ?
• Test the Laplacian by introducing noise in the signal used to construct the graph (moving it off the manifold) and optimize for smoothness using the constructed Laplacian. It should push the points back to the manifold.

### Discussion

• How to have a undirected graph (i.e. symmetric weight matrix) ? Do we even want it ? Yes. We will use $$\max(W, W^T)$$.
• Use a TV regularization instead of the Dirichlet energy. Define the continuous and discrete TV then how to solve it.
• Approximate TV by only taking low frequencies into account. We would do it by filtering out the highest eigenvalues of the Laplacian matrix.

## Meeting 9 (29.05)

Goal: discussion of classification accuracy using the features extracted by our auto-encoder.

Related notebooks: audio_features, audio_classification

### Achieved

• Resolved memory exhaustion.
1. Prevent data copy in the PyUNLocBoX toolbox. The toolbox was internally storing multiple copies of the data when instantiating functions and solvers. Store a reference instead. See git commit for further details. The memory consumption of our application was then divided by 5. No significant impact on speed.
2. Use 32 bits floating point instead of 64 bits. It saves half the memory and increase the speed by 1.6. We are not memory bandwidth limited anymore. Computations almost always fully load two CPU cores.
• Projection in the L2 ball, not on the ball surface. It is a convex constraint. No significant improvement in speed.
• Insert the energy auto-encoder in the pipeline.

### Discussion

• Data normalization: features or samples ?
• Decrease dimensionality for faster experiments.
• Approaches for faster convergence.

### Next

1. Experiment with hyper-parameters.
2. Create the graph.
3. Insert the Dirichlet energy in the objective.

## Meeting 8 (04.05)

Goal: current implementation state (auto-encoder, classification) and preliminary classification results. Discuss music conferences and schedule.

Related notebooks: audio_classification, auto_encoder, audio_features.

### Achieved

• Complete the description of the pre-processing steps in the paper.
• Store data in sklearn format, i.e. $$R^{Nxn}$$ which is the inverse of our math notation. Redo the pre-processing on the entire dataset.
• Implementation of post-processing, i.e. classification: feature aggregation, feature vectors visualization, features rescaling, labels generation, linear SVM, majority voting, random train and test sets splitting, cross-validation.
• Classification results: see the audio_classification notebook for results on accuracy and speed for 2, 4 or 10 genres using spectrograms or raw audio, various classifier implementations and diverse scaling methods.
• Implementation of the auto-encoder as a sklearn Estimator class. Ease the integration with scikit-learn, enable to use it in a sklearn Pipeline to avoid transductive learning. See the auto_encoders notebook for the implementation and comparison_xavier for an usage example.
• Performance boost of a factor 10 by using ATLAS or OpenBLAS instead of numpy’s own BLAS implementation. Half of the execution time is now used for matrix multiplications by numpy.core._dotblas.dot. The limiting factor for speed is now the memory bandwidth. ATLAS performs multi-threaded matrix multiplication.
• Unsupervised feature extraction on a reduced dataset (2 genres, 10 clips) using the new implementation of the auto-encoder. See the audio_features notebook for details.
• Memory exhaustion when working on the full dataset. Three ways to circumvent it:
1. Work on a reduced dataset. For now only 2 genres of 10 clips pass. It is not sufficient for significant accuracy measurements.
2. Store $$X$$ and $$Z$$ on disk via HDF5. As we are already bandwidth limited, introducing SATA and the SSD in the way may further decrease performance.
3. As each column is independent when minimizing for $$Z$$ and each line is independent when minimizing for $$D$$, we can independently work on a subset of the problem in RAM and keep the whole data $$X$$ and $$Z$$ on disk.

### Discussion

• Was not ready in time for ISMIR 2015 (h5-index of 31). We will publish for ICASSP 2016 (h5-index of 47). The deadline is September 25th.
• Mitigation of memory usage: reduce the number of frames per clip.
• Schedule: report deadline June 19th, oral defense mid-July, conference paper September 25th.
• Can Xavier be the expert ? We could then do the defense at the end of June.

### Next

• Classify data after unsupervised feature extraction.
• Observe what the dictionary learned.
• Look for an increase in accuracy.

## Meeting 7 (17.04)

Goal: comparison with Xavier’s implementation of primal-dual, audio pre-processing.

Related notebooks: comparison_xavier, gtzan, audio_preprocessing.

### Achieved

• Prototype on images:
• Dictionary is indeed full. The constraint less or equal instead of less is  however more efficient as it is convex.
• Comparison with Xavier’s implementation and dataset (each patch has a zero mean, one patch per pixel).
• Obtained dictionary is very similar despite very different implementations (algorithms and platform).
• My implementation is slower because of numpy’s implementation of BLAS operations. We will switch to ATLAS or OpenBLAS.
• My Z is sparser despite a larger $$\|Z\|_1$$ objective.
• Why using an under-complete dictionary, i.e. $$m < n$$ ? For speed only.
• More iterations give a better convergence, i.e. sharper atoms.
• Improvements of the description of the optimization schemes and the graph definition.
• Implementation of the audio pre-processing steps: frames and CQT. See the audio_preprocessing notebook for details.
• Convert the GTZAN dataset to an HDF5 store. Load and save all subsequent data in HDF5. See the gtzan notebook for details.
• Many improvements to the PyUNLocBoX. See the repository for details.

### Discussion

• LCN can be viewed as one step of inverting the heat equation. Similar to shock filters [1].

### References

[1] Osher, S., & Rudin, L. I. (1990). Feature-oriented image enhancement using shock filters. SIAM Journal on Numerical Analysis, 27(4), 919-940.

## Meeting 6 (02.04)

Goal: discussion of the first prototype implementation of our auto-encoder and the learned atoms on image patches.

Related notebooks: formulation, image.

### Achieved

• Implementation of a prototype in Python.
• Optimizing for Z last leads to a lower global objective function (two Z terms, only one D).
• The solution does not depend too much on the initialization, even if the global problem is not convex.
• See the formulation notebook for a construction of the algorithm form the base (linear least square fitting) to the version with encoder (no graph yet). Tests on synthetic examples are included.
• See the image notebook for an application to image data along with observations. We try here to recover Gabor filters.
• Results from Xavier: J’ai testé les 2 contraintes sur le dictionary: (1) ||d_j|| = 1 et (2) ||d_j|| <= 1. L’algorithme est 2x plus rapide avec (2) ! Je pense que cela vient simplement du fait que la contrainte (2) est convexe, contrairement a (1).

### Discussion

• $$E$$ seems to approach $$D^T$$ when it is less constrained, i.e. $$m>>n$$. This was due to an implementation error which constrained the L1 norm of the whole matrix instead of independently constraining each column.
• Used atoms norm are oddly not 1 (which should minimize the objective function). Caused by the aforementioned bug.
• Centric atoms seem to appear when the dictionary is more constrained.
• Measure how $$E$$ is similar to $$D^T$$.

### Notes

• FISTA is not guaranteed to be monotone.