Goal: embed a graph in an Euclidean domain to use the structure as additional features (as the eigenvectors would give, but computationally efficient, parallel / local and allowing online update)
Basic idea: apply word2vec [2] to non-NLP data.
Insights:
* Language modeling: a sentence can be seen as a random walk on an unobserved vocabulary graph.
* Graph embedding: a random walk is like a sentence of an imaginary language.
Notation: node v in V, embedding phi(v) in R^d
Traditional language model: Pr[v_i | phi(v_1), …, phi(v_i-1)]
i.e. predict the next word v_i given some preceding words
word2vec / SkipGram relaxation [2]: mult_j Pr[v_j | phi(v_i)] (1)
i.e. predict the context (all j in a window) given the embedding of v_i
Random walks are used to extract those contexts from the graph structure.
Goal of Hierarchical Softmax [3]: reduce the O(|V|
) cost of (1), because you need to compare phi(v_i) to all the other embeddings phi(v), i.e. a |V|
-class classifier, by approximating it as a serie of O(log |V|
) binary classifiers. It is constructed as a binary tree where each node b is also embedded as psi(b) in R^d. Contrary to language models, this tree is easy to build and its structure does not change during learning because we know the graph. It is simply a hierarchical coarsening of the graph: pairs of nodes are grouped hierarchically. So first layer has |V|
nodes, second as |V|
/2, third has |V|
/4 etc. There are log |V|
such layers.
Learning:
1) a vertex v is selected at random
2) its context is extracted by a windowed random walk
3) SGD with back-propagation in the tree is used to update the graph embedding phi(v) and the representations psi(b) of the traversed nodes b.
In the end you learned a multi-scale embedding of the graph !
[1] Perozzi, B., Al-Rfou, R., and Skiena, S. (2014). DeepWalk: Online Learning of Social Representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (New York, NY, USA: ACM), pp. 701–710.
[2] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., and Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, eds. (Curran Associates, Inc.), pp. 3111–3119.
[3] Mnih, A., and Hinton, G.E. (2009). A Scalable Hierarchical Distributed Language Model. In Advances in Neural Information Processing Systems 21, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, eds. (Curran Associates, Inc.), pp. 1081–1088.
]]>Explanations of word2vec:
See the experiments.
]]>Related notebooks: audio_features, audio_classification
Related notebooks: audio_classification, auto_encoder, audio_features.
Related notebooks: comparison_xavier, gtzan, audio_preprocessing.
[1] Osher, S., & Rudin, L. I. (1990). Feature-oriented image enhancement using shock filters. SIAM Journal on Numerical Analysis, 27(4), 919-940.
]]>Related notebooks: formulation, image.