Discussed April 15th, presented by Nauman, notes by Michaël

### Summary

Goal: embed a graph in an Euclidean domain to use the structure as additional features (as the eigenvectors would give, but computationally efficient, parallel / local and allowing online update)

Basic idea: apply word2vec [2] to non-NLP data.

Insights:

* Language modeling: a sentence can be seen as a random walk on an unobserved vocabulary graph.

* Graph embedding: a random walk is like a sentence of an imaginary language.

Notation: node v in V, embedding phi(v) in R^d

Traditional language model: Pr[v_i | phi(v_1), …, phi(v_i-1)]

i.e. predict the next word v_i given some preceding words

word2vec / SkipGram relaxation [2]: mult_j Pr[v_j | phi(v_i)] (1)

i.e. predict the context (all j in a window) given the embedding of v_i

Random walks are used to extract those contexts from the graph structure.

Goal of Hierarchical Softmax [3]: reduce the O(`|V|`

) cost of (1), because you need to compare phi(v_i) to all the other embeddings phi(v), i.e. a `|V|`

-class classifier, by approximating it as a serie of O(log `|V|`

) binary classifiers. It is constructed as a binary tree where each node b is also embedded as psi(b) in R^d. Contrary to language models, this tree is easy to build and its structure does not change during learning because we know the graph. It is simply a hierarchical coarsening of the graph: pairs of nodes are grouped hierarchically. So first layer has `|V|`

nodes, second as `|V|`

/2, third has `|V|`

/4 etc. There are log `|V|`

such layers.

Learning:

1) a vertex v is selected at random

2) its context is extracted by a windowed random walk

3) SGD with back-propagation in the tree is used to update the graph embedding phi(v) and the representations psi(b) of the traversed nodes b.

In the end you learned a multi-scale embedding of the graph ! 🙂

### References

[1] Perozzi, B., Al-Rfou, R., and Skiena, S. (2014). DeepWalk: Online Learning of Social Representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (New York, NY, USA: ACM), pp. 701–710.

[2] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., and Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, eds. (Curran Associates, Inc.), pp. 3111–3119.

[3] Mnih, A., and Hinton, G.E. (2009). A Scalable Hierarchical Distributed Language Model. In Advances in Neural Information Processing Systems 21, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, eds. (Curran Associates, Inc.), pp. 1081–1088.