## DeepWalk: Online Learning of Social Representations

Discussed April 15th, presented by Nauman, notes by Michaël

### Summary

Goal: embed a graph in an Euclidean domain to use the structure as additional features (as the eigenvectors would give, but computationally efficient, parallel / local and allowing online update)

Basic idea: apply word2vec [2] to non-NLP data.
Insights:
* Language modeling: a sentence can be seen as a random walk on an unobserved vocabulary graph.
* Graph embedding: a random walk is like a sentence of an imaginary language.

Notation: node v in V, embedding phi(v) in R^d

Traditional language model: Pr[v_i | phi(v_1), …, phi(v_i-1)]
i.e. predict the next word v_i given some preceding words

word2vec / SkipGram relaxation [2]: mult_j Pr[v_j | phi(v_i)] (1)
i.e. predict the context (all j in a window) given the embedding of v_i

Random walks are used to extract those contexts from the graph structure.

Goal of Hierarchical Softmax [3]: reduce the O(`|V|`) cost of (1), because you need to compare phi(v_i) to all the other embeddings phi(v), i.e. a `|V|`-class classifier, by approximating it as a serie of O(log `|V|`) binary classifiers. It is constructed as a binary tree where each node b is also embedded as psi(b) in R^d. Contrary to language models, this tree is easy to build and its structure does not change during learning because we know the graph. It is simply a hierarchical coarsening of the graph: pairs of nodes are grouped hierarchically. So first layer has `|V|` nodes, second as `|V|`/2, third has `|V|`/4 etc. There are log `|V|` such layers.

Learning:
1) a vertex v is selected at random
2) its context is extracted by a windowed random walk
3) SGD with back-propagation in the tree is used to update the graph embedding phi(v) and the representations psi(b) of the traversed nodes b.

In the end you learned a multi-scale embedding of the graph ! 🙂

### References

[1] Perozzi, B., Al-Rfou, R., and Skiena, S. (2014). DeepWalk: Online Learning of Social Representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (New York, NY, USA: ACM), pp. 701–710.

[2] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., and Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, eds. (Curran Associates, Inc.), pp. 3111–3119.

[3] Mnih, A., and Hinton, G.E. (2009). A Scalable Hierarchical Distributed Language Model. In Advances in Neural Information Processing Systems 21, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, eds. (Curran Associates, Inc.), pp. 1081–1088.

## Revisiting Semi-Supervised Learning with Graph Embeddings

Discussed April 1st, presented by Vassilis, notes by Vassilis

### Summary

The paper we reviewed today is not as close to our stuff as we thought initially… It is actually quite
complicated and it builds a lot on other people’s works that are already complex. Some interesting
points:
1. the graph is given from outside, not created from the standard data matrix. It encodes what they call “context”, that depends on the application. I.e. the context of a node is its neighboring nodes.
2. it is a kind of simple neural network that passes both the input data and embeddings (to be learned) through a non-linear function (max(0, x)), and finally from a softmax to predict the labels. They only used one hidden layer.
3. The graph is only used in learning the embeddings, but the embeddings are used both for graph context prediction AND for class label prediction (therefore both loss functions are used for learning the embeddings).
4. For input queries that don’t reside on nodes of the graph, the embeddings can be computed as the outputs of a hidden layer fed with the input query.
5. to train using the graph they uniformly sample a short random walk from the graph, then from this they pick two nodes with a maximum distance of d hops. This is done because getting all possible pairs is too expensive and this sampling seems more efficient for (block) stochastic gradient descent.

### Notes

Explanations of word2vec:

### References

[1] Yang, Z., Cohen, W., and Salakhutdinov, R. (2016). Revisiting Semi-Supervised Learning with Graph Embeddings. arXiv:1603.08861 [Cs].