- the graph is given from outside, not constructed from the standard data matrix. It encodes what they call "context", which depends on the application; e.g. the context of a node is its neighboring nodes.
- it is a kind of simple feed-forward neural network that passes both the input data and the embeddings (to be learned) through a non-linearity (ReLU, max(0, x)), and finally through a softmax to predict the labels. They used only one hidden layer.
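A minimal sketch of that architecture, assuming (hypothetically) that the feature and embedding hidden layers are concatenated before the softmax; all shapes, names, and the random weights are illustrative, not the paper's actual parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0, x)

def softmax(z):
    # subtract the max for numerical stability
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes: d_x input features, d_e embedding dim, h hidden units, k classes.
d_x, d_e, h, k = 10, 8, 16, 3

# One hidden layer each for the features and the embedding (untrained placeholders).
W_x = rng.normal(size=(d_x, h))
W_e = rng.normal(size=(d_e, h))
W_out = rng.normal(size=(2 * h, k))

def predict(x, e):
    """Class probabilities from input features x and a learned embedding e."""
    hx = relu(x @ W_x)   # hidden representation of the input features
    he = relu(e @ W_e)   # hidden representation of the embedding
    z = np.concatenate([hx, he], axis=-1) @ W_out
    return softmax(z)

p = predict(rng.normal(size=d_x), rng.normal(size=d_e))
```

With trained weights, `p` would be the predicted label distribution for one node.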
- The graph is used only for learning the embeddings, but the embeddings are used both for graph context prediction AND for class label prediction (so both loss functions contribute to learning the embeddings).
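The joint objective can be sketched as a sum of the two losses; the additive form and the trade-off weight `lam` are my assumptions, not taken from the paper:

```python
import numpy as np

def cross_entropy(p, y):
    """Negative log-likelihood of the true index y under probabilities p."""
    return -np.log(p[y])

def joint_loss(p_label, y, p_context, c, lam=1.0):
    # lam is a hypothetical trade-off weight between the supervised
    # label loss and the unsupervised graph-context loss; gradients of
    # both terms flow into the shared embeddings.
    return cross_entropy(p_label, y) + lam * cross_entropy(p_context, c)

loss = joint_loss(np.array([0.7, 0.2, 0.1]), 0,   # label prediction, true class 0
                  np.array([0.5, 0.5]), 1)        # context prediction, true context 1
```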
- For input queries that do not correspond to nodes of the graph, the embedding can be computed as the output of a hidden layer fed with the input query.
- to train using the graph, they uniformly sample a short random walk and then pick pairs of nodes from it that are at most d hops apart. Enumerating all possible pairs would be too expensive, and this sampling seems more efficient for (block) stochastic gradient descent.
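The sampling step above can be sketched as follows; this is a minimal illustration assuming an adjacency-list graph, with function names and parameters of my own choosing (the paper's actual scheme, e.g. negative sampling, is not reproduced here):

```python
import random

def sample_context_pairs(adj, walk_len=10, d=3, n_pairs=5, rng=random):
    """Sample (node, context) pairs from one short random walk.

    adj: dict mapping each node to a non-empty list of neighbors.
    Pairs are walk positions at most d hops apart, a cheap stand-in
    for enumerating all node pairs in the graph.
    """
    start = rng.choice(list(adj))          # uniform starting node
    walk = [start]
    for _ in range(walk_len - 1):
        walk.append(rng.choice(adj[walk[-1]]))
    pairs = []
    for _ in range(n_pairs):
        i = rng.randrange(len(walk))
        # pick a second position within d steps of i on the walk
        j = rng.randrange(max(0, i - d), min(len(walk), i + d + 1))
        pairs.append((walk[i], walk[j]))
    return pairs

# Tiny example graph: a 4-cycle.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
pairs = sample_context_pairs(adj, walk_len=8, d=2, n_pairs=4)
```

Each returned pair would then feed the graph-context loss as a positive example.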
Explanations of word2vec: