An Intuitive Explanation of GraphSAGE

By Rıza Özçelik

Original post: https://towardsdatascience.com/an-intuitive-explanation-of-graphsage-6df9437ee64f

DeepWalk is a transductive algorithm, meaning that it needs the whole graph to be available to learn a node's embedding. Thus, when a new node is added to the graph, DeepWalk has to be rerun to generate an embedding for the newcomer.

In this story, we introduce GraphSAGE [1], a representation learning technique suitable for dynamic graphs. GraphSAGE can predict the embedding of a new node without requiring a re-training procedure. To do so, GraphSAGE learns aggregator functions that can induce the embedding of a new node given its features and neighborhood. This is called inductive learning.

We can divide GraphSAGE into three main parts: context construction, information aggregation, and the loss function. Below we describe each part separately.

Context Construction

Similar to word2vec and DeepWalk, GraphSAGE also has a context-based similarity assumption.

GraphSAGE assumes that nodes that reside in the same neighborhood should have similar embeddings.

Similar to DeepWalk, the definition of the context is parametric. The algorithm has a parameter K that controls the neighborhood depth. If K is 1, only the adjacent nodes are treated as similar. If K is 2, nodes at distance 2 are considered part of the same neighborhood as well.

Remark that having K = 2 means nodes at distance 4 can affect each other's embeddings through the node in the middle. Therefore, increasing the neighborhood depth can introduce undesired information sharing between nodes; choosing K too large can even cause all nodes to end up with the same embedding!

Neighborhood exploration and information sharing in GraphSAGE. [1]
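
As an illustration of the neighborhood-depth parameter, here is a minimal Python sketch that collects the K-hop neighborhood of a node with breadth-first search (the adjacency-dictionary representation and the function name are hypothetical, not from [1]):

```python
from collections import deque

def k_hop_neighborhood(adj, node, k):
    """Return every node within distance k of `node` (excluding the node itself)."""
    visited = {node}
    frontier = deque([(node, 0)])
    neighborhood = set()
    while frontier:
        current, depth = frontier.popleft()
        if depth == k:
            continue
        for neighbor in adj[current]:
            if neighbor not in visited:
                visited.add(neighbor)
                neighborhood.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return neighborhood

# Toy chain graph 0-1-2-3-4
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(k_hop_neighborhood(adj, 2, 1))  # {1, 3}
print(k_hop_neighborhood(adj, 2, 2))  # {0, 1, 3, 4}
```

Keep in mind that GraphSAGE samples a fixed number of neighbors at each depth rather than taking the full neighborhood; the sketch only illustrates how K widens the context.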

Information Aggregation

Having defined the neighborhood, we now need an information-sharing procedure between neighbors. Aggregation functions, or aggregators, accept a neighborhood as input and combine each neighbor's embedding with weights to create a neighborhood embedding. In other words, they aggregate information from the node's neighborhood. Aggregator weights are either learned or fixed, depending on the function.

To learn embeddings with aggregators, we first initialize the embedding of every node to its node features. Then, for each neighborhood depth up to K, we create a neighborhood embedding with the aggregator function for each node and concatenate it with the node's existing embedding. We pass the concatenated vector through a neural network layer to update the node embedding. Once every node is processed, we normalize the embeddings to unit norm. The pseudocode can be found below.

Pseudocode of GraphSAGE algorithm. [1]
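
Since the pseudocode is only available as a figure, here is a minimal NumPy sketch of the loop described above, using a mean aggregator; the function and variable names are hypothetical, and the weight matrices W stand in for the learned layer parameters in [1]:

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def graphsage_embeddings(adj, features, W, K):
    """adj: {node: [neighbors]}, features: {node: 1-D feature array},
    W: list of K weight matrices, one per depth."""
    # Step 1: initialize every embedding with the node's own features.
    h = {v: features[v].copy() for v in adj}
    for k in range(K):
        h_next = {}
        for v in adj:
            # Step 2: aggregate the neighbors' current embeddings (mean aggregator).
            neighborhood_emb = np.mean([h[u] for u in adj[v]], axis=0)
            # Step 3: concatenate the node's embedding with the neighborhood embedding,
            # pass it through a dense layer with ReLU, and normalize the result.
            concat = np.concatenate([h[v], neighborhood_emb])
            h_next[v] = normalize(np.maximum(W[k] @ concat, 0.0))
        h = h_next
    return h

# Toy example: a 3-node triangle with 2-dimensional features and K = 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
features = {v: np.random.randn(2) for v in adj}
W = [np.random.randn(2, 4) for _ in range(2)]  # output dim 2, input dim 2 + 2
embeddings = graphsage_embeddings(adj, features, W, K=2)
```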

The advantage of learning aggregator functions to generate node embeddings, instead of learning the embeddings themselves, is inductivity.

When the aggregator weights are learned, the embedding of an unseen node can be generated from its features and neighborhood.

As a result, aggregators remove the need for re-training when new nodes are introduced to the graph. Note that the arrival of new nodes is quite common in social networks, the web, citation networks, and so on.

Loss Function

Until now, we have described a procedure to generate node embeddings. Yet, to learn the weights of the aggregators and the embeddings, we need a differentiable loss function. Based on our intuition, we want neighboring nodes to have similar embeddings and unrelated nodes to have distant embedding vectors. The function below satisfies these two conditions with two terms.

Loss function of GraphSAGE. [1]
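
Reconstructed in LaTeX from the formulation in [1], with $\mathbf{z}_u$ denoting the embedding of node $u$:

```latex
J_{\mathcal{G}}(\mathbf{z}_u) =
  -\log\big(\sigma(\mathbf{z}_u^{\top}\mathbf{z}_v)\big)
  - Q \cdot \mathbb{E}_{v_n \sim P_n(v)} \log\big(\sigma(-\mathbf{z}_u^{\top}\mathbf{z}_{v_n})\big)
```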

Here, u and v are two neighbors, and the loss is computed for u. The first term promotes maximizing the similarity of the embeddings of u and v, as we desired. In the second term, Q is the number of negative samples and v_n is a negative sample drawn from the negative-sampling distribution P_n(v). A negative sample in this context is a non-neighbor node. This term tries to push the embeddings of u and its negative samples apart. Lastly, σ denotes the sigmoid function as usual.

Remark that this is an unsupervised loss function that can be minimized with no labels. To use GraphSAGE in a supervised context, we have two options: we can either learn node embeddings as a first step and then learn the mapping between embeddings and labels, or we can add a supervised loss term to the loss function and adopt an end-to-end learning procedure. This flexibility is valuable. A sketch of the second option follows.
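
One way to sketch the end-to-end option is to add a cross-entropy term to the unsupervised loss; the classifier logits, the weighting factor `alpha`, and the tensor shapes below are hypothetical, not prescribed by [1]:

```python
import torch
import torch.nn.functional as F

def combined_loss(z_u, z_v, z_neg, logits, labels, alpha=1.0):
    """Unsupervised GraphSAGE-style loss plus a supervised cross-entropy term.

    z_u, z_v: embeddings of a node and one of its neighbors (1-D tensors).
    z_neg:    embeddings of the Q negative samples, shape (Q, dim).
    logits:   classifier outputs for labeled nodes; labels: their targets.
    """
    pos = -F.logsigmoid(torch.dot(z_u, z_v))    # pull neighbors together
    neg = -F.logsigmoid(-(z_neg @ z_u)).sum()   # push the Q negatives away
    supervised = F.cross_entropy(logits, labels)
    return pos + neg + alpha * supervised
```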

More on Aggregators

GraphSAGE owes its inductivity to its aggregator functions. We can define various aggregators that are either parametric or non-parametric. As a non-parametric function, we can use simple averaging: we average the embeddings of all nodes in the neighborhood to construct the neighborhood embedding.

A parametric function could be an LSTM cell. Yet, LSTM cells are designed for sequential inputs and have memory. Hence, the order in which the neighbors are fed to the LSTM affects the neighborhood embedding, even though the neighbors have no natural order. To alleviate this, random permutations of the neighbors can be fed to the LSTM. The parameters of the LSTM are learned while minimizing the loss function.
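
A sketch of such an LSTM aggregator with a random neighbor permutation, using PyTorch; the layer sizes and class name are arbitrary and hypothetical:

```python
import torch
import torch.nn as nn

class LSTMAggregator(nn.Module):
    def __init__(self, embed_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, neighbor_embeddings):
        """neighbor_embeddings: tensor of shape (num_neighbors, embed_dim)."""
        # Shuffle the neighbors so the arbitrary input order is not memorized.
        perm = torch.randperm(neighbor_embeddings.shape[0])
        shuffled = neighbor_embeddings[perm].unsqueeze(0)  # add a batch dimension
        _, (h_n, _) = self.lstm(shuffled)
        return h_n.squeeze(0).squeeze(0)  # final hidden state = neighborhood embedding

agg = LSTMAggregator(embed_dim=2, hidden_dim=4)
print(agg(torch.randn(5, 2)).shape)  # torch.Size([4])
```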

Another learnable aggregator is a single-layer neural network followed by a max-pooling operator. In this case, we pass each neighbor's embedding through a non-linear layer and apply an element-wise max operation to the outputs. In the paper, this aggregator is shown to be the most promising one in the experiments.
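
A sketch of this pooling aggregator: each neighbor embedding goes through a small dense layer with a non-linearity, followed by an element-wise max over the neighbors (NumPy; the weight shapes are arbitrary):

```python
import numpy as np

def pooling_aggregator(neighbor_embeddings, W, b):
    """neighbor_embeddings: (num_neighbors, embed_dim), W: (embed_dim, hidden_dim)."""
    hidden = np.maximum(neighbor_embeddings @ W + b, 0.0)  # per-neighbor dense layer + ReLU
    return hidden.max(axis=0)                              # element-wise max over neighbors

rng = np.random.default_rng(0)
neighbors = rng.normal(size=(5, 2))
W, b = rng.normal(size=(2, 4)), np.zeros(4)
print(pooling_aggregator(neighbors, W, b).shape)  # (4,)
```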

Though more complex aggregators can be designed, simplicity is desired since aggregators affect training time drastically. An ideal aggregator should be simple, learnable and symmetric. In other words, it should learn how to aggregate neighbor embeddings and be indifferent to neighbor order, while not creating a huge training overhead.

Conclusion

GraphSAGE is an inductive representation learning algorithm that is especially useful for graphs that grow over time. It is much faster to create embeddings for new nodes with GraphSAGE than with transductive techniques. Additionally, GraphSAGE does not trade performance for speed: it was tested on three different datasets covering node classification, node clustering, and cross-graph generalization, and it outperformed the existing solutions.

Nowadays, there are extensions of GraphSAGE to heterogeneous networks, as well as newer inductive approaches. Yet, GraphSAGE played a pioneering and influential role in inductive graph representation learning.

References

[1] Hamilton, Will, Zhitao Ying, and Jure Leskovec. “Inductive representation learning on large graphs.” Advances in Neural Information Processing Systems. 2017.
