Transcript:

In this video, I’m going to talk about graph Convolutional networks and how they’re an extension to traditional neural networks that take full advantage of graph structure. When I first got into this topic, I found some of the papers and diagrams, a little confusing, but it’s actually pretty simple. So if you stick it on to the end. I’m confident you’ll be able to understand how GCN’s work without having to spend the hours reading the literature so first. I’ll talk about how Gcn’s work in general, and then I’ll present a content abuse example to really work through it step-by-step. So if you like this type of content or you’d like to hear about the live stream journal clubs, we do please subscribe at my website or join us on discord and only to both of those will be in the description below. The idea of message passing in a graph is a really powerful concept because a lot of the graph algorithms can be understood from that perspective, in a nutshell. The idea is that a node in the graph can send and receive messages along its connections with its neighbors. This can be thought of as happening in two steps. First nodes will send out a message about itself to its neighbors and next nodes, collect the neighbor messages they receive and use them in some way to update itself and understand its environment in a fraud use case you might see how known fraud labels can propagate through the graph by continually passing and collecting messages. This is the essence of the label. Propagation algorithm. Here’s a little schematic showing how this works conceptually in a graph that connects accounts with their shared attributes like IP address and credit card numbers in label propagation. Each node in the graph starts with an initial state, and that state is updated by receiving messages from the other nodes that it’s connected to, and this repeats in the next iteration, except each node now starts with its updated state, which takes its neighbors into consideration, think of this as smoothing the label information across a neighborhood for a deep dive on how this works check out the first video in this series, which is linked in the description below. Gcn’s can be understood as a simple message passing algorithm, but whereas label propagation just passes messages about the label value Gcns do this for an entire vector of input data. So an account doesn’t. Just tell us neighbors. I’m known fraud or not, but my credit card is from the US and my IP address is from Vietnam and my billing address is in Ireland and whatever else you know about the account If flayed propagation is label smoothing in a neighborhood think of GCN as feature smoothing, Let’s make this concrete by talking through a simple calculation and a GC and layer for any node in the graph first get all of the attribute vectors of its connected nodes and apply some aggregation function, like an average, not only does aggregation make sure that the representation is the same size, regardless of the number of neighbors, but it makes some intuitive sense that a node might be represented by the average of its neighbors next pass this average vector through a dense neural network layer, which is a fancy way of saying, multiply it by some matrix and apply an activation function. The output of this dense layer is the new vector representation of the node. So now a node isn’t just an average of its neighbors, but the average of its neighbors passed through some nonlinear function. When talking about how a single. GC and layer works throughout this video. I’ll keep coming back to this. Refrain first aggregate. The neighbors then pass that to a standard neural net. If you can keep this clear. Gcn’s will be a lot easier to understand. Let’s go to the next step in complexity and say you have a two layer. Gc N First. Let me point out that we have so far focused on how a single node is updated, but this process must be done for every node in the graph. Each node collects messages from its neighbor’s aggregates those messages and passes the resulting vector through a standard neural net to get a new vector that represents the note. The note is then represented by this new vector for a second layer. All you do is repeat the same process, except the input is the updated vectors from the first layer. This is similar in concept to a traditional, fully connected neural network. The raw input goes into the first layer in the output of the first layers, the input of the second layer and so on the same is true with GCNs, except there’s a pre-processing step at the beginning of each layer, where a node has to first get the values of all its neighbors and aggregate them. Okay, so you can aggregate neighbors and pass them through a dense neural network layer. So what, well, this opens up? Some interesting learning opportunities consider that the size of the node vectors coming out of the gcn layer is determined by the number of units in your neural network layer or in other words, the number of columns in your weight matrix. If we want to classify each node like fraud or not fraud, we can set this output dimensionality to 1 and use a typical binary cross entropy loss function with back propagation to train all of our parameters. Let’s talk through a simple content abuse example using a single layer. Gcn, let’s say we have a graph where each node is a tweet and then two tweets are connected if they were posted, using either the same IP address or the same account and further. Let’s say we want to use the actual content of the tweet as a node attribute. So maybe we hash each of the words and count them up, or we do something fancier like using pre-trained word embeddings, but regardless of technique, let’s say we end up with a 10 dimensional vector representing the content of each tweet. If you just wanted to classify each piece of content individually, you could just pass these vectors into a logistic regression model or whatever classifier you wanted to use. But this would not take advantage of the IP or account ID information. Alternatively, you could use a label propagation on the graph structure, but this wouldn’t take advantage of the content of the tweet. The GCN uses both and works as follows for a given node first aggregate. Its content vectors with all of those that were posted using the same. IP address or account ID. That aggregation function might be some are average and here let’s just use average for a single letter GC and model. This neighborhood. Average vector is multiplied by a 10 by 1 weight matrix and then passed through an activation function to result in a single number for each node that corresponds to the probability of that tweet being abusive content. This is effectively the same as using the neighborhood, averaging as a pre-processing step, where the output is then used as input to a logistic regression model First aggregate. The neighbors didn’t pass to a standard neural net. Now let’s consider a two letter. G Sienna. For the first step, we still average all the neighborhood content vectors, but this time, instead of multiplying this by a ten by one weight matrix to output, a single fraud prediction number from the first layer was instead use a ten by five weight matrix so that we output a five dimensional vector. Now every node in the graph is not represented by the original 10 dimensional content vectors, but instead by a five dimensional vector, that’s output from the first GC in layer. This is the input to our second GC and layer here we once again gather all the neighbors vectors and average them together for every node. This results in a single neighborhood aggregated five dimensional vector for each node, which we can then multiply by a five by one weight matrix to get our final fraud predictions from the second layer. So with two GC n layers, each node is pulled their neighbors vectors two times, one for each layer. In both cases, they aggregate the vectors they get and pass the result into a dense neural network layer. The important part is that for a node binary classification tasks, the last layer should have an output dimension of one and the internal hidden layers can be whatever size that you want. This is no different than any other neural network being used for binary classification. You just have these additional computation steps of aggregation based on the graph neighborhoods in between the layers, it’s also worth, emphasizing that the number of GC and layers puts an upper limit on how far a signal can travel, for instance with a two layer. Gc n. The message passing operation happens twice. This means the signal coming from any particular message can travel a maximum of two hops away from the source node. If Long-range connections are important to your problem, you’ll need more GC and layers, but the story doesn’t stop here. GC ends just starting to scratch the surface. For instance, what if your graph has connections of different types? Maybe some accounts are connected because they share an IP address. Maybe others are connected because they share a credit card number. Do we just average all of the neighbors together, regardless of the connection type? What if some connections are more important than others? Well, this is where relational. Gcn’s come in. And that’ll be the topic of our next video. So see you then.