Matthew Chung
4 min read · Feb 5, 2021


The Dog Pooping in My Yard Transformer Detectron PyTorch Tutorial

Part 2

The Problem

The problem is that some owners do not pick up their dog's poop. I don't believe this problem is isolated to where I live, in the Bay Area; it's ubiquitous. To combat it, we are going to develop the Dog Pooping in My Yard Transformer Detectron.

This is a 3 part tutorial wherein:

  • Part 1 gets something working using transformers via the Timm library
  • Part 2 unpacks what is going on under the hood with a code-first approach.
  • Part 3 will be putting this into production so we can catch dogs pooping in my yard in the wild — TBD

Here is a link to Google Colab and GitHub.

In this part, we’ll look at the Vision Transformer source code within the Timm library to get a better intuition for what is going on.

But first, let’s look at this illustration by Google.

The Timm VisionTransformer is the main model for this application, and I encourage you to take a look at the source. It is what we are going to use to implement the architecture in the image above. Pretty early in the code, there’s this line for PatchEmbed, which corresponds to the grid of patches in the illustration.

What is this doing? Transformers take a 1D sequence of token embeddings, where every token knows something about every other token.

But what about images? We could take an image and flatten it to 1D, and that might be fine for small images. But since every pixel has to know a little something about every other pixel, the cost grows with the square of the sequence length, and a 224×224 image flattened to pixels is already a sequence of 50,176 tokens. It doesn’t scale.

So instead of that, we break the image into patches and flatten each one, in this case patches of size 16×16. If we look at the math:

width = height = 224
patch_size = 16
(width / patch_size) * (height / patch_size) = 14 * 14 = 196 patches

Also, if we look at the default for embed_dim, it’s 768, which means each of our patches will be projected to a 768-dimensional embedding (conveniently, a 16 × 16 patch with 3 colour channels holds exactly 16 × 16 × 3 = 768 pixel values). Let’s give it a shot.
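
Here’s a minimal sketch of the PatchEmbed idea, not the exact timm code: a Conv2d whose kernel size and stride both equal the patch size slices the image into non-overlapping patches and projects each one straight to embed_dim.

```python
import torch
import torch.nn as nn

# A convolution with kernel_size == stride == patch_size carves the image into
# non-overlapping 16x16 patches and projects each one to a 768-dim embedding.
patch_size, embed_dim = 16, 768
proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

x = torch.randn(1, 3, 224, 224)               # a stand-in image
patches = proj(x)                             # (1, 768, 14, 14)
patches = patches.flatten(2).transpose(1, 2)  # (1, 196, 768): 196 patch tokens
print(patches.shape)                          # torch.Size([1, 196, 768])
```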

The image above has 16 patches whereas ours has 196, but the idea is the same.

Extra Learnable Class Embeddings

Now if you look at the animated gif, there are two extra learnable embeddings that get passed into the Transformer Encoder. The first is pos_embed, the positional embedding (0, 1, 2, …), and the second is the empty pill next to position 0, which is the class token. Looking at the source, we first concatenate the class token with the patch embeddings and then add the position embedding.
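
Continuing with the patches tensor from the snippet above, here’s a rough sketch of that step. The names mirror timm’s cls_token and pos_embed, but this is a simplified stand-in, not the exact source.

```python
# Prepend a learnable class token, then add a learnable position embedding
# to every token in the sequence.
num_patches = 196
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

cls = cls_token.expand(patches.shape[0], -1, -1)  # (1, 1, 768)
tokens = torch.cat((cls, patches), dim=1)         # (1, 197, 768)
tokens = tokens + pos_embed                       # inject positional information
print(tokens.shape)                               # torch.Size([1, 197, 768])
```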

The next part of the animated gif we’re going to focus on is the Transformer Encoder which is this ModuleList.

So we see this Block thing is stacked depth times. But what is Block? Check out the link. You should see something similar to the following.
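
As a rough, simplified sketch of what you’ll find there (dropout, drop path and weight-init details omitted), a Block is pre-norm attention plus a small MLP, each wrapped in a residual connection. Note that it leans on an Attention module and an Mlp module we haven’t written yet.

```python
import torch.nn as nn

class Block(nn.Module):
    # Simplified transformer block: LayerNorm -> Attention -> residual,
    # then LayerNorm -> Mlp -> residual.
    def __init__(self, dim, num_heads, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = Attention(dim, num_heads=num_heads)   # defined further down
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = Mlp(in_features=dim, hidden_features=int(dim * mlp_ratio))  # defined further down

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x
```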

Looking at this code, there are a few things we’ll still need to explore, Attention and Mlp in particular. But let’s just run it and see what happens when we combine it with the tokens from above.

Uh oh.

So let’s explore what Attention is. First, let’s get the overall picture with this image from Peter Bloem, whose tutorial is very good if you want more detail.

From the above, we can see three weight matrices, Wq, Wk and Wv, which produce the queries, keys and values. They are trained parameters, which we'll explore more in the code below, but at a high level:

  1. Wq, Wk and Wv are each multiplied by the incoming embedded vector to form q, k and v.
  2. q and k are multiplied together and passed through a softmax to give the attention weights w.
  3. w is multiplied by v and summed to form the output y.
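
Here’s a bare-bones sketch of those three steps for a single head. I’ve added the usual 1/√d scaling, and the timm version we’ll look at next fuses the three projections into one linear layer and splits them across multiple heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 768
x = torch.randn(1, 197, dim)            # our 197 tokens from earlier
Wq = nn.Linear(dim, dim, bias=False)    # queries
Wk = nn.Linear(dim, dim, bias=False)    # keys
Wv = nn.Linear(dim, dim, bias=False)    # values

q, k, v = Wq(x), Wk(x), Wv(x)                                 # step 1
w = F.softmax(q @ k.transpose(-2, -1) / dim ** 0.5, dim=-1)   # step 2: (1, 197, 197)
y = w @ v                                                     # step 3: (1, 197, 768)
print(y.shape)
```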

Let’s take a look at the Attention source code. If you follow the link, it will take you to something similar to this.

We can see that q, k and v are created from the output of a single linear layer. Later in the forward, we can see q multiplied by k, followed by a softmax, and then finally multiplied by v.
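
To make the Block above runnable, here is a simplified version modeled on that source, with the attention and projection dropout omitted.

```python
import torch.nn as nn

class Attention(nn.Module):
    # Simplified multi-head self-attention: one fused linear layer produces
    # q, k and v, which are then split across num_heads.
    def __init__(self, dim, num_heads=12):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)          # each (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(x)
```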

Uh oh.

Mlp head

Luckily, this one is more straightforward. If you look at the animated gif, the last step is an Mlp head, which is just a couple of linear layers with an activation and dropout. Let’s define it below and try again.
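
A simplified version, modeled on timm’s Mlp (two linear layers with a GELU and dropout), plus a quick check that the Block from earlier now runs on our tokens.

```python
import torch.nn as nn

class Mlp(nn.Module):
    # Two linear layers with a GELU activation and dropout.
    def __init__(self, in_features, hidden_features, drop=0.0):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_features, in_features)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x = self.drop(self.act(self.fc1(x)))
        x = self.drop(self.fc2(x))
        return x

# With Attention and Mlp defined, the Block from earlier finally runs.
block = Block(dim=768, num_heads=12)
out = block(tokens)                      # tokens: (1, 197, 768) from earlier
print(out.shape)                         # torch.Size([1, 197, 768])
```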

And that’s it. In our training loop, we do this calculation many, many times, but this should give you enough to dive into the source code and explore it yourself.
