Transcript:

In 2005, the game company id Software open sourced the engine for their video game Quake 3 Arena. In that source code, fans of the game discovered an algorithm that was so ingenious it quickly became famous, and the only thing this algorithm does is calculate the inverse of a square root. If I had to write a piece of code that calculates the inverse of a square root, this is how I would do it. Here I'm using the C programming language, the same programming language used for Quake 3. But to be fair, I wouldn't actually write the square root myself. People who work more closely with the C language have already figured out how to calculate the square root and provided an algorithm to us in the math.h header file, which we programmers can then just include in our programs.

So what could possibly be so interesting about the Quake 3 algorithm? How does id Software calculate inverse square roots? At first glance, it doesn't seem to make any sense. Where does this number 0x5f3759df come from? What does it have to do with taking square roots? And why is there a disgusting curse word in the second comment? In this video, I will show you how, with some cool bit manipulation, you can take inverse square roots, and the algorithm that does this bears the name "the fast inverse square root".

First of all, why would the game engine want to calculate 1 divided by the square root of x? If you want to implement physics, lighting, or reflections in your game engine, it helps if the vectors you're calculating with are normalized to have length 1, because otherwise your vectors might be too short or too long, and when you do physics with them, things can go wrong. As all of you know, the length of a vector is the square root of x squared plus y squared plus z squared. In case you don't recognize it, I claim you've seen this before for two dimensions.
This formula is just Pythagoras' theorem. So if we want to normalize the vector's length to one, we have to scale everything down by the length of the vector. I mean, obviously: if we divide the length of the vector by the length of the vector, we get one. So all that's left to do is to divide x, y, and z by the length, or equivalently, multiply them by one divided by the length.

You might already see where this is going. Calculating x squared plus y squared plus z squared is easy and, more importantly, really fast. In code you would implement it as x times x plus y times y plus z times z, and all that is is just three multiplications summed up. Additions and multiplications are common operations that have been designed to be very fast. The square root, on the other hand, is a terribly slow operation, and division is not much better. This is not good if we have several thousands of surfaces, each with a vector that needs to be normalized. But it also means that here we have an opportunity for speed improvements: if we can find even just an approximation of 1 divided by the square root of x, then as long as it's fast, we can save precious time. The fast inverse square root is such an approximation, with an error of at most one percent while being about three times as fast.

Looking at the code again, we can see that the beginning is pretty harmless. We are given a number called number as input, the number we're supposed to take the inverse square root of. First, with the variable i, we declare a 32-bit integer. Then we declare two 32-bit decimal numbers, x2 and y, and then we store 1.5 into the variable with the obvious name threehalfs. The next two lines simply copy half of the input into x2 and the whole input into y. But it's after that where the magic happens. Take a moment to look at it again. The longer you look at it,
the less it makes sense, and the comments on the right are not really helpful either, but they do hint that there are three steps to this algorithm. Puzzling together these three steps will show us the brilliance of this algorithm. But before we start with them, let's first take a look at binary numbers.

We said that in the first line we declare a 32-bit integer, in the C programming language called a long. That means we're given 32 bits and have to represent a number with them. I think you all know how to do that: this is 1, 2, 3, 4, and so on, up to around two billion. But in the next line we declare two decimal numbers, in C called floats. Again we're given 32 bits, and we have to represent a decimal number with them. How would you do that? If you and I were designing decimal numbers, this is probably one way we would do it: just put a decimal point in the middle. In front of the decimal point we count in the usual way, 1, 2, 3, 4, and so on, and after the decimal point there are no surprises either. Just remind yourself that this is binary, so instead of tenths, hundredths, and thousandths, we have halves, fourths, eighths, sixteenths, and any combination of them; for example, a half and a fourth give you three fourths, also known as 0.75.

But this idea is actually terrible. We've decimated the range of numbers we can represent: before, we could represent numbers up to around 2 billion; now only up to about 32,000. Luckily, people much smarter than us have found a better way to make use of those 32 bits. They took inspiration from scientific notation. The same way we can systematically represent numbers like 23,000 as 2.3 times 10 to the 4, and 0.0034 as 3.4 times 10 to the minus 3, we can also represent numbers in a binary scientific notation, where, for example, 11000 could be written as 1.1 times 2 to the 4. The standard they came up with takes the name IEEE 754. IEEE Standard 754 defines the following. We are, as usual, given 32 bits. The first bit is the sign bit.
If it is 0, the number is positive; if it is 1, the number is negative. But the numbers Quake 3 feeds into the fast inverse square root are always positive. I mean, obviously they're positive: if we had to calculate 1 divided by the square root of -5, something definitely has gone wrong. So for the rest of this video we ignore the sign bit, as it is always 0.

Then the next 8 bits define the exponent: that means 2 to the 1, 2 to the 2, 2 to the 3, 2 to the 4, and so on. With 8 bits we can represent numbers between 0 and 255, but that's not exactly what we need: we also want negative exponents. This is why everything is actually shifted down by 127. So instead of 2 to the 4, we actually have 2 to the 4 minus 127. If we actually want the exponent to be 4, the bits need to be set to 131, because 131 minus 127 is 4.

The last 23 bits are the mantissa. As usual in scientific notation, we want to denote one digit, followed by the decimal point, followed by the decimal places. But with 23 bits we can represent numbers from 0 up to, but not including, 2 to the 23. Again, that's not exactly what we need: for scientific notation we need the mantissa to go from 1 to 10, or in binary scientific notation, from 1 to 2. So we could do something we've already done before: put a decimal point after the first bit. This automatically gives us numbers from one to two, but this naive approach is wasteful. You see, the people that designed Standard 754 realized that in binary, something happens that happens in no other base. Look at the first digit: in scientific notation, the first digit is by definition always non-zero, but in binary there is only one digit that is not zero: one. And if we know that the first digit will always be a one, there is no need to store it. Thus we can save one bit by moving the decimal point one digit to the left and fixing an extra one into the number it represents.
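The layout just described can be inspected directly. A minimal sketch, where `float_fields` is an illustrative helper (not from the Quake source) and `memcpy` is used to read the float's raw bits:

```c
#include <stdint.h>
#include <string.h>

/* Split a 32-bit float into its three IEEE 754 fields:
   1 sign bit, 8 exponent bits (biased by 127), 23 mantissa bits
   (with the leading 1 left implicit, as described above). */
void float_fields(float f, uint32_t *sign, uint32_t *exponent, uint32_t *mantissa) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* copy the raw bytes, not the value */
    *sign     = bits >> 31;
    *exponent = (bits >> 23) & 0xFF;  /* stored exponent; subtract 127 for the real one */
    *mantissa = bits & 0x7FFFFF;      /* fraction bits; the leading 1 is not stored */
}
```

For 1.5, which is binary 1.1 times 2 to the 0, the sign field is 0, the exponent field is 127 (0 plus the bias), and only the mantissa's top bit is set.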
Now our mantissa is between one and two: even though 23 bits gave us numbers between 0 and 2 to the 23, we scaled them down to get numbers between 0 and 1, and then we added an extra 1 to get numbers between 1 and 2. And this already is the main part of IEEE Standard 754, or at least the so-called normalized numbers. The informed viewer knows that the standard also includes denormalized numbers, NaNs, infinities, and two zeros, but we won't go into those, because in Quake 3 it just happens that these are never inputs into our algorithm. Otherwise, something definitely has gone wrong; at no point should our game engine have to normalize a vector of infinite length.

For this algorithm, and for the rest of this video, it will be useful to think of the mantissa and exponent as the binary numbers they are. If we are given two numbers, the mantissa M and the exponent E, 23 bits and 8 bits respectively, we can get the bit representation as E times 2 to the 23, plus M. This works because multiplying E by 2 to the 23 just shifts E up by 23 digits. So that's how one could write the bits, but we get the actual number behind the bits with this formula. This should seem familiar to you: here we have the exponent with 127 subtracted from it, and here we have the mantissa with the extra one in front.

But now for something completely different: for no obvious reason at all, let's take the logarithm of that expression. Since we're doing computer science here, we take the logarithm base 2. We simplify as much as we can and take out the exponent, but then we get stuck. But not so the Quake developers; Gary Tarolli knew a trick to get rid of the logarithm. You see, the trick is an approximation to log of 1 plus x: for small values of x, log of 1 plus x is approximately equal to x. If you think about it, this approximation is actually exact for x equals zero and x equals one, but we'll add an additional correction term mu.
This correction term can be chosen freely; with mu equal to zero the approximation is still exact at zero and one, but it turns out that setting mu to this particular number gives the smallest error on average for numbers between zero and one. So, going back to our formula, we apply our trick, since M divided by 2 to the 23 is indeed a value between 0 and 1. We rearrange a little more, and we finally see why we did all those calculations: M plus E times 2 to the 23 appears. That's our bit representation. So let's think about what we just did: we applied the logarithm to our formula and got the bit representation, just scaled and shifted by some constants. In some sense, the bit representation of a number is its own logarithm.

Armed with this knowledge, we can finally start with the three steps of the fast inverse square root. The first step is actually not complicated; it just looks complicated, because it's memory address trickery. We stored our number into y, and now we want to do cool bit manipulation tricks. Floats, unfortunately, don't come with the tools we need for bit manipulation; the reason is that they were never designed for it. Floats are inherently tied to IEEE Standard 754. Longs, on the other hand, were designed for bit manipulation. For example, here's one trick: bit shifting a long to the left doubles it, and bit shifting it to the right halves it. And yes, if your number is odd, you do end up rounding, but hey, we're willing to accept such inaccuracies as long as it means our algorithm is fast.

C, as pretty much all programming languages do, provides a way to convert from a float to a long. This conversion does what most programmers need it to do, namely convert from a decimal number to an ordinary integer as best it can. So if we give it a float like 3.33, it converts it to an integer, in this case 3. But this is not the conversion we need here.
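A minimal sketch of the two kinds of conversion, assuming the helper names are mine; `memcpy` is used here as the strict-aliasing-safe way to copy bits (the Quake source itself does it with pointers, as described next):

```c
#include <stdint.h>
#include <string.h>

/* Value conversion: C rounds the float toward zero and
   produces a completely different bit pattern. */
int32_t value_conversion(float y) {
    return (int32_t)y;
}

/* Bit conversion: copy the float's 4 bytes unchanged into a
   32-bit integer, keeping the IEEE 754 bit pattern intact. */
int32_t bit_conversion(float y) {
    int32_t i;
    memcpy(&i, &y, sizeof i);
    return i;
}
```

For 3.33, the value conversion yields 3, while the bit conversion yields the raw IEEE 754 pattern, a number above one billion.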
First of all, we don't care about the resulting integer; we want to somehow keep our float. And secondly, the bits that lie behind our number get all messed up. We don't want this conversion to mess with our bits; the only thing we want is to put the bits one-to-one into a long. The way you achieve this is to convert the memory address, not the number. First we get the address of y; this is the address of a float. Then we convert that address from a float's address to a long's address. The address itself doesn't change, but C now thinks that the number living at that address is a long. Then we read what's written at that address; because C now thinks this is the address of a long, it reads the number at that address as if it were a long. Like this, we tricked C by lifting the conversion away from the number itself to the address of that number, and that's how we get the bits of a number into i. I don't know what else to say; that's just how C works. So let's just go to the next step.

The intuition behind the second step is the following. Remind yourself: bit shifting a number to the left doubles it, and shifting it to the right halves it. But what would happen if we did something like this to an exponent? Doubling an exponent squares the number, halving the exponent gives us the square root, and additionally negating it gives us 1 divided by the square root of x. That's exactly what we need. So let's remind ourselves what our goal is. We have our number stored in y, and our goal is to calculate 1 divided by the square root of y. As I've already said, calculating this directly is too hard and too expensive, but we've extracted the bits from y, and we've seen with IEEE Standard 754 that the bits of a number are, in some sense, its own logarithm. That means that in i we have stored log of y, up to some scaling and shifting. I claim that our problem becomes way easier if we work with logs.
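This "the bits are the logarithm" relationship can be checked numerically. A sketch, assuming positive normalized floats; the helper name `log2_from_bits` is mine:

```c
#include <stdint.h>
#include <string.h>

/* Estimate log2(y) straight from y's bit pattern:
   bits(y) = 2^23 * (E + m), so bits(y) / 2^23 - 127 ≈ log2(y),
   up to the small error of the log(1 + x) ≈ x approximation. */
float log2_from_bits(float y) {
    uint32_t i;
    memcpy(&i, &y, sizeof i);
    return (float)i / (1 << 23) - 127.0f;
}
```

For powers of two like 8, the result is exactly 3; for other inputs the estimate is off by at most about 0.09, and less once the mu correction is added.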
Instead of trying so hard to calculate 1 divided by the square root of y, we instead calculate log of 1 divided by the square root of y. We rewrite this as log of y to the power of minus one half, so we can take out the exponent, and calculating that is stupidly easy. You might think: oh no, we have a division in there; didn't you say in the beginning that divisions are slow? Well, yes, but remember that we can do bit shifts now. Instead of dividing by 2, we just bit shift once to the right. This already explains why we compute minus i bit shifted once to the right. But why is this number 0x5f3759df there?

Well, because our logarithm is actually scaled and shifted, so let's calculate and understand where it comes from. Let gamma be our solution. Then we know that log of gamma equals log of y to the power of minus one half, which equals minus one half times log of y. Now we replace the logarithms with the bit representations, and then we just solve for the bits of gamma. I'll spare you the details, but this is the result: the magic number turns out to be the remnants of the error term mu, the scaling factor, and the shifting.

Now we have the bits of our solution, and we can just reverse the steps from the evil bit hack to get back the actual solution from those bits. Well, actually not the exact solution, just an approximation. This is why we need the third step. After the previous step we have a pretty decent approximation, but we did pick up some error terms here and there. Thanks to Newton's method, however, we can make a really good approximation out of a decent one. Newton's method is a technique that finds a root of a given function, meaning it finds an x for which f of x equals zero. It does so by taking an approximation and returning a better approximation.
And usually you repeat this process until you're close enough to the actual solution, but it turns out that here we are already close enough that one iteration suffices to get an error within one percent. The only things Newton's method needs are the function and its derivative. What Newton's method does is take an x value and try to guess by how much it is off from being a root. It does so by calculating f of x and its derivative. We can write f of x as y and the derivative as dy over dx. We now have the ratio between y and the x offset, and we have y itself, so to get the x offset, we just divide y by that ratio. Then we simply subtract this offset to get our new x. The informed viewer can now verify that the last line of the code is one such Newton iteration applied to the function f of y equals 1 divided by y squared, minus x. Notice that y being a root of this function is equivalent to y being the inverse square root of x. I really encourage you to verify this last line of code, since it's really surprising: even though both the function and Newton's method have a division in them, the code does not, which means that our algorithm is and stays fast.

Now we finally understand the fast inverse square root. It only took us the knowledge of IEEE Standard 754, a trick to outsmart the C programming language, magic bit operations, and the calculus behind Newton's method.
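Putting the three steps together, the routine discussed in this video can be sketched as follows. The variable names follow the description above; as an assumption on my part, `memcpy` stands in for the original's pointer casts to keep the sketch strict-aliasing-safe:

```c
#include <stdint.h>
#include <string.h>

/* Fast inverse square root: approximate 1/sqrt(number).
   Step 1: reinterpret the float's bits as a 32-bit integer.
   Step 2: "halve and negate the logarithm" with a shift and the magic constant.
   Step 3: one Newton iteration on f(y) = 1/(y*y) - x to polish the result. */
float Q_rsqrt(float number) {
    const float threehalfs = 1.5f;
    float x2 = number * 0.5f;
    float y  = number;
    int32_t i;
    memcpy(&i, &y, sizeof i);           /* evil bit hack, aliasing-safe */
    i = 0x5f3759df - (i >> 1);          /* what the ...? */
    memcpy(&y, &i, sizeof y);
    y = y * (threehalfs - x2 * y * y);  /* 1st Newton iteration: no division */
    return y;
}
```

Note that the last line computes y times (3/2 minus x/2 times y squared), which is exactly y minus f(y)/f'(y) for f(y) = 1/(y*y) - x with the divisions cancelled out.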