Transcript:
Hi, folks, my name is James and I'm a software engineer on the PyTorch team. We've already mentioned in this talk that two of the core design principles of PyTorch are developer efficiency and building for scale, but how do we bridge the gap between those and make a framework that can support both? In this section of the talk, I'm going to dive deep into TorchScript and the PyTorch JIT and show how those technologies allow us to deliver a top-notch developer experience while still delivering top machine performance and efficiency.

Let's elaborate on one way that we actually deliver developer efficiency, one of the core principles we drive forward in PyTorch: building models is just programming. Building models leverages a familiar imperative and object-oriented style, which we believe many people find appealing. Not only that, but a key point is that building PyTorch models is just programming in Python. Let's see why this is such an important consideration.

Why do people like PyTorch? Well, it's Pythonic. What does that mean? When we built PyTorch in the Python programming language, we inherited a design language: being Pythonic. You may have heard of the Zen of Python, the short poem you'll see listed on the right of the screen. It espouses several principles that we, as well as the rest of the Python ecosystem, try to follow. How do we espouse these in PyTorch specifically? One way is that things are simple and explicit: models are object-oriented Python programs, and they use all the familiar concepts from regular Python programming. Second, things are easy to use because they're debuggable: you can use all of your regular print statements, you can use pdb to debug your code, and you can use the Python interpreter to try out different ideas and test things outside of the full program. Finally, by building our library in Python, we're able to plug into the rest of the Python ecosystem, especially the numeric computing ecosystem, so you can use those tools seamlessly with your PyTorch model.

So let's take a look at an example of a fragment of PyTorch code. On the right you'll see a simple convolutional neural net defined in PyTorch. In the __init__ method, also called the constructor, you'll see that we're setting up various basic building blocks for the convolutional net: we're instantiating objects from the torch.nn namespace. You can think of the nn namespace as sort of the standard library for PyTorch. These are fragments of code that we've found repeatedly useful, think convolutional blocks, dropout, linear layers, things like that, which we've made available to use out of the box, batteries included. We set these up, and that initializes state such as the weights for the different layers. In the forward method, we specify how the data actually flows through this convolutional neural network. We've introduced a new concept in forward here, which is the F identifier. F is a shorthand for torch.nn.functional, which is basically a set of standard library functions that do not have associated parameters, so we can use them as regular functions. You'll see in the forward method that we're stringing together the convolutional layers, the pooling layers, and the ReLU activation layers into the actual architecture of the convolutional net.
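To make that concrete, here is a minimal sketch of the kind of convolutional net being described; the layer sizes and the assumed 1x28x28 input are illustrative choices, not the exact values from the slide.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Parameterized building blocks from the torch.nn "standard library"
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3)
        self.fc = nn.Linear(32 * 5 * 5, 10)

    def forward(self, x):
        # String together conv, pooling, and ReLU layers using the
        # parameter-free functions in torch.nn.functional (F)
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = torch.flatten(x, 1)
        return self.fc(x)

# Ordinary Python usage: instantiate and call like any other object
out = SimpleConvNet()(torch.randn(1, 1, 28, 28))
print(out.shape)  # torch.Size([1, 10])
```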
It's worth noting that all this code is just Python code, and you can do whatever you would do with normal Python code in terms of debugging and interfacing with other libraries.

So Python is great and all, but what about production? I started this section out talking about how we can build a framework that allows us to deliver scale and performance, so how do we bridge the Python ecosystem with those requirements? When we think about production deployments of neural nets, there are two basic requirements that consistently show up. The first is portability: models should be exportable to a wide variety of environments. For example, not only should they be runnable from a Python interpreter process, but we should also be able to run them in a C++ server handling concurrent requests at scale, or run them on a mobile device or an embedded device. We should be able to move our model code wherever we want, unconstrained by the Python language. The second requirement is performance. You can split the performance requirement into two sides. One side is the latency-constrained side: think of applications such as speech recognition or real-time translation that have hard deadlines on the response of the neural net. The other side of the coin is the throughput-oriented cases: if you're scaling the inference of a neural net out to thousands or even millions of servers, any improvement in the throughput of the model could translate into literally millions of dollars of savings in server and electricity costs.

So how has PyTorch historically done with these requirements? For portability, PyTorch models were tightly coupled to the Python runtime, which meant that exporting them and running them on a mobile device, for example, was difficult. For performance, PyTorch's performance was pretty good, but there were numerous further optimization opportunities left on the table that we can't exploit with the level of dynamism in the Python language. Now you might stop me and say: why not use a static framework? Why not use the right tool for the job, using PyTorch for your experimentation and research and a different framework like Caffe2 for production? We've actually tried this, but converting models between PyTorch and Caffe2 became a bottleneck in our internal usage at Facebook. Additionally, people want a Pythonic experience all the way through; they don't want to have to leave this good user experience to deploy their models to production.

Having said all this, we can define some requirements for what the system needs to do. We need it to: one, faithfully capture the structure of PyTorch programs with minimal user intervention; and two, use that structure to optimize, export, and run the models.

Let's take a look at how TorchScript and the PyTorch JIT fulfill requirement number one, capturing the structure of PyTorch programs. For the purposes of understanding TorchScript, we need to define the two modes PyTorch code can run in. The first is eager mode, which is the normal Python runtime mode that I've explained before, and the second is script mode. Script mode is when your code is run in our own runtime, called the TorchScript runtime. This runtime is separate from the Python interpreter and allows us to do things like run code in parallel threads and perform many performance optimizations on the code.
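As a minimal sketch of the two modes, assuming nothing beyond the torch package itself: the same function can run eagerly under the Python interpreter, or be compiled with torch.jit.script and executed by the TorchScript runtime.

```python
import torch

# Eager mode: ordinary Python, executed by the Python interpreter.
def scale_and_sum(x, y):
    return (x * 2 + y).sum()

# Script mode: the same logic compiled into a TorchScript function that
# the TorchScript runtime can execute independently of Python semantics.
scripted = torch.jit.script(scale_and_sum)

x, y = torch.randn(3), torch.randn(3)
print(scale_and_sum(x, y))  # eager execution
print(scripted(x, y))       # executed by the TorchScript runtime
```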
So we have these two modes, but we need tools to transition between them. The first tool for transitioning from eager mode to script mode is called the tracer. The API for the tracer is the torch.jit.trace function. Basically, what this function does is take an existing eager model and the example inputs that you provide, run that model, and record all of the tensor operations that happen while the model is being run. We turn that recording into a TorchScript module. What this means is that you can reuse all your existing eager model code and easily convert it into TorchScript using the tracing framework. However, tracing does not preserve control flow and other language features like data structures. Those are all erased, and the only thing that's preserved is the tensor operations, so things like language models with dynamic control flow will not be faithfully captured by the tracing mechanism.

To get around this limitation, we built a second tool for transitioning your code from eager mode to script mode: the script compiler. The API for the script compiler is the torch.jit.script function. What this does is go in and directly lex and parse all of the Python code, so it has full visibility into everything that's happening in the code, unlike the tracer. You'll see in the example on the right that the forward method has a for loop over the size of the zeroth dimension of the x tensor, and it has control flow where we print statistics on every tenth iteration. Things like this would not be captured by the tracer. The script compiler supports a subset of the Python language, but that subset is becoming richer and more expansive as time goes on; for example, we now support creating user-defined classes and calling methods on them, as well as binding in classes from C++ and calling into those. To debug code that's been compiled by the script compiler, you can simply remove the script call and debug it as normal Python code.

So we've covered how we faithfully capture the structure of PyTorch programs. Now let's look at how we use that structure to optimize and run the programs. The first thing we can do with this captured structure is serialize it. In the code on the right, you'll see an example of tracing a torchvision ResNet-18 model, saving it to a serialized zip file, and then, in a C++ process, using the torch::jit::load API to load that model in, run it, and get the results. What this demonstrates is that TorchScript code can be run entirely separately from the Python runtime. This is useful, for example, for running this code on a server or on a mobile device.

Now, what do I mean by this captured structure? Let's get explicit about this. When you trace or script-compile your PyTorch code, this is the actual artifact that's produced. You can see we've captured a list of operations that are actually occurring on values in the code. We have lots of information here, such as the scalar type of each tensor, the shapes of the tensors, and the actual data flow between operations in the code. A few key design principles of this representation are listed here. One is that it is statically typed. Static typing allows us to make better decisions about optimizations and to provide better and earlier error messages to the user.
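Here is a rough sketch of both conversion paths and of saving a traced model, assuming torchvision is available; the loop module below is illustrative rather than the exact code from the slide.

```python
import torch
import torchvision

# Tracing: run the model once with example inputs and record the tensor
# operations into a TorchScript module. The result can be saved and later
# loaded from C++ via torch::jit::load.
resnet = torchvision.models.resnet18().eval()
traced = torch.jit.trace(resnet, torch.randn(1, 3, 224, 224))
traced.save("resnet18.pt")

# Scripting: lex and parse the Python source directly, so data-dependent
# control flow like this loop is preserved in the compiled module.
class LoopModule(torch.nn.Module):
    def forward(self, x):
        acc = torch.zeros_like(x[0])
        for i in range(x.size(0)):
            acc = acc + x[i]
            if i % 10 == 0:
                print(acc.sum())  # e.g. print statistics on every 10th iteration
        return acc

scripted = torch.jit.script(LoopModule())
print(scripted(torch.randn(20, 4)))
print(scripted.graph)  # the captured, statically typed IR
```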
Second, we use structured control flow, which isn't shown in this example, but it exists: we support arbitrarily nested ifs and loops. So if you need to write something like a recurrent network loop, you can just use the familiar Python looping constructs, and they will be compiled into our IR as structured control flow constructs. Third, this representation is functional by default. The reason this is useful is that it means we can better reason about which transformations are legal on this representation, and so we can make even better optimization decisions.

So we're talking about optimizations; what kinds of optimizations do we want to do? Some examples include algebraic rewriting: we can fold constants and precompute them at compile time, we can eliminate common subexpressions and compute them only once, and we can eliminate code that's never used (dead code elimination). Another thing we can do is out-of-order execution: we can move things around to reduce memory pressure and make efficient use of cache locality. We can fuse operations together: we can combine several operations into a single kernel to avoid overheads from round trips to memory and over PCIe. And we can do target-dependent code generation: we can take sequences of operations and lower them into machine code for different platforms, often mediated by third-party libraries such as TVM, Halide, Glow, and XLA. Throughout all these optimizations, the guarantee we provide is that we preserve the same semantics. You should get these optimizations for free; there should never be a case where they actually change the result of your program.

Now, all this optimization sounds well and good, but we started out talking about the flexibility of the Python language. The term we use for that is dynamism. How do we deal with dynamism in TorchScript code? Even with TorchScript's more static semantics, we still want to preserve the flexibility and ease of use of eager mode, which means there's still a lot of dynamism left. Look at the example here: it's a simple LSTM code fragment. Can we fuse the LSTM cell and emit machine code for it? To answer that question, we need to answer several other questions first. For example: what devices are each of these tensors on? How many dimensions do they have? How big are those dimensions? And are we using autograd, so that we need to preserve certain values for the backward pass? We don't actually have that information just from compiling this code; it's something we can only observe at runtime. However, many of the answers to these questions are likely to remain static over multiple calls to the model. Whenever you have something that can be dynamic but is likely static, a technique called just-in-time compilation may be useful.

Here we see an overview of the optimization pipeline using just-in-time compilation. First, we collect statistics and information about what is actually happening in the program at runtime: we might collect the shapes of tensors, which devices they reside on, or whether they require a gradient. Once we've collected that information from one or more runs of the model, we pass it on to the next stage of the pipeline to do optimization, where we can apply many of the optimizations I mentioned earlier: fusion, rewriting, and code generation.
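To make the fusion question concrete, here is a hedged sketch of an LSTM cell written as a scripted function, in the spirit of the recurrent-network work referenced later in the talk; the signature and shapes are illustrative assumptions, not the exact code from the slide.

```python
import torch
from typing import Tuple

@torch.jit.script
def lstm_cell(x: torch.Tensor, h: torch.Tensor, c: torch.Tensor,
              w_ih: torch.Tensor, w_hh: torch.Tensor,
              b_ih: torch.Tensor, b_hh: torch.Tensor
              ) -> Tuple[torch.Tensor, torch.Tensor]:
    # The pointwise ops after the two matrix multiplies are candidates for
    # fusion into a single kernel, but whether that is profitable depends on
    # runtime facts: device, dtype, shapes, and whether autograd needs to
    # keep intermediate values for the backward pass.
    gates = torch.mm(x, w_ih.t()) + b_ih + torch.mm(h, w_hh.t()) + b_hh
    i, f, g, o = gates.chunk(4, 1)
    i = torch.sigmoid(i)
    f = torch.sigmoid(f)
    g = torch.tanh(g)
    o = torch.sigmoid(o)
    c_new = f * c + i * g
    h_new = o * torch.tanh(c_new)
    return h_new, c_new

# Illustrative shapes: batch 8, input size 32, hidden size 64
h, c = lstm_cell(torch.randn(8, 32), torch.randn(8, 64), torch.randn(8, 64),
                 torch.randn(256, 32), torch.randn(256, 64),
                 torch.randn(256), torch.randn(256))
```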
Once these optimizations kick in, we have an optimized version of the program that is specialized to the behavior we've actually observed, and we can hand that off to the interpreter. You can think of the TorchScript interpreter as a virtual machine, like the Java Virtual Machine: we execute this optimized program and do other runtime tricks like operation scheduling and parallelism to make it even faster.

So do models actually get faster? The answer is yes, in many cases they do. You can look online and find a blog post we put out called "Optimizing CUDA Recurrent Neural Networks with TorchScript." On the slide here you can see a chart of the actual runtime in milliseconds of the LSTM model. On the x-axis are the different optimizations we've applied through just-in-time compilation; as we apply each of these optimizations, the LSTM model gets faster and faster, until finally we approach or exceed the performance of the handwritten cuDNN runtime.

To recap: TorchScript and the PyTorch JIT comprise a compiler infrastructure that enables the gradual and fast transition of PyTorch code from research to production without compromising on the user experience. For further learning, you can try out the interactive tutorial: please Google "Introduction to TorchScript" or use the URL listed on the screen. This tutorial will walk you through the user-facing APIs and show how you can convert PyTorch code to TorchScript and compose the different techniques, such as tracing and scripting, together to create a full representation of your deep learning model.
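As a small, hypothetical sketch of what composing tracing and scripting can look like (the module names and sizes here are made up for illustration, not taken from the tutorial):

```python
import torch
import torch.nn as nn

# A feed-forward block with no control flow: a good fit for tracing.
class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(16, 16)

    def forward(self, x):
        return torch.relu(self.linear(x))

# A wrapper with control flow: compiled with the script compiler,
# calling into the traced submodule.
class Wrapper(nn.Module):
    def __init__(self):
        super().__init__()
        self.block = torch.jit.trace(Block(), torch.randn(1, 16))

    def forward(self, x):
        for _ in range(3):      # loop preserved by the script compiler
            x = self.block(x)
        return x

model = torch.jit.script(Wrapper())
model.save("wrapped.pt")        # a full TorchScript representation of the model
print(model(torch.randn(2, 16)).shape)
```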