Versioning data sets and models is really cool, but it isn't enough. We need to know not just where a model came from, in terms of which storage or repository, but also how it was actually produced. How did somebody transform some data set, through a series of computational steps, into a model? So what we're going to cover here is how to make a pipeline, so you can record and reproduce all of the computational steps that transform data into models.

All right, to get started, let's head over to the docs. There's some code that you can follow right here. You can just copy and paste it, and it's going to get you the code that you'll need for this project. What we've got is this folder, src, and it's full of a couple of Python scripts. You can think of these as typical stages of a machine learning project. We've got a step to prepare our data. Oh, and the data set: we got that in the last video. I'll put a link in, but if you need the data set, follow along from the last video or follow the series. After preparing the data, we can featurize, that is, convert our raw data into some kind of features. Then we can train a model, and then we can evaluate our model.

Right now, all of these steps are in their own Python scripts, and we could just run them sequentially, but that's probably not rigorous enough for our purposes. So we're going to introduce a DVC pipeline here.

The first thing I'm going to do, actually, is remember to install all the requirements. We've got a requirements file here, so I'll do pip install -r with that file. OK, that should install everything that we need. Cool, I feel pretty good about that. Oh, and just for the record, don't pay too much attention to the code example itself. Don't worry about the code. It's just a pretty typical project. So, going back to our docs here.
There's this first command that we can run, and this is one way to start a pipeline. I'm going to copy it, and then we'll talk about what it means later. It's kind of a long one. So we run this and we get the message "Running stage 'prepare'". After we do this, we get a couple of new files. We get this file, dvc.yaml. dvc.yaml is a human-readable file that's a representation of what we just ran.

There are two ways to write a pipeline in DVC. One is to use this command, dvc run, and the other is just to edit a YAML file following this format. If you don't have a dvc.yaml file, dvc run will create it for you. So we just said dvc run, and we passed a couple of arguments: the name of the first stage, some of the parameters involved, which come from our params.yaml file and are in fact hyperparameters, an output, and a dependency. Actually, we have two dependencies: our data set and the Python script that's used for this stage. And finally, we have the command that is run at this stage.

All of that is represented here in the YAML. We have the name of our stage, the command that we're going to run, our dependencies, which are our data set and the code that we're running, then some parameters from our params.yaml file, which are the hyperparameters of the experiment, and then our output, which is our prepared data.

OK, cool. Let's add another stage now. I'm just running clear here to make a little more space. All right, let's add the next stage. Let's do dvc run, and this one is going to be -n featurize, so we're saying this stage should be called featurize. New line. In this case we've got some parameters, and these are declared here in the featurize stage.
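To make the link between the dvc run arguments and dvc.yaml a bit more concrete, here is a minimal Python sketch of the mapping for the prepare stage. This is only an illustration, not DVC's own code. The specific paths and parameter names (data/data.xml, prepare.seed, prepare.split) are assumptions taken from the DVC docs rather than from this walkthrough, and the two helper functions are hypothetical names made up for the example.

```python
import shlex

# Hypothetical description of the "prepare" stage. The paths and parameter
# names are assumptions based on the DVC docs, not on this walkthrough.
prepare = {
    "name": "prepare",
    "params": ["prepare.seed", "prepare.split"],
    "deps": ["src/prepare.py", "data/data.xml"],
    "outs": ["data/prepared"],
    "cmd": "python src/prepare.py data/data.xml",
}


def dvc_run_args(stage):
    """Build the argument list for `dvc run` from a stage description."""
    args = ["dvc", "run", "-n", stage["name"]]
    if stage["params"]:
        # -p takes a comma-separated list of keys read from params.yaml
        args += ["-p", ",".join(stage["params"])]
    for dep in stage["deps"]:
        args += ["-d", dep]
    for out in stage["outs"]:
        args += ["-o", out]
    # Everything after the flags is the command the stage runs
    return args + shlex.split(stage["cmd"])


def dvc_yaml_entry(stage):
    """Render the same stage the way it is laid out under `stages:` in dvc.yaml."""
    lines = ["stages:", f"  {stage['name']}:", f"    cmd: {stage['cmd']}"]
    for key in ("deps", "params", "outs"):
        lines.append(f"    {key}:")
        lines += [f"    - {item}" for item in stage[key]]
    return "\n".join(lines)


print(shlex.join(dvc_run_args(prepare)))
print(dvc_yaml_entry(prepare))
```

The point is that both forms carry the same five pieces of information: a name, a command, dependencies, parameters, and outputs. In recent DVC releases, dvc stage add takes the same -n, -p, -d and -o flags, so check which one your installed version expects.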
For featurize, we've got max_features and ngrams, so the way we represent them is featurize.max_features and featurize.ngrams. Now our dependencies are going to be the script that we're running as well as our prepared data, which you might remember is the output of the last stage. And now let's do our outputs. Here it's data/features that's going to be the output, and I just know that from the script. This would be custom for whatever your script is. Finally, I'll give the command that will run when we run this stage, so let's do python src/featurization.py, and it's going to take the arguments data/prepared and data/features. OK, and now it is creating the stage and running it. We get some print statements from our Python script, featurization.py, so we know that this is running.

Now let's add another stage. Let's go back to our dvc.yaml. This file now has the stage prepare as well as featurize. Let's add the last one this way. Well, I don't know if it'll be the last, but let's make the next stage this way: train. For this one we have our command, and the command is going to be python src/train.py data/features model.pkl. So we're saying we're going to be running this Python script, and it's going to take these arguments, data/features and model.pkl, where model.pkl is our output. Dependencies come next, and the dependencies for this are going to be src/train.py and data/features. Now, for this stage I'm going to have some params. For params we'll have train.seed and train.n_estimators. And finally, for outs, we're going to do model.pkl. OK, cool, so I'm saving this.

Now if I run this command, dvc repro, it's going to reproduce my DVC pipeline, and you might notice that it's got some messages here. Stage prepare didn't change, skipping.
Featurize didn't change, skipping. And then: running stage train, with the command here. So it only ran the training stage, and that makes sense, because we previously ran the other stages. We ran prepare and we ran featurize. Only train is new. DVC remembers what you've recently run, so it can save you some computation time.

All right, let's try to have some fun with it. If I go to params.yaml, let's change, under train, the number of estimators to 100, and save it. Now let's see what happens when I run dvc repro. dvc repro is going to try to reproduce the whole pipeline, but it'll remember if a part of it doesn't need to be rerun because it's still up to date. So prepare hasn't changed, featurize hasn't changed, but train did change. And we get this message now: it updated the file dvc.lock.

Just to give you a peek at what's happening under the hood, dvc.lock is our not-so-human-readable file. It's a counterpart to dvc.yaml, but where dvc.yaml is human readable and writable, dvc.lock is not, so just don't touch this file. It contains hashes, codes that correspond to the versions of each stage's dependencies as well as its outputs. That's how we keep track of whether we need to rerun a stage.

OK, so now suppose I go back to my parameters and change the number of estimators back to 50. I'm going to run dvc repro, and what do you think will happen? If you guessed that we won't have to rerun the train stage, you were right, because that result is currently cached. We can just pull back the output from a couple of minutes ago and put it here. So, great, that could save you some time.

The last thing I want to show you here is a visualization command. I'm going to run clear to make a little bit of room, so we can really see this.
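Before the visualization, here is a toy Python sketch of the skip, rerun, and restore-from-cache behavior we just saw with the train stage. It is a simplified model under stated assumptions, not how DVC is actually implemented: the class and function names are hypothetical, SHA-256 is just a convenient hash for the example, and short strings stand in for the real contents of the dependency files.

```python
import hashlib
import json


def fingerprint(deps, params):
    """Hash a stage's inputs: dependency contents plus parameter values."""
    payload = json.dumps({"deps": deps, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


class ToyPipelineCache:
    """Toy stand-in for dvc.lock plus a cache of earlier runs (not DVC's code)."""

    def __init__(self):
        self.lock = {}       # stage name -> fingerprint of the last run
        self.run_cache = {}  # fingerprint -> output saved from an earlier run

    def repro(self, name, deps, params, run):
        key = fingerprint(deps, params)
        if self.lock.get(name) == key:
            return "skipped"    # inputs unchanged since the last run
        if key in self.run_cache:
            self.lock[name] = key
            return "restored"   # seen before: reuse the saved output
        self.run_cache[key] = run(params)
        self.lock[name] = key
        return "ran"


def train(params):
    # Placeholder for the real training script
    return f"model with {params['n_estimators']} estimators"


cache = ToyPipelineCache()
deps = {"src/train.py": "v1", "data/features": "v1"}  # stand-ins for file contents
for n in (50, 50, 100, 50):
    status = cache.repro("train", deps, {"seed": 1, "n_estimators": n}, train)
    print(n, status)
```

The four calls mirror the demo: the first run with 50 estimators executes, repeating it is skipped, switching to 100 executes again, and going back to 50 is restored from the earlier run instead of retraining.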
So if I do dvc dag, we get a picture of the pipeline that we currently have, which is prepare to featurize to train. Soon we'll add one more stage to evaluate the model, and then this DAG will look even cooler.

Now you know how to start using pipelines in DVC. Pipelines are a great way to make your projects more reproducible, easier to debug, and easier to iterate on. Pipelines can also help you use powerful automation tools like continuous integration systems, because they're machine readable. To learn even more about the kinds of things you can do with DVC pipelines, keep reading our docs. And for now, DeeVee and I thank you for watching.
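As a recap of where the prepare to featurize to train ordering comes from, here is a small Python sketch that derives it from each stage's dependencies and outputs. It only illustrates the idea that a stage depends on whichever stage produces its inputs. It is not DVC's implementation, the stage_graph helper is a hypothetical name, and the prepare stage's paths are assumptions taken from the DVC docs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Stage inputs and outputs as described in this walkthrough. The prepare
# stage's paths are assumptions taken from the DVC docs.
stages = {
    "prepare": {
        "deps": ["src/prepare.py", "data/data.xml"],
        "outs": ["data/prepared"],
    },
    "featurize": {
        "deps": ["src/featurization.py", "data/prepared"],
        "outs": ["data/features"],
    },
    "train": {
        "deps": ["src/train.py", "data/features"],
        "outs": ["model.pkl"],
    },
}


def stage_graph(stages):
    """Map each stage to the set of stages whose outputs it depends on."""
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    return {
        name: {producer[dep] for dep in s["deps"] if dep in producer}
        for name, s in stages.items()
    }


graph = stage_graph(stages)
order = list(TopologicalSorter(graph).static_order())
print(" -> ".join(order))  # prepare -> featurize -> train
```

Adding an evaluate stage that depends on model.pkl would simply extend the chain by one more node.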