>> You're not going to want to miss... I can't even speak today. You're not going to want to miss this episode of the AI Show, where we talk about what to do when your data changes and your machine learning models go out of date. We'll show you how to fix it in this next episode of the AI Show. Make sure you tune in. [MUSIC]
>> Hello, and welcome to this episode of the AI Show. We're going to talk about something pretty serious: data, and how it literally drifts. Tell us who you are and what you do, my friend.
>> My name is Cody Peterson. I'm a Product Manager in the Azure Machine Learning service and the PM for data drift.
>> Okay. So when it comes to machine learning, obviously you need data, because data is like the lemons we squeeze the juice out of for machine learning. Can you talk about datasets briefly, and then let's get into what you do when your data changes?
>> Yes. As you said, data is really the big difference between machine learning development and regular code development: data evolves over time, and it changes. Datasets in the Azure Machine Learning service give you easy access to your data for training machine learning models, and there's the newly introduced dataset monitoring capability, which lets you monitor how that data changes over time and be quickly alerted to issues in production.
>> So let me see if I understand this right, because I know that a dataset is basically like a pointer to some data.
>> Yeah.
>> But as a pointer, things can change underneath, right? You said there's this new data monitoring. Can you describe what that is and how it can help us when data drifts?
>> Yeah. The first thing to talk about is, just like you said, new data arrives over time, so we built into datasets this time series trait, which can account for that. Basically, you set up a timestamp column, either from an actual column in the data or from how the data is partitioned.
>> For instance, if I partition my data by year/month/day, our system knows that and can monitor how the data is changing over time. What the drift capability does is basically set up a pipeline that looks at that data over time and compares it to a baseline. This baseline is usually something like your model's training dataset or an earlier slice of data.
>> So here's a question. If I have a dataset, that's a pointer, and I know that dataset was used to train a model, but the underlying data the dataset points to can still change, is that right? So when I train, how do I know which subset of the data was used for training? And then how can I figure out the difference between what was trained on and what's coming into the dataset now?
>> Yeah, this is all tracked in the Azure Machine Learning service. If you use a dataset to train a model, that gets logged, and you can go back and see exactly what data you trained with and reproduce that model. Now, if you set up a new dataset which has this time series trait and is taking in the serving data, the data your model is seeing in production over time, that's what you want to monitor and compare against the baseline dataset, which could be the model's training dataset.
>> So you mentioned two different styles, right? There's what you trained it on and then the diff, and then there's an even more fine-grained one where each record has a timestamp, so you can see drift a little more granularly.
>> Yes. The actual product is very generic: you can use it on any time series data, completely outside the context of machine learning. But in the Azure Machine Learning service we're really targeting scenarios where you trained a model and want to make sure the data you're seeing in production fits well with that training data, so that your model performs as expected.
>> All right. I feel like we've talked about it. Can we take a look?
>> Sure.
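Setting up the time series trait described above is a one-liner in the Python SDK. The sketch below is illustrative, not from the episode: the workspace setup, dataset name, and column name are assumptions, and it requires the `azureml-sdk` package and a configured Azure ML workspace to actually run.

```python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()  # assumes a local config.json for your workspace

# Load a registered tabular dataset and declare which column carries the
# timestamp, giving the dataset the "time series trait" that drift
# monitoring needs. Names here are illustrative.
weather = Dataset.get_by_name(ws, "noaa-isd-florida")
weather = weather.with_timestamp_columns(timestamp="datetime")
weather.register(ws, name="noaa-isd-florida-ts", create_new_version=True)
```

With the timestamp column set, the service can slice the dataset by time and compare each slice against a baseline.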
>> So this is the results page for data drift in the new Azure Machine Learning studio. We have four charts today, starting with a drift overview. This shows a 0-100 percent data drift percentage changing over time.
>> Okay.
>> This gives you an indication of how much your data is drifting over time. We can see on this chart that I've set a threshold of 60 percent. If that threshold gets breached, which seems to occur around April 21st, I get an e-mail alert, which looks something like this. So now I have an alert, I know that my data has drifted, and I can go and investigate why.
>> So if we go back, what is the delta that it's measuring? You see what I'm saying? I know you talked about the baseline and the target, but can you tell me, is this in miles? What are we measuring?
>> Yeah, it's pretty cool. We actually use machine learning ourselves in the background. You can think of this percentage as roughly the accuracy of the model that we build to tell the baseline dataset apart from each slice of the target dataset.
>> So you're actually running inference on the new data that's coming in and seeing how well it's doing, is that right?
>> Somewhat. We're building a separate drift model, which looks at the baseline and compares it to the inference data.
>> I see. So it's even more clever than what I said.
>> Yes, slightly.
>> Okay, so we're actually building a machine learning model that tracks the deltas over time.
>> Yeah. This is important because, for the same reason machine learning is important, you could do all these rule-based statistics and say things like, if my temperature goes outside the range of negative 100 to 100, that's probably an incorrect outdoor temperature. But that gets pretty cumbersome as you get more and more features and more and more complexity in the features.
>> So we use a machine learning model to abstract that away and come out with a simple 0-100 number that tells you how much your data has drifted.
>> I see. And if your data has drifted over a certain percentile, that's an indication it may be time to retrain your model.
>> Exactly.
>> That's awesome. So what are the other charts we have there?
>> Yeah. The first thing you want to do after you know that your data has drifted is try to figure out why. To the right here, we see the drift contribution by feature, which shows you exactly which features drifted the most. We can zoom in and see which ones caused drift. In this case, temperature is one of the largest, followed by WBAN, country or region, and wind angle.
>> What data is this, so I can get a little bit of context?
>> This is some weather data from 2019, from the NOAA ISD open dataset.
>> Okay.
>> I've filtered it down to only station names that contain the string "Florida", just to reduce the size of it and make it a little more interesting.
>> Well, because it's Florida.
>> Yeah.
>> Nice. So basically, what you're looking at is that the predictions now are a little more off because the temperature magnitude has changed, and that's what this chart is showing. Am I getting this right?
>> Yep. We can see here that the baseline is actually January 2019 weather data, whereas the target is the rest of 2019 through October, and I have this list of features. After you figure out which features have caused drift, you can dig into them more down below in the feature details. I'm seeing my temperature here, and we have a list of four metrics right now: Wasserstein distance, mean, min, and max value. I can see how these change over time, and on the right-hand side plot the distribution of the data against the baseline that was used. So if I click over here on the right, I'll see a bit of a bimodal distribution of my target dataset in red versus the baseline dataset in blue.
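The "drift model" described above is a variant of a well-known technique sometimes called a domain classifier: label baseline rows 0 and target rows 1, train a classifier to tell them apart, and read its accuracy as a drift score. Here is a minimal, self-contained sketch of that idea on synthetic temperature data; it is an illustration of the general technique, not the service's actual implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Baseline: winter temperatures; target: summer temperatures (shifted mean).
baseline = rng.normal(loc=15.0, scale=5.0, size=(500, 1))
target = rng.normal(loc=30.0, scale=5.0, size=(500, 1))

# Label baseline rows 0 and target rows 1, then train a classifier
# to distinguish them.
X = np.vstack([baseline, target])
y = np.concatenate([np.zeros(500), np.ones(500)])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)

# Map accuracy to a rough 0-100 drift score: 50% accuracy (chance level,
# datasets indistinguishable) maps to 0, 100% accuracy maps to 100.
acc = clf.score(X_test, y_test)
drift_pct = max(0.0, 2 * (acc - 0.5)) * 100
print(f"drift: {drift_pct:.0f}%")
```

Because the summer distribution barely overlaps the winter one, the classifier separates them easily and the drift score comes out high; with identical distributions it would hover near zero.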
>> If I click over on the left toward January, I'll see a much more closely aligned distribution of data.
>> I see. So basically, you're comparing the baseline with the target, and what we're seeing is that it was pretty accurate at the beginning of the year, and then it got consistently worse as we got toward spring and summer, because the temperature went up.
>> Exactly. So in this situation, I'm a bit of a naive data scientist: I trained only on my January weather data, and it doesn't perform well throughout the year. Similarly, if I click on a categorical feature, I can see the histogram of that feature over time. This one I find interesting, because naively I would expect Florida to only show up in the US, but it turns out there are actually four different countries in this data. It seems like a simple mistake, but this is something that can actually happen when you're doing your data pre-processing and making assumptions about your data that you want to check. That's actually why we see that bimodal distribution of temperature.
>> I see.
>> It's from Florida the state, and then other countries.
>> That's really interesting. So how do you run this? How do you get this to show up?
>> Yes, so we have a UI as well as the Python SDK. It's pretty simple to run through the UI, and the SDK is roughly the same, so I'll go through the UI very quickly. All you do is give it a name, choose a baseline dataset, which can be any tabular dataset that you have, and choose a target dataset, which does have to have that time series trait we discussed set. Then pick a frequency for this to run, daily, weekly, or monthly, and pick a list of features. It's important here to turn off features like an index, a day, or a year, things that naturally drift over time.
>> All right.
>> However, if you make a mistake, you can actually backfill the metrics after you create the monitor.
>> Okay.
>> After those basic settings, you can set up a monitor, which basically just sets up a pipeline for you in the background that analyzes each slice of the target dataset against the baseline. You can enter an e-mail address to receive alerts, set your threshold, and then optionally do a backfill of metrics. This is useful if a compute job fails, or if you update the feature list and want to backfill your metrics over time.
>> That's interesting. So how does it backfill data? What is it backfilling with?
>> Good question. It's backfilling, basically, whenever you update any of these settings.
>> I see.
>> For instance, if I were to come in here on an existing monitor and, for some reason, my country-or-region feature drifted a ton, but I know it doesn't matter to my model, I could turn that feature off and backfill my metrics without that feature.
>> I see, okay. So if you pick too many things, you can take things out.
>> Yeah, exactly. Actually, when I first ran this, I accidentally included an index column; the drift percentage immediately went to 100 percent, and I had to go backfill without it.
>> That's because the index is always out of range.
>> It's in the millions.
>> That's awesome. So the experiment it's running is basically like a time series analysis of what's going on with your data and where it's changed, and then it's alerting us when there's a problem.
>> Yeah, exactly.
>> Well, this is awesome. Where can people go to find out more about this, or how do they get started?
>> Yeah, they can go to our documentation at aka.ms/datadrift, or to the Azure Machine Learning studio and get started in there.
>> Fantastic! Anything else you'd like to add to close?
>> No.
>> That's awesome. Well, I'm particularly excited about this, because it's really hard to know, when you build these models, there's a little bit of witch-doctoring going on.
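The monitor setup and backfill walked through above have SDK equivalents. The sketch below uses the `azureml-datadrift` package; it is hedged: the workspace, compute target, dataset names, feature names, and e-mail address are all illustrative assumptions, and it only runs against a real Azure ML workspace.

```python
from datetime import datetime
from azureml.core import Workspace, Dataset
from azureml.datadrift import AlertConfiguration, DataDriftDetector

ws = Workspace.from_config()
baseline = Dataset.get_by_name(ws, "weather-jan-2019")  # e.g. the training data
target = Dataset.get_by_name(ws, "weather-2019")        # needs the time series trait

monitor = DataDriftDetector.create_from_datasets(
    ws, "weather-drift", baseline, target,
    compute_target="cpu-cluster",               # assumed existing AML compute
    frequency="Week",                           # or "Day" / "Month"
    feature_list=["temperature", "windAngle"],  # leave out indexes and dates
    drift_threshold=0.6,                        # alert above 60 percent
    alert_config=AlertConfiguration(["you@example.com"]))

monitor.enable_schedule()

# Recompute metrics over a past window, e.g. after editing the feature list:
monitor.backfill(datetime(2019, 1, 1), datetime(2019, 10, 1))
```

`backfill` is what the episode describes: it re-runs the drift computation over a historical window with the monitor's current settings, so dropping a noisy feature and backfilling rewrites the past metrics without it.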
>> You know that there are these weights in there, but sometimes you don't know why things are changing. This is going to go a long way toward helping us understand what's going on with our data and, further, what's going on with those models. Thank you for spending time with us.
>> Thank you, sir.
>> And thank you so much for watching and learning all about data drift inside of Azure Machine Learning. Thanks for watching, and we'll see you next time. Take care. [MUSIC]