Transcript:
[music] Hi guys! Welcome back to our channel, which focuses on applied data science projects. In this video I'd like to talk about Spark's MLlib. MLlib is a package that ships with Spark, and we're going to be using Spark version 3 in this video; you'll see how you can implement a quick model directly with MLlib.

I'm not going to go into why you should use MLlib specifically, because you might think MLlib is just another package, similar to scikit-learn and so on, but there are very specific reasons why you would want to use it. One is when you need the ability to scale: you want to leverage Spark for preprocessing and feature generation to cut down the time it takes to produce training and test sets from a very large amount of data. Another is when your input data or the model size becomes too big to fit on a single machine, so you want Spark to do the heavy lifting. These two use cases are very specific to Spark, and the more projects you work on, the more often you'll run into these kinds of issues, where Spark can really help.

So let's go right ahead and see how we can use Spark to create, in our case, a logistic regression model. First, the steps we'll go through in our implementation: we're going to load a synthetic financial dataset from Kaggle, check the data types, and based on those data types decide what feature preprocessing we need to do. In our case we'll only need one-hot encoding for a single feature. After that we can train the model, and then we'll check some statistics on it as well.

So let's start working on it. In this first cell I'm just creating a Spark session, as you can see, and importing types and functions. Let's check the version of Spark: we have version 3, which is the latest at this point. Now let's load the dataset. It's a synthetic financial dataset from Kaggle; you can check it out at this link, so go ahead and download it. What I've done is randomly sample 10% of the dataset so that it loads faster, but of course you can use the full dataset. We read the file and print the schema so we see what we're dealing with. We can see all the columns; isFraud is our target, and the rest are our features. Let's look at the data itself. I'm printing just two rows, and you can see we have type, amount, nameOrig, and so on, and isFraud is a binary target. For the sake of this tutorial it's easier to work with a smaller number of features, so I selected type, amount, the old balance, and the new balance, plus the target, isFraud. Looking at the new dataset, we now have just these features.
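Here's a minimal sketch of that setup, assuming Spark 3 and a local copy of the Kaggle CSV; the file name and app name are placeholders, and oldbalanceOrg/newbalanceOrig are the dataset's column names for the old and new balances:

```python
from pyspark.sql import SparkSession
from pyspark.sql import types as T, functions as F  # imported as in the video

spark = SparkSession.builder.appName("mllib-fraud-demo").getOrCreate()
print(spark.version)  # 3.x

# Read the (down-sampled) CSV copy of the Kaggle dataset.
df = spark.read.csv("paysim_sample.csv", header=True, inferSchema=True)
df.printSchema()
df.show(2)

# Keep a handful of features plus the binary target, isFraud.
df = df.select("type", "amount", "oldbalanceOrg", "newbalanceOrig", "isFraud")
df.show(2)
```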
Now when we look at the data, we can see that type is a categorical variable, so we're going to need some preprocessing there. Amount, the old balance, and the new balance are continuous, and they seem to be on the same scale, so we won't need any preprocessing for those. Let's see how we can leverage MLlib to do the heavy lifting, as if this were a very big dataset.

The first thing we need to do is split the data into a train and a test set. Because this dataset doesn't have any ordering in it, we can just split it randomly. If we had some sort of time series, a random split wouldn't work and we'd need to rank and then split the data a little differently, but since there's no time dimension here, we can use a random split: 70 percent will be the train set and 30 percent the test set. I'm also setting a seed so we can reproduce the same train and test sets. Checking the counts, we have about 445,000 records in the train set and almost 200,000 in the test set, and the train set has exactly the same schema as our original DataFrame.

What we need to do now, and we already observed this when we looked at the data, is check the data types, because any string column is treated as a categorical variable, but sometimes we have numeric features that we want treated as categorical, and vice versa. For us it's pretty straightforward: we only have one string column, and all the other columns are continuous. But this is something you have to take into account, because numeric features can actually be categorical, so it's not only string columns that are categorical; we need to carefully identify which columns are numeric and which are categorical. Checking the dtypes for our train set, we can see that amount, the old balance, and the new balance are doubles, which is fine, and type is a string. So we select the categorical and the numerical columns with two list comprehensions: for the categorical columns we check whether the data type is string, and for the numerical columns we check for double while excluding isFraud, because we only want to include the features used as predictors. Running this, our numerical columns are amount, the old balance, and the new balance, and our categorical columns consist of just type.

For this reason we're going to use a Spark transformer, namely one-hot encoding. A high-level concept in MLlib is that you use transformers to convert raw data in a certain way, and the one-hot encoder is one transformer we can use to turn our columns into numerical representations the machine learning model can understand. One-hot encoding requires indexing first: we use a StringIndexer, which converts a single feature into an index feature.
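A sketch of the split and the dtype-based column selection, continuing from the snippet above; the seed value here is an assumption (the video sets one without showing it), and the counts assume the 10% sample:

```python
# 70/30 split with a fixed seed for reproducibility.
train, test = df.randomSplit([0.7, 0.3], seed=7)
print(train.count(), test.count())  # ~445,000 / ~190,000 on the 10% sample

print(train.dtypes)  # [('type', 'string'), ('amount', 'double'), ...]

# String columns are the categorical features; double columns (minus
# the target) are the numeric predictors.
catCols = [c for (c, dtype) in train.dtypes if dtype == "string"]
numCols = [c for (c, dtype) in train.dtypes if dtype == "double" and c != "isFraud"]
print(catCols)  # ['type']
print(numCols)  # ['amount', 'oldbalanceOrg', 'newbalanceOrig']
```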
Normally, if your categories had some natural ordering, you could just use the StringIndexer on its own: you'd transform the feature into an index and already have it in numerical form. But string indexing adds an ordering, and with machine learning models you don't want to assume a specific order where there is none. That's what one-hot encoding helps with: it transforms those indexes into vectors of zeros and ones, rather than an ordered numerical list. You can find out more about one-hot encoding, and about the StringIndexer that converts a feature into an index feature, at these links. I'm sure you already know about one-hot encoding, but I wanted to make it clear again why we first use a StringIndexer and then a one-hot encoder.

Let's go ahead and count the distinct values of our type feature. It's categorical with five categories, and if we check their names, we see TRANSFER, CASH_IN, CASH_OUT, PAYMENT, and DEBIT, along with the count for each type. You can use this aggregation and grouping to check the value counts; it's easier in pandas, where you can just call value_counts, but here you have to group by, then count, then show.

So first we import OneHotEncoder and StringIndexer from pyspark.ml.feature. For the StringIndexer we build a list comprehension over the categorical columns, and if we run it, we get just one StringIndexer, because we only have one categorical column. We do the same for the one-hot encoding: a list comprehension that one-hot encodes each string-indexed column in the categorical columns and outputs it to a new column named after the original column plus a one-hot-encoder suffix. Checking this as well, we have just one OneHotEncoder, because type is the only column in our categorical columns.
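Roughly, in code; the output-column suffixes are assumptions, and any consistent naming works (Spark 3's OneHotEncoder accepts single inputCol/outputCol arguments):

```python
from pyspark.ml.feature import OneHotEncoder, StringIndexer

# Value counts for the categorical column: groupBy + count is Spark's
# analogue of pandas' value_counts.
print(train.select("type").distinct().count())  # 5 categories
train.groupBy("type").count().show()

# One StringIndexer per categorical column, and a matching OneHotEncoder
# that consumes the indexed column.
string_indexer = [
    StringIndexer(inputCol=c, outputCol=c + "_StringIndexer") for c in catCols
]
one_hot_encoder = [
    OneHotEncoder(inputCol=c + "_StringIndexer", outputCol=c + "_OneHotEncoder")
    for c in catCols
]
```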
The next thing we need to do is assemble these columns into a single vector the machine learning model can use, because in Spark you need to combine all the features you want as predictors into one column of type vector. This is the case for most classification and regression algorithms: you want the label in a column of type double and the features in a column of type vector, and VectorAssembler is what transforms the features into that vector-type column. So we import VectorAssembler and set up the assembler inputs: first the numerical columns, which we can add as they are, because they're already in the same range and don't need standard scaling or anything like that; and for type we add the one-hot encoded version of the column. Checking the assembler input, we have amount, the old balance, and the new balance, and then not type itself but its one-hot encoded version. Now we can assemble the vector by providing the input columns and an output column, and we get a features column consisting of all of those inputs as a single vector type, as required for any classification or regression problem in Spark.

Now that we have the vector assembler, we can use it to build a pipeline. The Pipeline is another high-level concept in Spark that helps you run a full preprocessing pipeline, and model training as well, and it's quite similar to scikit-learn's Pipeline, so the concept should be easy to grasp. All of these preprocessing and model-training steps can be passed as stages of the pipeline, and then all you need to do is fit the pipeline to train everything. In our case I'm only going to use the pipeline for the preprocessing part, so I add the stages: first the string indexers, then the one-hot encoders, and then the vector assembler, which needs to be added as a list. Checking the stages, we have three: first the string indexing on the columns we specified (just the categorical columns), then the one-hot encoding in the same way, and then the vector assembly per our specification.

Now we just import Pipeline, set the stages, fit the pipeline on the train set, and transform the test set. We fit on the train set and only transform the test set because the test set is supposed to be unknown to us; that's why we have a test set in the first place. If we fit on the full DataFrame, we'd be fitting on data we're not supposed to know about; the test set exists specifically so we can check whether the model performs the way we want on unseen data. So let's do this: we fit on the train set, transform the test set, and then select a few columns to look at the result. You can see we have type, amount, the old balance and the new balance, and then the assembled features vector we were talking about, consisting of type, amount, and the two balance columns.
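A sketch of the assembly and the pipeline, continuing from the snippets above (the output column name is an assumption; the train set is also transformed here, since the model training that follows needs it):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler

# Raw numeric columns plus the one-hot encoded categorical column(s).
assembler_input = numCols + [c + "_OneHotEncoder" for c in catCols]
vector_assembler = VectorAssembler(
    inputCols=assembler_input, outputCol="VectorAssembler_features"
)

# Indexing -> encoding -> assembly, run as pipeline stages.
stages = string_indexer + one_hot_encoder + [vector_assembler]
pipeline = Pipeline(stages=stages)

preproc = pipeline.fit(train)       # fit the preprocessing on train only
pp_train = preproc.transform(train)
pp_test = preproc.transform(test)   # apply the same transforms to unseen data

pp_test.select("type", "amount", "VectorAssembler_features").show(2, truncate=False)
```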
As you can see, type is now one-hot encoded: it's represented by a vector of zeros and ones, with the one positioned according to the category. So this is how you encode your features and create a single vector out of the features you want to pass into your machine learning model. The next step is to train a logistic regression on the assembled features.

But before we do that, I'd like to thank you for visiting our channel and for subscribing. We make videos about applied data science projects that you don't really find in many other places; we try to give you a practical way to learn data science rather than just going through all the mathematical concepts, because that makes it quite hard to get started with machine learning. Our goal here is to help you learn machine learning by actually doing, not just by learning concepts, but by working on real-world problems. So thank you again for visiting our channel, don't forget to subscribe, and stay up to date with what we post.

Okay, let's go back and train a logistic regression model on our data. We import LogisticRegression from pyspark.ml.classification. For our data we take the assembled features column and rename it to features, and we use the isFraud column as the label. By default, Spark's classifiers look for a vector column called features and a double column called label, which is why we rename our columns accordingly. If we check our data, we have features, which is exactly as before, a single vector formed out of the features we want in our model, and then the label to be predicted. We fit this data to the logistic regression model, and then we can check the model summary: the area under the ROC curve, and the precision and recall, by accessing these metrics from the model summary (there's a short code sketch of this step at the end).

That's all there is to it. It's pretty straightforward, and I really hope this gave you first-hand experience of how easy it is to use MLlib. When you think about Spark, you might think it's very complicated, but when you see it used in a practical way and you understand how to do the feature engineering and feature preprocessing, it's pretty straightforward. A lot of people find it difficult because of the feature-preprocessing aspect of working with Spark, because you need to transform your features into a single vector formed out of the original features. But once you get the hang of it, and you can always revisit this video, you'll find it easier and easier to think in those terms, and you'll be able to use Spark to its full potential. So I really hope you enjoyed this video, and I'll see you in the next one. [music]
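For reference, a minimal sketch of that training step, assuming the column names from the earlier snippets (Spark's LogisticRegression defaults to featuresCol="features" and labelCol="label", hence the renames):

```python
from pyspark.sql import functions as F
from pyspark.ml.classification import LogisticRegression

# Rename the assembled vector to "features" and cast the target to a
# double "label" column, matching Spark's defaults.
data = pp_train.select(
    F.col("VectorAssembler_features").alias("features"),
    F.col("isFraud").cast("double").alias("label"),
)

model = LogisticRegression().fit(data)

# Training summary: area under the ROC curve plus the precision/recall curve.
print(model.summary.areaUnderROC)
model.summary.pr.show()  # precision-recall pairs as a DataFrame
```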