Transcript:

[MUSIC] Machine learning has gone through many recent developments and is becoming more popular day by day. People from all domains, including computer science, mathematics and management, are using machine learning in various projects to find hidden information in data. It was just a matter of time before Spark jumped into the game of machine learning with Python. So guys, this is [name unclear] from Edureka, and the topic for today’s discussion is PySpark MLlib. So let’s have a quick look at today’s agenda. I’ll start off this video by giving you a brief introduction to machine learning with its various implementations in the industry. Next, I’ll discuss the three major domains of machine learning, which are supervised, unsupervised and reinforcement learning. Next, I’ll discuss how MLlib plays an important role in the Spark environment, and I’ll finish off this video with a demo on PySpark MLlib. So let’s begin.
Now, what exactly is machine learning? Machine learning is a method of data analysis that automates analytical model building. Using algorithms that iteratively learn from data, machine learning allows computers to find hidden insights without being explicitly programmed where to look. It focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data. Machine learning uses the data to detect patterns in the data set and adjusts program actions accordingly. Most industries working with large amounts of data have recognized the value of machine learning technology. By gleaning insights from this data, often in real time, organizations are able to work more efficiently or gain an advantage over competitors. Now, let’s have a look at the various industries where machine learning has been used.
Government agencies, such as public safety and utilities, have a particular need for machine learning. They use it for face detection, security and fraud detection. Now, marketing and sales: websites recommending items you might like based on previous purchases use machine learning. They use it to analyze your buying history and promote other items you’d be interested in. Analyzing data to identify patterns and trends is key to the transportation industry, which relies on making routes more efficient and predicting potential problems to increase profitability. Now, coming to financial services: banks and other businesses in the financial industry use machine learning technology for two key purposes. The first one is to identify important insights in data, and the second one is to prevent fraud. Now, coming to healthcare: machine learning is a fast-growing trend in the healthcare industry, thanks to the advent of wearable devices and sensors that can use data to assess a patient’s health in real time. And finally, biometrics: establishing the identity of an individual based on the physical, chemical or behavioral attributes of the person is one of the major uses of machine learning in the biometrics area.
Now let’s have a look at a typical machine learning lifecycle. Any machine learning lifecycle is divided into two phases. The first one is training and the second one is testing. For training, we use 70 to 80 percent of the data, and the remaining data is used for testing purposes. So first of all, we train on the data using a particular algorithm, and using that algorithm, we produce a model. After we have produced our model, the remaining 20 to 30 percent of the data is used for testing. We pass this data to the model and find out the accuracy of that model with certain tests.
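
The training and testing phases described above can be sketched in a few lines. This is a minimal illustration in plain Python rather than PySpark, so it runs without a cluster; the toy data, the threshold "model" and the helper names `train_test_split` and `accuracy` are assumptions made for the example, not something shown in the video.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(model, labeled_rows):
    """Fraction of (feature, label) rows that the model predicts correctly."""
    correct = sum(1 for x, y in labeled_rows if model(x) == y)
    return correct / len(labeled_rows)

# Hypothetical toy data: the label is 1 when the single feature exceeds 50.
data = [(x, 1 if x > 50 else 0) for x in range(100)]

# Phase 1: training on 70 percent of the data.
train, test = train_test_split(data, train_fraction=0.7)
class0 = [x for x, y in train if y == 0]
class1 = [x for x, y in train if y == 1]
# The "model" is a threshold halfway between the two class means.
threshold = (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2

def model(x):
    return 1 if x > threshold else 0

# Phase 2: testing on the remaining 30 percent.
test_accuracy = accuracy(model, test)
print(len(train), len(test), test_accuracy)
```

The important design point is that the test rows never influence the threshold, so the reported accuracy reflects how the model behaves on data it has not seen.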
This is what a typical machine learning lifecycle looks like. Now, there are three major categories of machine learning, as I mentioned earlier, which are supervised, reinforcement and unsupervised learning. So let’s understand these terms in detail, starting with supervised learning. Supervised learning algorithms are trained using labeled examples, that is, inputs where the desired output is known. The learning algorithm receives a set of inputs along with the corresponding correct outputs, and the algorithm learns by comparing its actual output with the correct output to find errors. It then modifies the model accordingly. Through methods like classification, regression, prediction and gradient boosting, supervised learning uses patterns to predict the values of the label on additional unlabeled data. It is called supervised learning because the process of an algorithm learning from the training data set can be thought of as a teacher supervising the learning process.
Now, supervised learning is majorly divided into two categories, namely classification and regression algorithms. Regression is the problem of estimating or predicting a continuous quantity. What will be the value of the S&P 500 one month from today? How tall will a child be as an adult? How many of the customers will leave for a competitor this year? These are examples of questions that fall under the umbrella of regression. Now, coming to classification: classification deals with assigning observations into discrete categories rather than estimating continuous quantities. In the simplest case, there are two possible categories. This case is known as binary classification. Many important questions can be framed in terms of binary classification. Will a given customer leave us for a competitor? Does a given patient have cancer? Does a given image contain a dog or not? Now, classification mainly consists of classification trees, support vector machines and random forest algorithms.
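
To make the difference between regression and binary classification concrete, here is a small sketch in plain Python. It is only an illustration under simple assumptions: the data points are made up, `fit_line` is an ordinary least-squares fit of one feature, and `fit_threshold` is a deliberately simple stand-in for a real classifier.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (regression)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

def fit_threshold(values, labels):
    """Midpoint between the two class means (binary classification)."""
    class0 = [v for v, y in zip(values, labels) if y == 0]
    class1 = [v for v, y in zip(values, labels) if y == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2

# Regression: the answer is a continuous quantity.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
predicted_value = slope * 5 + intercept

# Classification: the answer is one of two discrete categories.
threshold = fit_threshold([1, 2, 3, 7, 8, 9], [0, 0, 0, 1, 1, 1])

def classify(value):
    return 1 if value > threshold else 0

print(predicted_value, classify(6), classify(4))
```

The regression model returns a number on a continuous scale, while the classifier can only ever answer 0 or 1.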
Regression consists of linear regression, decision trees, Bayesian networks and fuzzy classification. Now, there are other algorithms, like artificial neural networks and gradient boosting, which also come under supervised learning algorithms.
Next, we have reinforcement learning. Reinforcement learning is learning how to map situations to actions so as to maximize a reward, and it is often used for robotics, gaming and navigation. With reinforcement learning, the algorithm discovers through trial and error which actions yield the greatest rewards. The feedback tells the algorithm whether the answer is correct or not, but does not tell it how to improve. The agent is the learner or decision-maker, whose job is to choose actions that maximize the expected reward over a given amount of time. Actions are what the agent can do, and the environment is everything the agent interacts with. The algorithm, whose ultimate goal is to acquire as much numerical reward as possible, gets penalized each time its opponent scores a point and gets rewarded each time it manages to score a point against the opponent. It uses this feedback to update its policy, and gradually it filters out all the actions that lead to a penalty. Reinforcement learning is useful in cases where the solution space is enormous or infinite, and it typically applies in cases where the machine learning system can be thought of as an agent interacting with its environment. Now, there are many reinforcement learning algorithms. A few of them are Q-learning; SARSA, which is state-action-reward-state-action; the Deep Q-Network; the deep deterministic policy gradient, which is DDPG; and finally TRPO, which is trust region policy optimization.
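
As a rough illustration of the agent, actions, environment and reward described above, here is a tiny tabular Q-learning sketch in plain Python. Everything in it is an assumption made for the example: a four-state corridor, a reward of 1 for reaching the last state, and the constants `ALPHA` and `GAMMA`. To keep the run deterministic, it sweeps every state-action pair instead of exploring by random trial and error, which Q-learning permits because its update does not depend on how the actions were chosen.

```python
N_STATES = 4        # states 0..3; state 3 is the terminal goal
ACTIONS = (0, 1)    # 0 = move left, 1 = move right
ALPHA = 0.5         # learning rate
GAMMA = 0.9         # discount factor for future rewards

def step(state, action):
    """Environment: return (next_state, reward) for an action."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

# Q-table: estimated long-term reward of each action in each state.
q = [[0.0, 0.0] for _ in range(N_STATES)]

for _ in range(200):
    for state in range(N_STATES - 1):
        for action in ACTIONS:
            next_state, reward = step(state, action)
            # No future reward is available once the goal is reached.
            best_next = 0.0 if next_state == N_STATES - 1 else max(q[next_state])
            # Q-learning update: move the estimate toward the observed target.
            q[state][action] += ALPHA * (reward + GAMMA * best_next - q[state][action])

# The learned policy picks the highest-valued action in each state.
policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

In this toy corridor the learned policy moves right in every state, since that is the only direction that ever leads to the reward.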
The last category of machine learning is unsupervised learning. As I mentioned earlier, supervised learning tasks find patterns where we have a data set of the right answers to learn from, whereas unsupervised learning tasks find patterns where we do not. This may be because the right answers are unobservable or infeasible to obtain, or maybe for a given problem there isn’t even a right answer per se. A large subclass of unsupervised tasks is the problem of clustering. Clustering refers to grouping observations together in such a way that members of a common group are similar to each other and different from members of other groups. A common application here is in marketing, where we wish to identify segments of customers or prospects with similar preferences or buying habits. A major challenge in clustering is that it is often difficult or impossible to know how many clusters should exist or how the clusters should look. Unsupervised learning is used against data that has no historical labels. The system is not told the right answer. The algorithm must figure out what is being shown. The goal is to explore the data and find some structure within it. Unsupervised learning works well on transactional data, and these algorithms are also used to segment text topics, recommend items and identify data outliers. Now, there are majorly two classifications of unsupervised learning. One is clustering, as I discussed earlier, and the other one is dimensionality reduction, which includes topics like principal component analysis, tensor decomposition, multidimensional statistics and random projection.
So now that we have understood what machine learning is and what its various types are, let’s have a look at the various components of the Spark ecosystem and understand how machine learning plays an important role here. Now, as you can see here, we have a component named MLlib. PySpark MLlib is a machine learning library.
It is a wrapper over PySpark core to do analysis using machine learning algorithms. It works on distributed systems and is scalable, and we can find implementations of classification, clustering, linear regression and other machine learning algorithms in PySpark MLlib. We know that PySpark is good for iterative algorithms. Using iterative algorithms, many machine learning algorithms have been implemented in PySpark MLlib. Apart from PySpark’s efficiency and scalability, the PySpark MLlib APIs are very user-friendly. Software libraries, which are designed to provide solutions for various problems, come with their own data structures. These data structures are provided to solve a specific set of problems efficiently. PySpark MLlib comes with many data structures, including dense vectors, sparse vectors, and local and distributed matrices. The major MLlib algorithms include clustering, frequent pattern matching, linear algebra, collaborative filtering, classification and, finally, linear regression.
Now let’s see how we can leverage MLlib to solve a few problems. So let me explain this use case to you. A system was hacked, but we have found the metadata of each session that the hackers used to connect to the servers. This includes features like the session connect time, the bytes transferred, whether Kali trace was used, the servers corrupted, the pages corrupted, the location, and the WPM typing speed. Now, there are three potential hackers: two confirmed hackers and one not yet confirmed. The forensic engineers know that the hackers trade off attacks, meaning they should each have roughly the same number of attacks. For example, if there were 100 attacks, then in a two-hacker situation each would have 50 attacks, and in a three-hacker situation each would have about 33 attacks.
So here we are going to use clustering, so let’s see how we can use clustering to find out how many hackers were involved. Today I am going to use the Jupyter notebook to do all my programming. Let me just open a new Python 2 Jupyter notebook. So first of all, what we are going to do is import all the required libraries and initiate the Spark session. Next, what we are going to do is read the data using the spark.read method. Here we are doing spark.read.csv, as the data set is in CSV format, and we have given header and inferSchema as true. Now, the default location it takes when we do spark.read is HDFS, so in order to change the default location to your local file system, you need to provide file:// and then the absolute path of the data file which we are going to read. Now, let’s have a look at the first record of the data frame and also the summary of the data set. To get a summary of the data set, we use the describe function. The output of this one is very haphazard, so if we want to have a look at the names of the columns which we have here, we just need to use dataset.columns. So as you can see, we have the session connect time, the bytes transferred, Kali trace used, servers corrupted, pages corrupted, the location and the WPM typing speed. WPM stands for words per minute.
Next, what we are going to do is import Vectors and VectorAssembler. These are machine learning libraries which we are going to use. What VectorAssembler does is take a set of columns and define a single features column from them. So our features consist of the session time, the bytes transferred, Kali trace used, the servers corrupted, the pages corrupted and the WPM typing speed.
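
The two steps just described, reading a CSV file with a header while inferring column types and then packing the chosen columns into a single features vector, can be mimicked in plain Python. This is only a sketch of the idea, not the PySpark API: the sample rows are made up, the column names are an assumed rendering of the ones read out in the video, and `infer`, `read_rows` and `assemble` are hypothetical helpers.

```python
import csv
import io

# Made-up sample rows; the real data file is not reproduced here.
CSV_TEXT = """session_connection_time,bytes_transferred,kali_trace_used,servers_corrupted,pages_corrupted,location,wpm_typing_speed
10.0,500.0,1,2.0,7.0,Exampleland,70.0
20.0,700.0,0,3.0,9.0,Sampleville,65.0
"""

def infer(value):
    """Mimic schema inference: try int, then float, else keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

def read_rows(text):
    """Read CSV text whose first line is the header, inferring column types."""
    reader = csv.DictReader(io.StringIO(text))
    return [{name: infer(value) for name, value in row.items()} for row in reader]

def assemble(rows, input_cols, output_col="features"):
    """Add one column holding the selected inputs as a list of floats."""
    assembled = []
    for row in rows:
        new_row = dict(row)
        new_row[output_col] = [float(row[col]) for col in input_cols]
        assembled.append(new_row)
    return assembled

# The location is text, so it is left out of the numeric feature vector.
FEATURE_COLS = [
    "session_connection_time", "bytes_transferred", "kali_trace_used",
    "servers_corrupted", "pages_corrupted", "wpm_typing_speed",
]

rows = read_rows(CSV_TEXT)
final_data = assemble(rows, FEATURE_COLS)
print(final_data[0]["features"])
```

In PySpark the same roles are played by `spark.read.csv(..., header=True, inferSchema=True)` and `VectorAssembler(inputCols=..., outputCol="features")`, which work on distributed data frames instead of lists of dictionaries.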
One thing to note is that the feature selection is up to us, so if our model is not producing the right output, we can change the features accordingly to get the desired output. Now, I’ve created a vec_assembler, which will take all the attributes defined above and, based on those, provide us with the features column. Next, what we are going to do is make our final data: we’ll use the vector assembler and transform the data set which we have.
Next, what we’ll do is import the StandardScaler library. Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. The mean and standard deviation are then stored to be used on later data using the transform method. Standardization of the data set is a common requirement for many machine learning estimators. They might behave badly if the individual features do not more or less look like standard normally distributed data. Now let’s compute the summary statistics by fitting the standard scaler, and then let’s normalize each feature to have a unit standard deviation.
Finally, now that we have the cluster final data, it’s time for us to find out whether there were two or three hackers. For that, we are going to use k-means here, so I’ve created k-means three and k-means two. K-means three will have the features column, which is the scaled features, with the k value as 3, and k-means two will have the same scaled features column with the k value as 2. Now, what we’ll do is create models for both of these k-means three and k-means two variables. We are going to fit them to the cluster final data. Now, WSSSE stands for within set sum of squared errors. So let’s have a look at its values for the model which has k equals 3, that is, three clusters, and for the model which has k equals 2.
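
The scaling and clustering steps above can be sketched without Spark. The following plain-Python example is an illustration under stated assumptions, not the notebook from the video: the eight points are made up, the scaling divides by the sample standard deviation without centering (matching the narration’s "unit standard deviation"), and the k-means routine uses a simple deterministic initialization rather than Spark’s own initialization strategy.

```python
import math

def standardize(points):
    """Scale each feature to unit sample standard deviation (no centering)."""
    n, dims = len(points), len(points[0])
    stds = []
    for d in range(dims):
        column = [p[d] for p in points]
        mean = sum(column) / n
        stds.append(math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1)))
    return [[p[d] / stds[d] for d in range(dims)] for p in points]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]

def kmeans(points, k, iterations=20):
    # Deterministic start: k evenly spaced points become the first centroids.
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    for _ in range(iterations):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign(points, centroids)

def wssse(points, centroids, labels):
    """Within set sum of squared errors: squared distance to own centroid."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

# Made-up data: two tight groups of four points each.
raw = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
scaled = standardize(raw)

results = {}
for k in (2, 3):
    centroids, labels = kmeans(scaled, k)
    counts = [labels.count(c) for c in range(k)]  # like grouping by prediction
    results[k] = (wssse(scaled, centroids, labels), counts)
    print(k, results[k])
```

On this toy data the error is lower for k equals 3 than for k equals 2, yet only k equals 2 gives evenly sized clusters (4 and 4 versus 2, 2 and 4). That is the same pattern the walkthrough relies on: a lower WSSSE alone does not settle the number of clusters, so the cluster counts are checked as well.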
For k equals 3, the within set sum of squared errors is 434, and for k equals 2, it is 601. Now let’s have a look at the values of k from 2 to 9 and at the corresponding within set sum of squared errors. As you can see, the values are getting lower and lower, so this measure alone does not tell us how many hackers there were. As you can see, for k equals 8 it is 198. Now, the last key fact the engineer mentioned was that the attacks should be evenly divided between the hackers. Let’s check this with the transform method and the prediction column, grouping by prediction. As you can see, if we have the prediction for three hackers, the counts are 167, 79 and 88, which is not evenly distributed. So let’s have a look at the data for k equals 2, with the model for k equals 2, and do the prediction. So guys, as you can see here, the count is evenly distributed. This means that only two hackers were involved. The clustering algorithm created two equal-sized clusters with k equals 2, the count being 167 for each one of them. So this is one way through which we can find out how many hackers were involved using k-means clustering.
So let’s move forward with our second use case, which is customer churn prediction. Now, customer churn prediction is big business. It minimizes customer defection by predicting which customers are likely to cancel a subscription to a service. Though originally used within the telecommunications industry, it has become common practice across banks, ISPs, insurance firms and other verticals. The prediction process is heavily data-driven and often utilizes advanced machine learning techniques. In this post, we’ll take a look at what types of customer data are typically used, do some preliminary analysis of the data, and generate churn prediction models, all with PySpark and its machine learning framework. So let’s have a look at the story of this use case.
Our marketing agency has many customers that use its service to produce ads for their clients and customers. They’ve noticed that they have quite a bit of churn among their clients. They basically assign account managers randomly right now, but they want you to create a machine learning model that will help predict which customers will churn, so that they can correctly assign an account manager to the customers most at risk of churning. Luckily, they have some historical data. So can you help them out? Do not worry, I’ll show you how to help them. We need to create a classification algorithm here that will help classify whether or not a customer churned. Then the company can test this against the incoming data for future customers to predict which customers will churn and assign them an account manager.
So let’s first import the libraries which we need. Here we are going to use logistic regression to solve this problem. The data is saved as customer_churn.csv, so we’ll use the spark.read method here to read the historical data, and then we’ll have a look at the schema of the data and understand what exactly we are dealing with. To understand the schema of any particular data frame, we use the printSchema method. So as you can see here, guys, we have the names, the age, the total purchase, the account manager, the years, the number of sites, the onboard date, the location, the company and the churn. So let’s have a look at the data. As you can see here, we have data for 900 customers. I’ve used the count method to get the exact number of rows and see how much data we are dealing with. Now, let’s load the test data as well and have a look at the schema of this data. As you can see, the test data is in the same format as the training data. Next, what we are going to do is import the VectorAssembler library.
Now, since I have already imported the VectorAssembler library, as you can see earlier in this notebook where we did from pyspark.ml.feature import VectorAssembler, I’m not going to import it again. Firstly, we must transform our data using the VectorAssembler function to get a single column where each row of the data frame contains a feature vector. This is a requirement for the regression API in MLlib. So as you can see here, I’m using the age, the total purchase, the account manager, the years and the number of sites. I must say, this choice depends on the user who is creating the model. Say this model is not giving us the output that we want or require; then we’ll change the input columns. Now, what we are doing here is creating an output_data, which is the data frame that contains the input data after it has been converted, using all these input columns or features, to have a single output column named features. Let’s have a look at the schema of this new output data. So guys, as you can see here, all the columns are the same, except that at the end we have an additional features column. It’s a vector, so this will help us in the prediction of the customer churn. Now, in order to see what we are dealing with here, let’s take a look at the first element of the output. As you can see, at the end we have the features, which is a dense vector containing all five values: 42, which is the age; 1166, which is the total purchase; 0.0, which is the account manager; 7.2, which is the years; and the number of sites, which is 8. So now what we are going to do is create the final data. We’ll use this output data, which we have from the vector assembler, and we’ll select only the features and the churn.
So if you have a look at the final data, as you can see here, we have only two columns, which are the features and the churn. Now what we are going to do is split our data into training and testing data. For that we are going to use the randomSplit method, and we are dividing the data in the ratio of 70 to 30. Next, we create our logistic regression model, and we are going to use the churn column as the label. So now let’s fit the model using our training data, which we have just created from our final data, and let’s have a look at the summary of the model which we just created. As you can see here, we have the churn and the prediction, with the mean, the standard deviation, the minimum and the maximum values. Now that we have created our model, let’s use it to get the value of the evaluator on the raw prediction data. For that, we first need to import the binary classification evaluator. We are creating a data frame, predictions, in which we’ll pass the test data through the model, and we’ll evaluate it using the binary classification evaluator. Now, when we have a look at the output data, we can see on the left-hand side that we have the features, then the churn, then the raw prediction according to our model, then the probability and finally the prediction. So let’s use the evaluator, which is the binary classification evaluator. It will take the prediction column and the label column, churn, and tell us how accurate our model is. As you can see, it is 77 percent accurate.
Earlier, I loaded the test data. I’ll show you here: you can see we have new_customers.csv. So we took the original data and split it in a 70 to 30 ratio. Then we created a model, trained it using our training data and tested it using the testing data. Now we’ll use the incoming new data, which is the new customers, and see whether our model fits or not. Again, we are going to use the vector assembler here.
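
The modelling steps above, fitting a logistic regression on training rows and then scoring held-out rows, can be sketched in plain Python. This is a minimal illustration under assumptions: the training and test rows are made up, a single feature already scaled to the range 0 to 1 stands in for the real feature vector, and `train_logistic` is a hand-rolled gradient descent rather than Spark’s solver. One further note, to the best of my knowledge: Spark’s `BinaryClassificationEvaluator` reports the area under the ROC curve by default, so the 77 percent quoted above is most likely that metric rather than plain accuracy. The sketch therefore computes the area under the ROC curve from the raw scores.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, learning_rate=1.0, iterations=2000):
    """Fit weight w and bias b by gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b

def area_under_roc(scores, labels):
    """Chance that a random positive row outscores a random negative row."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up training rows: one scaled feature and a churn label (1 = churned).
train_x = [0.0, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 1.0]
train_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
# Made-up held-out rows, never seen during training.
test_x = [0.15, 0.35, 0.65, 0.85]
test_y = [0, 0, 1, 1]

w, b = train_logistic(train_x, train_y)
raw_scores = [w * x + b for x in test_x]            # like the raw prediction
probabilities = [sigmoid(z) for z in raw_scores]    # like the probability
predictions = [1 if p > 0.5 else 0 for p in probabilities]
auc = area_under_roc(raw_scores, test_y)
print(w, b, predictions, auc)
```

The toy data is perfectly separable, so the area under the curve comes out as 1.0 here; real churn data is noisier, which is why a figure such as 77 percent is more typical.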
We have created a data frame, results, in which we take the logistic regression model and transform the new test data with it. This new test data also contains the features column, because we used the vector assembler just before that. So if you have a look at the results, it’s very haphazard here. I’ll show you another format. Hang on a second. What we are going to do is select the company and the prediction, just to see how it’s working. So as you can see, the Cannon-Benson prediction is true, the Barron-Robertson prediction is true, Sexton-Golden is also true, and Parks-Robbins also. So guys, as you can see, our model was 77 percent accurate. Now we can play around with the features column to see if our model produces a more accurate output. In our case, if we are satisfied with the model’s predictions being true 77 percent of the time, that is fine, but then again, we can change it according to our preferences.
So guys, this is it for PySpark MLlib. I hope you understood a lot about PySpark, how machine learning works in the industry and where it is being used. We covered supervised, unsupervised and reinforcement learning, and I hope you got how to do programming using the MLlib library of Spark. Thank you so much! I hope you have enjoyed listening to this video. Please be kind enough to like it, and you can comment with any of your doubts and queries, and we will reply to them at the earliest. Do look out for more videos in our playlist, and subscribe to the Edureka channel to learn more. Happy learning.