Transcript:
Hi, my name is Connor Murphy and I'm a data scientist at Databricks. In this video series, we'll be walking through an end-to-end machine learning pipeline using Apache Spark, and we'll be doing this on Databricks, a unified analytics platform that enables data science and data engineering teams to run all analytics in one place, from ETL to model training and inference at scale. In the rest of this video, we'll walk through machine learning at a high level, then we'll cover why we might want to use Spark for machine learning, and finally we'll get started on Spark: we'll get started in a Databricks environment, start a cluster, and run a little bit of code.

Let's start with what machine learning is. At a very high level, machine learning just refers to an ensemble of different tools that we use to learn something from our data. Under the hood we're using a lot of calculus and linear algebra, but at the end of the day, machine learning allows us to take a step beyond summary statistics and basic analytics, learn a lot more from our data, and make it actionable as well. More technically, machine learning learns from data without being explicitly programmed. If we think about the history of engineering, we often associate coding with a number of conditional if-else statements and a lot of hard-coded logic. What machine learning allows us to do is abstract away a lot of those difficulties and use calculus and linear algebra to fill in the gaps and learn from our data without coders having to write all of those conditional statements by hand.

Use cases for machine learning run the gamut. They include fraud detection, where we might be detecting fraudulent users on a website. We can A/B test websites and figure out the optimal version to show our users. We can do natural language processing, using algorithms to understand natural language. We can do image recognition for self-driving cars. We can also do financial forecasting and even churn analysis, where we try to predict which customers are going to leave and when.

Now, why would we want to use Spark for machine learning? There are a number of different reasons, but first and foremost, Spark comes into play when we need to scale. Spark solves the big data problem by using a networked cluster of machines: rather than relying on one single machine, we're able to operate on more data than can fit on any one computer. This allows us to scale, theoretically without limit, up to gigabytes and even terabytes and petabytes worth of data. Generally speaking, the more data we can feed our models, the better they perform. If you take a look at this graph, we see that as we add more and more data to a number of different models, they continue to perform better and better, and the x-axis is on a logarithmic scale, so even up to one billion words we're still seeing an improvement in our models. So generally speaking, if I had to choose between a highly optimized model or more data, I would generally pick more data over finely tuning a model.

In addition to the question of scale, Spark works really well with a number of different pipelines. First and foremost, if you're already using Spark, whether that's for ETL, streaming, ad hoc analysis, or anything else, Spark machine learning is going to play really nicely with that framework. Additionally, Spark works well with scikit-learn, TensorFlow, and Horovod, which provides distributed TensorFlow.
Spark also works with a number of different languages, including Python, SQL, R, Scala, and Java, so depending on the skills and knowledge base of your team members, each can be writing Spark code that executes on a distributed cluster of machines. Finally, Spark works for both model training and production, so you can easily deploy the models that you train into production.

So let's get started on Spark. First, we're going to sign into Databricks and import the notebooks that we need. Now that I'm in Databricks, I'm going to click on Home in the upper left-hand corner of the screen, and on this drop-down menu I'm going to click Import. Here I'm going to import from a URL; I'll include this link in the description. I'm importing a DBC file, which is just a zipped version of a bunch of different Databricks notebooks. If I go into the machine learning folder and then click Machine Learning, you can follow along with all of the work that I'm doing.

The other thing I need in order to get started is some sort of cluster to run this computation. On the left-hand side of the screen, I'm going to click on Clusters and then Create Cluster. You can give your cluster a name; I'm going to call this "my first cluster." You can pick your Databricks Runtime version; 4.3 should be just fine, but you can pick whatever the latest version is. You can also change your Scala version here, and Python 2 is fine, though this code will work just fine in Python 3 as well. Now go ahead and click Create Cluster. That's going to take a few moments to spin up, so in the meantime I'm going to navigate back to my notebook.

Now that I'm back in the notebook, in the upper left-hand corner of the screen I'm going to click Detached and choose my cluster, and we can see by the icon on the left that it's still spinning up. Now I'm going to scroll down, and I can make new cells by clicking this plus sign. First, let's get started with Python. I can add the %python magic if I want, but this notebook is Python by default, so I don't necessarily need it. Here I can execute arbitrary Python code, so I'm just going to say x is equal to 3 and print that out. With Ctrl+Enter it executes that code, and now we see that we're writing Python code. I could just as easily do this in Scala: I can use the %scala magic, say val x is equal to 3, and print that out as well, and now you can see the same result. Finally, I could do the same thing with R or SQL as well.

Now, to use the Spark API, I can create a new DataFrame; let's call it df. The easiest way to create a new DataFrame in Spark is to use the range function, so let's just call spark.range(10). Now I have this DataFrame df, and I can use the built-in Databricks function display to print it out. So now we can see we have our first DataFrame in Spark.

So now we have our Spark code up and running in our Databricks environment. In the following videos, we'll walk through exploratory analysis, featurization, the actual model training, and then we'll go ahead and save our model and our predictions. Thanks for watching.
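For reference, here is a minimal sketch of the Python cells run in the walkthrough above. It assumes a PySpark environment; in a Databricks notebook the spark session is already created for you and display(df) would be used in place of df.show(), and the appName used here is just an illustrative placeholder.

    from pyspark.sql import SparkSession

    # In a Databricks notebook, `spark` is already defined; this builder only
    # matters if you run this sketch in a plain PySpark environment.
    spark = SparkSession.builder.appName("first-cluster-demo").getOrCreate()

    # Arbitrary Python in a notebook cell
    x = 3
    print(x)

    # The easiest way to create a DataFrame in Spark: the range function
    df = spark.range(10)

    # In Databricks you would call display(df); show() is the portable equivalent
    df.show()

The Scala cell from the video is the direct equivalent: in a %scala cell, val x = 3 followed by println(x), and spark.range(10) works the same way there.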