Transcript:

Hello, everyone, and welcome to the distribution plots lecture for Seaborn. In this lecture we’re going to discuss different plot types in Seaborn that allow us to visualize the distribution of a dataset. Let’s go ahead and jump to the Jupyter notebook to get started.
Okay, here I am at the notebook. I want to get started by importing Seaborn, and by convention we import Seaborn as sns. And since I’m in the notebook, I’m going to go ahead and say %matplotlib inline; that way I can see our visualizations inside of the notebook.
All right, now let’s get some data to plot. Seaborn actually comes with a few built-in datasets that you can load directly, and I’m going to grab one called tips and save it as a DataFrame called tips. You can do this by just saying tips = sns.load_dataset('tips'), passing in tips as a string, and this will load the tips dataset. Then I can check the head of the tips DataFrame, and it looks something like this. There are seven columns here, and this is basically just data about people who had a meal and then left a tip afterwards. So you have the total bill of the meal, how much they left as a tip, the sex of the person leaving the tip, whether or not they were a smoker, what day and time they ate their meal, and then the size of the party.
All right, let’s go ahead and discuss our first plot type, which is the distplot: d-i-s-t plot. This plot allows us to show the distribution of a univariate set of observations, and univariate is just a different way of saying one variable. Let’s go ahead and explore this. I’m going to say sns.distplot, and for distplot what you do is pass in a single column of your DataFrame. In this case, let’s see how the total bill is distributed, so I’m going to pass in tips['total_bill'] and then run this cell, and you should get a plot that looks like this. If you get a warning here, don’t worry about it.
That actually has to do with another package called statsmodels. It won’t affect your actual Seaborn code, but here we don’t have any warning, so we’re okay.
Notice here that I get basically a histogram and what’s known as a KDE, a kernel density estimate. That’s this line here. Later on in this lecture we’re going to discuss what this KDE is and how we can actually build it up, but for now we can remove it if we want to by passing an additional argument here, kde=False. Just by typing kde=False, you now essentially just have a histogram, and a histogram just shows how your total bill values are distributed. You can see here that on the y-axis you have a count, and then you have these bars along the x-axis as bins. This basically means that most of your total bills are somewhere between $10 and $20.
If you want to get a little more information out of this, you can change the number of bins. As a third argument, say bins and then set an appropriate number of bins. This number really depends on your dataset, but let’s go ahead and choose 30 for now. Now we get a little more definition, and we can still see that most of the bills fall between 10 and 20. If you choose a bin value that’s too high, for instance let’s go ahead and put in 100, you’ll start to get kind of a weird scenario where you’re essentially beginning to plot every single instance of total bill at every single price point. So usually we want to try to find a balanced bin size, but that really depends on your plot itself.
Okay, looks like we have a good idea of the information here. We can read this graphic and basically just say most of the bills fall somewhere between 10 and 20 dollars and begin to fade away as you get higher and higher in bill price. That’s distplot: it allows you to visualize a distribution, essentially a histogram, and you can add a KDE plot on top of that, but we’ll learn about KDE plots later on.
Let’s talk about jointplot. jointplot from Seaborn, sns.jointplot, allows you to basically match up two distplots for bivariate data, meaning you can essentially combine two different distribution plots, and bivariate just means two variables. We also have a kind parameter that we’re going to play around with, which allows us to choose how we actually want to compare these two distributions.
Let me go ahead and show you how we can use sns.jointplot. First you have to pass in an x variable, then a y variable, and then your dataset. Let’s start from the back end, so we’ll pass in our dataset as data=tips; that’s our DataFrame. Then for x and y, you just pass in strings that are column names, the two things you want to compare to each other. So, for instance, maybe I want to compare the distribution of the total bill versus the tip size. Let’s go ahead and do that. I’m going to say total_bill as my x, and on my y-axis I’m going to put in tip, the tip column. So right now I’m just passing in the total_bill column, the tip column, and then data=tips, and I get a plot that looks like this, which is essentially just two distribution plots. Here we can see tip on the y-axis and total bill along the x-axis.
I’m going to zoom out so we can see the whole plot. Then in between I have a scatter plot, and the scatter plot makes sense because it looks like there’s a trend: as you go higher in total bill, you go higher in tip, and that makes sense because tips are usually proportional to your total bill.
Now, jointplot gives you an additional parameter called kind, and this kind parameter allows you to affect what’s actually going on inside of the joint plot. Right now, by default, it’s scatter, but you can also pass in an argument such as hex, and hex gives you basically a hexagon distribution representation. It’s similar to scatter, except that if a hexagon has a larger number of points in it, it gets darker, and if it has fewer points, it gets lighter. Essentially it’s just a way of not having to put all those scatter points on, but instead showing a distribution with these hexagons.
Another argument we can put in for kind is reg, which stands for regression. This will look a lot like a scatter plot, except Seaborn is actually going to draw a regression line on it. Now, we haven’t actually learned about linear regression yet as a machine learning topic, but later on when we do approach that topic, we’ll come back to this and discuss how this line is built. Essentially this is just showing a linear fit to the scatter point data, and you can see it has a p-value and a Pearson coefficient, which we’ll discuss later on when we actually discuss linear regression.
Finally, another kind that you can put in here is kde, and that gives you this two-dimensional KDE, which essentially just shows you the density of where these points match up the most. Alright, let’s go ahead and move on from jointplot.
We’ll usually be using jointplot with the default scatter, because that’s the one that’s easiest to read, and it gives you quite a bit of information right off the bat.
We’re going to go ahead and expand that idea by showing you pairplot. pairplot is essentially going to plot pairwise relationships across an entire DataFrame, at least for the numerical columns, and it also supports a color hue argument for categorical columns, which I’ll show you later on. We see here on top that we have this joint plot. What pairplot is essentially going to do is make this joint plot for every single possible combination of the numerical columns in this DataFrame. Let me go ahead and show you what I mean. Because it’s going to do it for all the combinations, you basically just have to call sns.pairplot and pass in your DataFrame, and this is something we’re going to be doing quite a bit throughout the course. Keep in mind, the larger your DataFrame, the longer pairplot takes, so a lot of times pairplot takes a while if you have a very large DataFrame. This DataFrame is relatively small, so we’re okay.
Note here that we basically have a pair plot for all the numerical columns, so we have size versus total_bill, size versus tip. And when you get to a parameter versus itself, for instance size versus size, instead of doing a scatter plot, which wouldn’t make sense since you’d just have a straight line, you see a histogram instead. Same thing for tip versus tip and for total_bill versus total_bill. That means pairplot is a really nice way to quickly visualize your data, and what’s even nicer is that you can add a hue argument to this, h-u-e. The hue argument is where you pass in the column name of a categorical column, and categorical means not numerical or continuous but actual categories. For instance, the sex column is categorical because there are two categories in it: male and female.
And when you pass this in as hue, you pass in the column name, hue='sex', and it will color the data points based on the column you put in for hue. So here all the green points are female based on this legend, and (I’m going to zoom out so we can see the whole thing) all the blue points are male. As a third argument, you can specify a palette, and the palette allows you to color this with some specific color palette. We’re going to discuss palettes and color and style at the very end of the Seaborn lecture series, but right now I’ll just show you an example. Essentially there are these colormap strings from matplotlib that you can pass in as palette, and they will choose certain colors for whatever the categories are. Here we can see that now male is blue and female is this kind of light pink color. Alright, we’ll touch on palettes and colors and styles a lot more later.
Let’s go ahead and move on to rug plots. Rug plots are actually a very simple concept, but we’re going to use the concept of a rug plot to build up and explain what the KDE plots we saw earlier were. I’m going to go ahead and say sns.rugplot, and just like distplot, the distribution plot, you’re going to pass in a single column here, so we’re going to say tips and pass in the total_bill column. What the rug plot does is very simple: it just draws a dash mark for every point in this univariate distribution, essentially one single variable. To compare that with a histogram, let me go ahead and make that distplot one more time. I will say sns.distplot(tips['total_bill']), run that, and then set kde=False.
Okay, with kde=False, the difference between the histogram here below and this rug plot is that the histogram essentially has bins, and it counts how many dashes were in each bin and then shows it as a number up here. That means between about 10 and 11 there are, if we take a look at this, about 45 dashes, all kind of stacked on top of each other. Then over here, as we go further in total bill price, there are fewer dashes, and that means the bin is going to be less high. That’s the basic relationship between this histogram and this rug plot. Again, the rug plot is a really simple concept: you just draw a dash mark for every single point along the distribution line. Alright, that’s the total bill.
What we want to do is build off of this idea of rug plots to explain what this actual KDE plot is, and that’s going to be this line right here. How do we actually build this line based off of these rug plots? You can see that it kind of has a relationship to the rug plot count. KDE stands for kernel density estimation. You can Google this and check out the Wikipedia page on kernel density estimation, and the page will look something like this. If you scroll down, there’s a really nice figure right here, and this is essentially what we’re going to try to construct. You’ll notice that each of these black dashes here is the rug plot, so the actual points. Then you have these little normal, Gaussian distributions on top of each point, and then you sum them all up to get this final kernel density estimate.
Now, what do I mean by normal distribution or Gaussian distribution? Well, if you also look it up on Wikipedia, it says that in probability theory the normal distribution is probably the most common continuous probability distribution. Essentially, it’s the kind of distribution where you say, oh, how did everyone do on their test?
And you grade all the students and then see the distribution of scores. It’s usually something normal like this, or, for instance, people’s ages or people’s heights. A lot of things tend to follow a normal distribution.
Okay, let’s go ahead and jump back to the Jupyter notebook and touch upon these topics in a little more detail. In order to do this, I’m going to copy and paste some code from the notebook, and you don’t need to worry about understanding this code. It’s just to build out a diagram for explanation. I’ve copied and pasted this code, so let me break down real quick what it’s doing. I just have a few imports. I create a dataset of random data. Then I use a rug plot on that random data. I set up the x-axis for the plot, using np.linspace to create 100 equally spaced points from x_min to x_max. And then here, this is probably the hardest part to understand because it uses a library we haven’t talked about yet, stats.norm. All this does is plot a normal distribution for each of the rug plot points, and that looks like this, if we go ahead and zoom in on this.
Here I have my dataset, and this is a random dataset, so if you run this, yours may look a little different. Keep in mind, we’re not working with tips anymore; we’re just working with some random data. Notice I have blue dashes here, and then these grey lines represent normal distributions on top of each of these blue dashes. So this is a normal distribution centered around this dash, and we have a bunch of them on top of each other. What we’re going to do next is sum them all up to get the kernel density estimate from these basis functions, and this is just the sum of all of these little normal distributions. All right, copying and pasting the second block of code from the notebook allows us to actually sum up all these basis functions, which are just these normal distributions.
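The construction just described can be sketched with NumPy alone. This is not the notebook code referred to above: the Gaussian density is written out by hand instead of using stats.norm, the bandwidth is a rule-of-thumb choice assumed for illustration, and the sum is divided by the number of points so the result integrates to about 1.

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(size=25)  # random points; these are the "rug" dashes

# Rule-of-thumb bandwidth (an assumption here; other choices are possible).
bandwidth = ((4 * data.std() ** 5) / (3 * len(data))) ** 0.2

# 100 equally spaced x values covering the data plus some margin.
x_axis = np.linspace(data.min() - 3, data.max() + 3, 100)


def gaussian_pdf(x, mean, std):
    """Normal density with the given mean and standard deviation, evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))


# One Gaussian basis function per data point: shape (25, 100).
kernels = np.array([gaussian_pdf(x_axis, point, bandwidth) for point in data])

# Summing the kernels and dividing by the number of points gives a density.
kde = kernels.sum(axis=0) / len(data)

# Simple Riemann-sum check that the estimate integrates to roughly 1.
area = kde.sum() * (x_axis[1] - x_axis[0])
print(f"area under the estimate: {area:.3f}")
```

Plotting each row of kernels against x_axis gives the grey curves, and plotting kde gives the single smooth line.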
Once you sum them all up, you get something that looks like this, which is just the KDE plot from before, and that’s how the KDE line is constructed in the distplot, the very first plot we looked at, d-i-s-t plot.
All right, so those are all the major ways you can show distributions of data with Seaborn. Let’s go ahead and quickly review all the various distribution plot types. If we scroll back up, there was the distplot, and with distplot you have two options: pass kde=False and essentially just see a histogram, or leave this blank, and then you can also see the KDE, the kernel density estimate, which, as I explained at the end, is just the sum of all the normal distributions around a rug plot. jointplot is really similar to this idea, except you’re passing in two columns, and you pass them in as the x and y arguments with your third argument equal to the data. The next plot we learned about was the pairplot. pairplot just builds off of the joint plot and is essentially a joint plot for every numerical column in your dataset. That means you just pass in the dataset itself, the DataFrame, and you can pass in hue and palette if you want to color by a categorical column. The next plot we learned about was the rug plot. You usually won’t be using rug plots, but it’s there for you. The main idea of using a rug plot is to build the logic of the kernel density estimation plot, which is done through this code here.
You can take the time and read through this code, but I just wanted to get the point across that when you’re using a rug plot and you want to build a kernel density estimation plot off of that, the KDE plot, you can do that by taking the rug plot, placing a normal distribution on each point, and then taking the sum of all those distributions. That’s the kernel density estimation plot, and we’ve seen how we can get that using distplot.
As a quick point: if you are using distplot here, we know that we can get rid of the KDE plot by saying kde=False. If you actually just want the KDE plot and don’t want the bins, instead of distplot you can call sns.kdeplot and pass in tips['total_bill'], and this will build the KDE plot without the histogram bars.
All right. Hopefully you realize that Seaborn is incredibly powerful and also very simple as far as the code you need to write. Everything we did was done in one line. If you tried to do this in matplotlib, it would have taken you multiple lines. But what’s nice about this is that it works off of what you know of matplotlib, and we’ll see that a lot more when we talk about styling and colors; a lot of that matplotlib knowledge is going to be transferable to editing little things in these plots. Okay, I hope you’re beginning to enjoy Seaborn. Again, like I mentioned before, it’s one of my favorite libraries, and I can’t wait to show you the next couple of plot types we’re going to learn about with Seaborn. Thanks, everyone, and I’ll see you at the next lecture.