Transcript:

Hey there, how's it going, everybody? In this video, we're going to be learning how to plot histograms. Histograms are great for visualizing the distribution of data, where the data falls within certain boundaries. It's a lot like a bar graph, but a histogram groups the data up into bins instead of plotting each individual value. So the best way to see what this looks like is to just take a look at some examples. Now, I would like to mention that we do have a sponsor for this series of videos, and that is Brilliant.org. So I really want to thank Brilliant for sponsoring this series, and it would be great if you all could check them out using the link in the description section below and support the sponsors. I'll talk more about their services in just a bit. So with that said, let's go ahead and get started. I have a little starting code here that you might recognize if you're continuing from previous videos, but if you're not, then let me give a quick overview of the code here and what's going on. So up here at the top of the code, I am importing pandas. I'm also importing pyplot from Matplotlib. I'm using the fivethirtyeight style just to make our plots look a little nicer, and here is the data that I'm going to be using for this video. Right now, I just have a list of ages here between 18 and 55. Here's some data that I have commented out from a CSV file, and we'll look at this once we get further along in the video and see how to plot out more data than just this small list. OK, so down here at the bottom, we are also creating a title for our plot. We have x- and y-axis labels. We have a tight layout, which just gives our plot some padding, and we are also doing plt.show(), which will actually show our plot. Now, as usual, if you'd like to follow along, then I will have this code available on my GitHub, and there's a link to that in the description section below.
You can go there and copy and paste this into your editor so that you can follow along with this exact data, and I'm also going to have the data CSV file that I'm using in this video there as well. OK, so like I was saying, we're first going to look at how to do this using this list of data directly here in the Python script, and then we'll look at a real-world example with data that I'll load in from a CSV file. So first, let's look at this small list of sample data. Let's pretend that we took a survey and we tracked the ages of all the people who responded. Now, it might be useful to plot those ages to get an idea of which age groups are in our sample. So how should we actually plot these? Well, off the top of your head, you might think that a bar chart would be a good idea for this, but if you think about it, we possibly have up to a hundred different possible ages, maybe even more. So if you plotted out how many responses we got from each age, then that would mean you'd have almost a hundred different columns, which definitely isn't useful. So this is where histograms come in. Histograms allow us to create bins for our data and plot how many values fall into those bins. So to see this, let's create a histogram of this list of ages that we have here. To do this, I can simply say plt.hist, and we will plot out those ages. Now, if I ran this now, then it would give us a plot, but really, we wouldn't know what bins it's actually using, so I always like to pass those in manually and explicitly so that people know what those bins are. So when we specify bins, we can either pass in an integer or a list of values. If we pass in an integer, then it will just make that number of bins and divide our data into those accordingly. So, for example, if I was to say bins is equal to 5, then this will divide all of these ages up into 5 different bins and then tell us how many people fell into those age ranges.
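As a rough sketch of that first step: the transcript never shows the actual list, so the ages below are a stand-in chosen only to fall between 18 and 55, and the title and axis labels are placeholders too. The point is just what plt.hist does when bins is an integer.

```python
from matplotlib import pyplot as plt

plt.style.use("fivethirtyeight")

# Stand-in survey ages (an assumption; the transcript only says they
# fall between 18 and 55).
ages = [18, 19, 21, 25, 26, 26, 30, 32, 38, 45, 55]

# An integer asks Matplotlib for that many equal-width bins spanning
# the smallest to the largest value in the data.
counts, bin_edges, _ = plt.hist(ages, bins=5)

# Placeholder title and labels; the originals are not in the transcript.
plt.title("Ages of Respondents")
plt.xlabel("Ages")
plt.ylabel("Total Respondents")
plt.tight_layout()

# Five bins need six edges: the first is min(ages), the last is max(ages).
print(bin_edges)
print(counts)
```

With this stand-in list the edges land at roughly 18, 25.4, 32.8, 40.2, 47.6, and 55, which is why the automatically chosen boundaries are awkward to read off the plot. Calling plt.show() afterwards displays the figure.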
So if we run this, then we can see that we get a pretty simple histogram here, and what this is is a distribution. Now, I personally find it a bit difficult to read these sometimes if we don't have edge colors for each bin, because they all just kind of run together here, so I don't know exactly how many bins there are here and here. I'm guessing, since we have five bins, it's two bins here and three bins here, but let's add in some edge color so that's more clear. We can add those in by going back to our plot here, and I'm just going to pass that as an argument, so edgecolor is equal to, I'll just say, 'black'. So now let's run this, and now we can see those bins a bit more clearly. Let me make this a little larger, and also to where you can see the ages up here at the top, and let me explain what this is actually doing. So we said that we wanted our data plotted on a histogram and we wanted that broken up into five different bins, so it calculated those age ranges for us. So this looks like it's between, let's see, 18 and like 26 maybe, and then 26 to 33 and so on. But what this is telling us here is that there are four people in our ages here that fall between 18 and 26, and there are four people that fall between 26 and 33, and then we just have one person in each of these higher age-range bins. So if you pass in an integer for our bins, then that's what we get, but we can also pass in our own list of values, and those values will be the bin edges. I like passing in a list of bins for this kind of data because you have more control over the exact values. So, for example, let's say that I wanted to plot the ages broken up into groups of ten-year differences. So I could say, right here above my plot, bins is equal to, and I'm just going to say that we want bins for 10, 20, 30, 40, 50, and 60. And now, instead of passing in that we want 5 bins, I want to say that I want to use this list as my bins.
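The edge color and the explicit bin list just described might look something like this. Again, the ages list is a hypothetical stand-in rather than the list from the video, and the loop at the end is only there to print each bin's range next to its count.

```python
from matplotlib import pyplot as plt

# Stand-in survey ages (an assumption; not taken from the transcript).
ages = [18, 19, 21, 25, 26, 26, 30, 32, 38, 45, 55]

# Six edges define five bins: 10-20, 20-30, 30-40, 40-50, 50-60.
bins = [10, 20, 30, 40, 50, 60]

# edgecolor draws an outline around each bar so adjacent bins
# no longer run together.
counts, edges, patches = plt.hist(ages, bins=bins, edgecolor="black")

# Pair each bin's left and right edge with the number of values in it.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{int(left)}-{int(right)}: {int(count)}")
```

Dropping the first value from the list (so it starts at 20) would leave any ages below 20 out of the plot entirely, since values outside the outermost edges are not counted.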
So now if I run this, then we can see that we still get five different bins here, but that's only because we have six values here in our list. So it starts at 10, and then it's 10 to 20, 20 to 30, 30 to 40, 40 to 50, and 50 to 60, so that is five bins total. So if I open this back up, the reason I like using my own bins for this kind of data is because now it doesn't have to try to guess where I want these broken up. We can see that now we have from 10 to 20, so it's a lot easier to read. We don't have to guess that it's like 26 or something like that. So we're saying that from 10 to 20, 2 people in our ages list fell into that bin. There were 4 people from 20 to 30, three from 30 to 40, one from 40 to 50, and one from 50 to 60. So that's how you plot and read a histogram, and we can even exclude some data if we don't want to add those ranges to our bins. So, for example, let's say that we didn't want to include the ages between 10 and 20 in my results. Well, to do that, we can just simply remove 10 from the bins, and now 20 will be that leftmost value. So now if we run this, then we can see that it's not even plotting out the ages from 10 to 20 there, so the ages below 20 don't even show up in our results here. This is now just giving us our results for the people who fell into these age ranges between 20 and 60. Okay, so now that we've looked at this small example, let's look at a real-world example using some real data. So let me uncomment what I've got here. Let me remove ages here, so I'm just going to remove that data that is directly in our Python script. Now I'm going to uncomment the data that I had down here. Let me cut that out and paste it here above our bins and our plot. Okay, so I'm loading in this data.csv file, and I'm using this pandas read_csv method to do this. Now, we've done this a few times so far in the series, but if this is your first video that you are watching in the series, then let me explain this really quick. So we are loading in this data.csv file.
What this does is go to this data.csv file here. So let me explain what this survey data is. We have these Responder_id values, and this is just an ID for each person who responded to the survey. So this is one person here, this is another person here, another person here, and then our Age column here is just the age of each person who responded to this survey. So this person was 14, this person was 19, 28, 22, and so on. So back here, we have our ids variable, and we're setting that equal to data, and then we are passing in this Responder_id key. What that does is set those ids equal to all of these IDs here that are in this Responder_id column. And here we're saying ages is equal to data with the Age key, and that is setting that ages variable equal to this entire column here for our ages. The data that I'm using here is the responses from the 2019 Stack Overflow Developer Survey, so this is actually real data for people who answered that survey. We have, let's see, over 79,000 responses here in this data.csv file. Okay, so let's plot a histogram of the ages for this data set and see what age ranges most people who answered this survey fall into. So I'm going to expand the bins here a bit, and I'm going to say 10, 20, 30, 40, 50, 60. We'll also cover 70, 80, whoops, 80, 90, and let's also put in a hundred there. Now, since we called this ages variable here the same thing that we had before, we don't even need to change our histogram plot, because that is still just ages there.
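The loading step and the wider bin list described above could be sketched as follows. The real data.csv isn't available here, so this uses a tiny in-memory stand-in with made-up rows, and the column names Responder_id and Age are inferred from the narration rather than confirmed.

```python
import io

import pandas as pd
from matplotlib import pyplot as plt

# Tiny stand-in for data.csv (hypothetical rows; the real file is said
# to hold over 79,000 responses). Column names are assumed.
csv_text = """Responder_id,Age
1,14
2,19
3,28
4,22
5,30
6,42
7,24
8,23
"""

# With the real file on disk this would be pd.read_csv("data.csv").
data = pd.read_csv(io.StringIO(csv_text))
ids = data["Responder_id"]   # one ID per respondent
ages = data["Age"]           # the whole Age column as a pandas Series

# Ten-year bins from 10 up to 100.
bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# plt.hist accepts the Series directly, so the call is unchanged
# from the small-list version.
counts, _, _ = plt.hist(ages, bins=bins, edgecolor="black")
print(counts)
```

Because the variable is still called ages, the plotting call itself doesn't need to change when the data source does.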
I should be able to run this and get some results here from that real data. We can see here, based on this plot, that almost 40,000 of the respondents were between the ages of 20 and 30, and almost 25,000 were between the ages of 30 and 40. Now, it might not look like we have data for 70 to 80 and 80 to 90, but it's likely because there just weren't many responses with those ages, and compared to 40,000 responses for the 20 to 30 group, it's just too small to show up. But I bet if I was to zoom in on these values here, then we will start to see something. Okay, so here's 70 to 80. If I zoom in here, then we can see 80 to 90, so there are some responses there, but they're just being dwarfed by these numbers over here. Now, when you have certain values that are a lot larger than your other values, then you can plot this on a logarithmic scale to not make this look so extreme. To do this, we can add an argument of log=True to our plot. So within our hist method, I'm just going to say log is equal to True. And now if I run this, then this is plotting on a logarithmic scale, and we can see that now we do have that data visible for 70 to 80, 80 to 90, and 90 to 100. So we actually had more people who responded to the survey saying that they were between the ages of 90 and 100 than people who were between 80 and 90. So I think that's kind of interesting there. Now, sometimes you might find it useful to add some additional information within these plots as well. So, for example, let's just leave the histogram how we have it for now, but let's say that we want to plot a vertical line where the median age of all the respondents is.
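Before getting to that median line, here is a small sketch of the log=True option from a moment ago. The ages are made up and deliberately lopsided (they are not the survey numbers), and the side-by-side layout uses plt.subplots with the Axes-level hist method, which is a choice for this illustration rather than what the video does.

```python
from matplotlib import pyplot as plt

# Made-up, heavily skewed ages (not the survey data): two large groups
# and a handful of much smaller ones.
ages = [25] * 4000 + [35] * 2500 + [75] * 20 + [85] * 2 + [95] * 5
bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

fig, (ax_linear, ax_log) = plt.subplots(1, 2, figsize=(10, 4))

# Linear y-axis: the small bins are too short to see next to the big ones.
ax_linear.hist(ages, bins=bins, edgecolor="black")
ax_linear.set_title("Linear scale")

# log=True switches the count axis to a logarithmic scale, so bins with
# only a few values become visible.
counts, _, _ = ax_log.hist(ages, bins=bins, edgecolor="black", log=True)
ax_log.set_title("log=True")

fig.tight_layout()
```

The counts returned are the same either way; only the axis scaling changes. With plt.hist, as used in the video, the same log=True keyword applies.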
I've got this commented out down here at the bottom, so let me uncomment this median age, and also I'm going to uncomment this color and this legend as well. So I went through and I calculated the median age of all of the respondents, and it was 29 years old, so now let's plot a vertical line on our existing plot at that age. To do that, just above our legend here, I'm going to say plt.axvline. I'm pretty sure that stands for axis vertical line, and we want that line to be plotted at the median age. Now let's also add in a color here, and the custom color I'm going to add is this one. I think this is just a red color that I grabbed. And also let's put in a label so that we know what this line represents, and I'm just going to say Age Median. So now let's run this, and now we can see that within our histogram, we have this vertical line here, which is the age median. So this plot tells us a lot of things. It tells us how many of the people who answered the survey fall within which age groups, and also where the median is for those survey results. And if you think that this line is a little bit thick and kind of obstructing the data, then you can play around with how this looks. So, for example, if you wanted to change the thickness there, we could say linewidth is equal to 2. If I run that, then that's a little thinner there. So that's basically what these histogram plots are used for. We can use these for dropping our data into these different bins and seeing how many values fall into each of them. So that's what you would use a histogram for. Okay, so we are just about finished up here, but before we end, I'd like to mention the sponsor of this video, and that is Brilliant.org.
Brilliant is a problem-solving website that helps you understand underlying concepts by actively working through guided lessons. They have computer science courses ranging from algorithms and data structures to machine learning and neural networks. They even have a coding environment built into their website so that you can run code directly in the browser, and that's a great way to complement watching my tutorials, because you can apply what you've learned in their active problem-solving environment, and that helps to solidify that knowledge. Their guided lessons will challenge you, but you also have the ability to get hints or even solutions if you need them. It's really tailored towards understanding the material. Their computer science material is fantastic, and I really like what they're doing. They also have plenty of courses depending on what you're most interested in, so they have courses in different fields of mathematics, astronomy, solar energy, computational biology, and all kinds of other great content. So to support my channel and learn more about Brilliant, you can go to brilliant.org/cms to sign up for free, and also the first 200 people that go to that link will get 20% off the annual premium subscription. You can find that link in the description section below, and again, that's brilliant.org/cms. Okay, so I think that is going to do it for this video. I hope you feel like you got a good understanding of how to use histograms and also when they might be appropriate for different kinds of datasets. These are definitely nice when we have data like we did in this video, where we want to divide those ages up into different bins and get an idea of those age distributions, because like I was saying before, you might be tempted to use a bar plot, but when you have a hundred ages like this, that means that we're going to have a hundred little bars, and sometimes that just doesn't tell you the information that you're looking for.
And these histograms are better suited for that. Now, in the next video, we're going to be learning about scatter plots. Scatter plots are great when we want to show the relationship between two sets of values and see how they're correlated. So, for example, let's say that we wanted to see how salaries were correlated with age or something like that. Well, we would probably assume that, on average, we'd see higher salaries with higher ages, but to be sure, we can plot that with a scatter plot and see what that data looks like. So definitely be sure to check out that video. But if anyone has any questions about what we covered in this video, then feel free to ask in the comment section below, and I'll do my best to answer those. And if you enjoy these tutorials and would like to support them, then there are several ways you can do that. The easiest way is to simply like the video and give it a thumbs up, and also it's a huge help to share these videos with anyone who you think would find them useful. If you have the means, you can contribute through Patreon, and there's a link to that page in the description section below. Be sure to subscribe for future videos, and thank you all for watching.
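For anyone following along afterwards, the median line from earlier in the video can be sketched roughly like this. The ages list is a hypothetical stand-in, so the median is computed from it with the standard library instead of using the 29 quoted for the survey, and the hex value is just an arbitrary red, since the transcript doesn't give the exact color.

```python
import statistics

from matplotlib import pyplot as plt

# Stand-in ages (an assumption; not the survey data from the video).
ages = [18, 19, 21, 25, 26, 26, 30, 32, 38, 45, 55]
bins = [10, 20, 30, 40, 50, 60]

# The video hard-codes the survey median; here it is computed from
# the stand-in list.
median_age = statistics.median(ages)

plt.hist(ages, bins=bins, edgecolor="black")

# axvline draws a vertical line across the axes at the given x value.
# The color is an arbitrary red chosen for this sketch.
line = plt.axvline(median_age, color="#fc4f30", label="Age Median", linewidth=2)

# The legend picks up the label passed to axvline.
plt.legend()
plt.tight_layout()
```

The label only appears on the plot once plt.legend() is called, and linewidth can be adjusted if the line covers too much of the bars.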