Transcript:
Hey, everyone, this is my introduction to data analysis/data visualization with Python. In this video, I’m gonna cover why you might want to use data visualization, and why you might want to use Python and Matplotlib for it. Then we’re gonna go over some simple examples of how to actually use these tools, and then, using these tools, we’re gonna do sort of a real analysis with a real data set at the end. In this video, I’m only gonna cover line charts, just to keep everything simple. I’m also gonna put a more detailed version of this outline in the comment section below, so you don’t have to watch the whole thing if you don’t want to.

Okay, so why should you use data visualization in the first place? Well, data visualization is actually often the first step of any type of data analysis work, whether it’s simple data analysis, statistical analysis, or machine learning analysis. The reason for that is that visualizing data often gives you an intuitive understanding of the data, and it often helps you see patterns that are otherwise hard to see. We’re gonna see an example of that later.

Okay, and why should you use Python for this? Well, Python is not the only good choice, but I would say it’s one of the best. The reason is, first of all, it’s a general-purpose language that’s pretty easy to use and learn. It also has many libraries for scientific computing and data science, including Matplotlib. And if you work at a company, your company might already use Python for something else, and if that’s the case, that’s really nice, because then you and your team are not gonna have to learn a totally new language to do some data analysis.

And why are we using Matplotlib for this? Well, Matplotlib is not the only good visualization library for Python, but it’s still one of the most popular choices, and there are actually other libraries that are based on Matplotlib. So if you learn Matplotlib, it’s gonna help you learn these other libraries.
For example, you could learn this one called Seaborn later on if you want to. And Matplotlib is also pretty easy to get started with.

Anyway, let’s dive into a demo. For this demo, we’re gonna use something called Jupyter Notebook and a few other Python libraries, and we’re gonna use Anaconda to install them. If you’re not familiar with Jupyter Notebook and Anaconda, I have an explanation about them in my Python tutorial video, so I’m gonna leave a link to that in the description.

Anyway, to install Anaconda, just search for "anaconda python" or go directly to anaconda.org, and there find the button that says Download Anaconda and select whatever OS you are using. I’m using Mac here. Click Download under the Python 3.something version instead of Python 2.something, because we’re gonna use Python 3 here. Select where you want to download this package and save it. Once it’s downloaded, open up the package that you just downloaded and then just click Continue, Continue, Continue, Continue, Agree, then "Install for me only" or install on a specific disk, it doesn’t matter which one, then Continue, and click Install. This process is gonna take a while. After some waiting, you might see this prompt to install Microsoft VS Code. We don’t need that, so let’s just continue here and then close.

Then, to launch Jupyter Notebook, you can do it through this thing called Anaconda Navigator. Just launch it like you launch any other application, dismiss whatever comes up, and then click Launch in the Jupyter Notebook section. You should see a browser window show up with the Jupyter Notebook interface.

Now, if you want to follow this tutorial, the first thing you should do is create a new folder, let’s say on the desktop, and let’s call this one data visualization. We’re gonna put all our data and our notebook file here. So let’s first download our data. To do that, just go to
csdojo.io/data and download these two files, sample_data.csv and countries.csv, and then put these CSV files in the folder that you just created, data visualization. After that, go back to the Jupyter Notebook interface, and you can just navigate to Desktop and then to the folder that we just created, data visualization. To create a new Jupyter Notebook file here, just find the New button on the right and click Python 3. Right now, this notebook file has Untitled as the title, so let’s change it to Data Visualization with Python and click Rename. Now you have a notebook called Data Visualization with Python. You can check it just by going to the desktop and then to the folder that you just created, and you should see that there’s a file called Data Visualization with Python.ipynb. It’s really important that this notebook file is in the same folder as the data that you just downloaded, countries.csv and the other one.

Once everything is set up, just write in the first cell: import pandas as pd. This means we want to import a module called pandas as pd, that is, we want to give it sort of a nickname, and that’s going to be pd. You can run the cell by clicking this button, and now pandas is imported as pd. Here we’re gonna use pandas for importing and using some data from our CSV files. We need to import another module here, so for that, just write: from matplotlib import pyplot as plt. This says, from the matplotlib package, import the pyplot module and then call it plt. Let’s run this cell, and now pyplot is imported. We’re gonna use pyplot from Matplotlib for making our charts.

So here, let’s first take a look at a really simple example of how to use pyplot. I’m going to write x = [1, 2, 3], which is a list of 3 elements, and y = [1, 4, 9]. To plot this set of data, you can just write plt.plot(x, y), and this plots x on the x-axis and y on the y-axis. And then you can show this graph by writing plt.show().
When you run the cell, you should see a graph like this. You see that the values of x are 1, 2, and 3 as expected, and the values of y are 1, 4, and 9. If you want to add a title to this graph, you can do so by writing plt.title("Test plot") right after the plot statement and before the show statement. You can add an x label and a y label as well by writing plt.xlabel, let’s call the x label "x", and plt.ylabel, let’s call the y label "y" here. When you run this cell, you see that there’s a title called Test plot, an x label called x, and a y label called y.

Okay, what if you wanted to plot multiple lines here? Well, to do that, let’s create another list. Let’s call it z, and this one is gonna have 10, 5, and 0 inside. To plot x and z on top of x and y, you can just write plt.plot(x, z) right after plt.plot(x, y), and then let’s change the y label here to "y and z". When you run this cell, you should see these two lines. The blue line represents x and y, and the orange line represents x and z, so plt.plot(x, z) plots x on the x-axis and z on the y-axis. But right now it’s kind of hard to tell which line represents which data, so we can fix it by adding a legend statement. Let’s add that after the y label statement by writing plt.legend(["this is y", "this is z"]). Note here that this legend function takes a list as an argument. When you run this cell, you should see this legend that says the blue line is "this is y" and the orange line is "this is z".

Okay, that’s the basics of pyplot. Now, let’s see how to load data from a CSV file. For that, you can just write sample_data = pd.read_csv (by the way, I just pressed Tab here to do autocomplete) and then ('sample_data.csv') in parentheses.
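As a quick recap of the pyplot basics up to this point, the whole cell might look something like the sketch below. It assumes the second list is z = [10, 5, 0] (as best understood from the audio) and captures the return values of plt.plot and plt.legend in variables purely for illustration; the video does not do that.

```python
from matplotlib import pyplot as plt

x = [1, 2, 3]
y = [1, 4, 9]
z = [10, 5, 0]  # assumed values for the second data set

# Each plt.plot call adds one line to the same chart.
line_y, = plt.plot(x, y)  # x on the x-axis, y on the y-axis
line_z, = plt.plot(x, z)  # second line drawn on top of the first

plt.title("Test plot")
plt.xlabel("x")
plt.ylabel("y and z")

# legend takes a list of labels, matched to the lines in plotting order.
legend = plt.legend(["this is y", "this is z"])

plt.show()
```

In a Jupyter Notebook cell, the chart appears right below the cell when it runs.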
Now, before you run this cell with read_csv, make sure that the notebook file, Data Visualization with Python.ipynb, is in the same folder as sample_data.csv. When you run this cell, this data, sample_data.csv, is loaded by the pandas module, which we call pd, and then it’s assigned to this variable called sample_data. You can check what’s inside this variable just by writing sample_data in a new cell. When you run this cell, you should see something like this. As you can see, this data has three columns, column_a, column_b, and column_c, and five rows, and you see a bunch of values inside this table. If you want to check whether this set of data is exactly the same as the original data, you can do so by opening up the original data file, sample_data.csv, with Excel or any other spreadsheet application. When you open it, you should see exactly the same data: column_a, column_b, and column_c, with five rows and a bunch of values. The only difference that you might see is that in Jupyter Notebook, you might see these numbers 0, 1, 2, 3, and 4, and these are just indices for the rows.

You can check what type this variable is by writing type(sample_data). When you run this cell, it says that this is pandas.core.frame.DataFrame. So this is a DataFrame type that’s defined by the pandas module, and the DataFrame type is used to contain a table-like piece of information, just like this one.

Okay, now what if you wanted to plot data in this data frame, for example, the values of column_a on the x-axis and column_c on the y-axis? Well, to do that, you need to be able to retrieve a specific column, and you can do that by writing sample_data.column_c. When you run this cell, you see that column_c is retrieved.
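The loading and inspection steps above can be sketched as follows. To keep the example self-contained, it reads the CSV text from an in-memory string (io.StringIO) instead of the real file; the values are the ones mentioned in the video, and the header names column_a, column_b, and column_c are assumed to match the file.

```python
import io

import pandas as pd

# Stand-in for sample_data.csv, built from the values mentioned in the video.
# With the real file in the notebook's folder, pd.read_csv("sample_data.csv")
# does the same job.
csv_text = """column_a,column_b,column_c
1,1,10
2,4,8
3,9,6
4,16,4
5,25,2
"""

sample_data = pd.read_csv(io.StringIO(csv_text))

print(sample_data)        # the table, with row indices 0..4 on the left
print(type(sample_data))  # the DataFrame type defined by pandas

# Attribute access retrieves a single column as a Series.
column_c = sample_data.column_c
print(column_c)
```

Attribute access like sample_data.column_c only works when the column name is a valid Python identifier; sample_data["column_c"] retrieves the same column and works for any column name.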
The column_c series has the values 10, 8, 6, 4, and 2, and the numbers you see on the left are just indices, 0, 1, 2, 3, and 4, just like before. You can check what type this is by writing type(sample_data.column_c). When you run the cell, you see that this is pandas.core.series.Series. So this is basically a Series type that’s defined by the pandas module, and it’s a type that’s used to store a series of values, for example, these values 10, 8, 6, 4, and 2.

Now, what if we wanted to retrieve a specific value out of this series? Well, if you want to retrieve, for example, the second value here, 8, you can do so by writing sample_data.column_c.iloc[1], and this retrieves the second value of the series, 8. If you want to retrieve the third value, 6, you can write .iloc[2], and that gets the third value. And if you want to retrieve the first value, you can write .iloc[0], and this should give us 10, and it does.

Okay, and using what we’ve just learned here, we’ll be able to plot the data in this data frame. So let’s say we want to plot column_a on the x-axis and column_b on the y-axis. We can do that by writing plt.plot(sample_data.column_a, sample_data.column_b), and we can show it by writing plt.show(). Let’s see how it looks. We have 1, 2, 3, 4, and 5 on the x-axis, and on the y-axis we have 1, 4, 9, 16, and 25, as expected. If you want to add column_c to this plot, you can write plt.plot(sample_data.column_a, so let’s use column_a as the x-axis again, and then sample_data.column_c). When you run the cell, you see that there are two lines here, just like before. If you want to make this graph a little bit easier to read, you can add titles and a legend.
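Here is a minimal sketch of the .iloc lookups and the two-line plot just described. It builds the data frame inline from the values mentioned in the video rather than reading the file, and the variable names first, second, third, line_b, and line_c are only there for illustration.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Same values as sample_data.csv in the video, built inline so the
# sketch runs without the file.
sample_data = pd.DataFrame({
    "column_a": [1, 2, 3, 4, 5],
    "column_b": [1, 4, 9, 16, 25],
    "column_c": [10, 8, 6, 4, 2],
})

# .iloc selects by position, counting from 0.
first = sample_data.column_c.iloc[0]
second = sample_data.column_c.iloc[1]
third = sample_data.column_c.iloc[2]
print(first, second, third)  # prints: 10 8 6

# Both lines share column_a on the x-axis.
line_b, = plt.plot(sample_data.column_a, sample_data.column_b)
line_c, = plt.plot(sample_data.column_a, sample_data.column_c)
plt.show()
```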
And by the way, in this plot function, you can use the third argument to change how the plot looks. For example, if you give it 'o', as a string, as the third argument in the first line, the one for column_b, and run the cell, the plot becomes dots instead of just a line. There’s a lot more you can do, and you can find more about it in the official documentation.

Anyway, let’s move on and do sort of a real analysis with a real data set. For this analysis, we’re gonna use this data, countries.csv. It should be in the same folder as well. When you open it, you should see this data. We have a bunch of countries, a bunch of years ranging from 1952 to 2007 in five-year steps, and the population for each year for that country. You can see that there are a lot of rows in this data. So let’s now import that data just like before by writing pd.read_csv('countries.csv'), with either single quotes or double quotes. By the way, this is a string, 'countries.csv', and in Python you can use either double quotes or single quotes to express a string. Let’s assign that to a new variable called data by writing data = in front. When you run the cell, this data is loaded into data. So once you write data in a new cell and run it, you should be able to see this data in a data frame.

Now let’s say that the analysis we want to do here is to compare the population growth in the US and China. To do this analysis, the first thing we want to do is isolate the data for the US and China. We can do that for the US by writing us = data[data.country == 'United States'], with United States in single quotes. When you run this cell, us now only contains the data for the United States. So let’s break down this statement a little bit more. Let’s click Insert here and Insert Cell Below, and write just data.country == 'United States' on its own.
This expression actually gives a Series of a bunch of Trues and Falses. When the row is not for the US, it gives us False, and when it is for the US, it gives us True. We don’t see any Trues here, but there are a bunch of Trues where the rows are for the US. And then, when you write data, square brackets, and this Series of a bunch of Trues and Falses inside, it gives us the portion of the data where the value of the Series is True, and that’s the data for the US, as you can see here. Then we just assign it to this variable called us.

Okay, let’s now do the same thing for China by writing china = data[data.country == 'China']. When you write china here and run this cell, you should only see the data for China. Using these two variables, us and china, we’ll be able to compare their population growth.

So let’s first plot the US population here by writing plt.plot(us.year, us.population). You can show this plot with plt.show(), and when you run this cell, you should see this graph. You see that us.year is plotted on the x-axis and us.population is plotted on the y-axis. But you see this scientific notation thing, 1e8, because the numbers are so big. So let’s divide the whole population, each number in the series, by 1 million, or 10 to the power of 6. That’s 10 ** 6 in Python. When you run the cell again, you now see the population in millions. So this is 160 million, and it goes up to, I think, more than 300 million in 2007.

Let’s plot China’s data on top of this plot by writing plt.plot(china.year, and so on. Actually, you could use us.year or china.year, because we have exactly the same set of years, but for now let’s just use china.year for the x-axis and china.population for the y-axis, and we’re gonna divide this by 1 million as well to make the population show in millions. When you run the cell, you should see these two lines.
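The filtering and plotting steps above can be sketched like this. The column names country, year, and population follow the video, but the tiny data frame is only a stand-in for countries.csv: the population figures are rounded placeholders, not the real file contents, and the name is_us is introduced just to show the intermediate True/False series.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Stand-in for countries.csv with PLACEHOLDER numbers (not the real data).
data = pd.DataFrame({
    "country": ["China", "China", "United States", "United States"],
    "year": [1952, 2007, 1952, 2007],
    "population": [550_000_000, 1_300_000_000, 160_000_000, 300_000_000],
})

# Comparing a column with a value gives a Series of True/False, one per row.
is_us = data.country == "United States"
print(is_us)

# Indexing the data frame with that Series keeps only the True rows.
us = data[is_us]
china = data[data.country == "China"]

# Dividing by 10**6 shows the population in millions on the y-axis.
line_us, = plt.plot(us.year, us.population / 10**6)
line_china, = plt.plot(china.year, china.population / 10**6)
plt.show()
```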
Let’s add a legend and titles here to make this graph easier to read. So plt.legend(['United States', 'China']), and the x label, plt.xlabel, should be just 'year', and plt.ylabel should be 'population'. Run this cell again, and this graph is much easier to read. You can see that China’s population started out much larger than that of the US in 1952, and it seems like it’s growing faster as well.

Now, what if you wanted to compare, instead of the absolute amounts that you see here, the percentage growth from the first year that we have in our data, 1952? Well, there are several different ways of doing this, but I’m gonna show you just one way. To do that, let’s first copy this whole block of code over here. Now, let’s say that for each country, we want to find the percentage growth from the first year. So we want to set the first year’s amount to 100, as in 100%, and show the rest of the data as a percentage relative to the first year. We can do that by dividing this whole series, for example us.population, by the first year’s population and then multiplying everything by 100.

To show you what I mean, let’s just create a new cell here above by clicking Insert Cell Above. Here, first, I’m gonna write us.population, and you see a series of populations here, one for each year. The first row you see here is the first year’s population, or the population in 1952, I think. Let’s insert a new cell below here. Now, to retrieve the first year’s population, you can just write us.population.iloc[0], and this gives us the first year’s population, which is this amount. Then we can divide the whole population, this whole series, by the first year’s population just by writing us.population / us.population.iloc[0], and this gives us this series. As you can see, the first year is set to 1 and the rest of the years are shown in relative amounts.
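Here is a small sketch of that relative-growth calculation. The us data frame is a stand-in with made-up round numbers chosen so the results are easy to check by hand; it is not the real data, and the names first_year, relative, and percent are only illustrative.

```python
import pandas as pd

# Stand-in for the US rows of the data, with MADE-UP round numbers.
us = pd.DataFrame({
    "year": [1952, 1957, 1962],
    "population": [160_000_000, 200_000_000, 240_000_000],
})

first_year = us.population.iloc[0]     # population in the first year
relative = us.population / first_year  # first year becomes 1.0
percent = relative * 100               # first year becomes 100 (%)

print(relative)
print(percent)
```

Dividing a Series by a single number divides every element, which is why no loop is needed here.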
And if you multiply everything by 100, just by writing * 100 here, you’ll be able to show everything in percentage amounts. So you can see that the first year is shown as 100%, and from 1952 to 2007, which is the last year we have, the population grew by 90 percent. Now, like I said earlier, this is not the only method to show the relative growth in population, but I chose this method here because it’s pretty simple to implement.

Anyway, let’s copy this whole thing and paste it over here to replace the y-axis values. Let’s do the same thing for China as well: copy the whole thing for China here and then replace us with china. Once you do that, let’s change the y label from population to population growth, and let’s just write (first year = 100), just for clarity here. When you run this cell, you should see this graph. You can see that even in percentage terms, China’s population grew much faster than that of the United States. The US population grew by 90 percent from 1952 to 2007, but during the same time, China’s population grew by more than 120 percent.

Okay, this was a pretty simple example, and it actually came from my course called Introduction to Data Visualization. If you liked this video, I’d actually highly recommend it. It’s a course with more videos just like this one, and I cover more realistic and complex examples and more types of data visualization techniques, not just line charts. So if you want to check out the course, you can just go to csdojo.io/moredata. You can actually take this course for free by signing up for Pluralsight’s 10-day free trial. That’s the site the course is hosted on. Anyway, as always, thanks for watching this video, and I’ll see you in the next one.