Transcript:
Let’s see how to use the CSV and DataFrames packages in Julia to download and load data. We’ll use this data set on COVID-19 from Johns Hopkins University. The first thing we need to do is download the data itself, so we’ll define a variable called url and download the data from that web link into a CSV file, covid_data.csv. We specify the file name where we want to store the data, in my current directory, as a string; it is the second argument to the download function.
Now we need to read in that CSV file. Remember that CSV stands for comma-separated values, and Julia has a package that is helpfully called CSV itself, in all capital letters; CSV.jl is the name of the package. So we’ll add all of those packages and use them. The two that we’re going to focus on in this video are CSV and DataFrames.
When I load a CSV, I get this data table, or data frame, that we have seen in the other video. But how do I actually get that? Well, inside the CSV package there is an object type called File, and we create one of those objects using the command CSV.File, passing in the name of the file that is stored on disk. I’ve given that a name, and then I transform it into a data frame. So these two lines read the file into this CSV.File object and then convert it into a DataFrame object from the DataFrames package, so that I can manipulate it in the way that I would like. As we mentioned in the other video, a data frame is a standard way in Julia of storing and manipulating heterogeneous data, that is, data where the different columns of the table can have different types.
So let’s see how to take that and process it. The first thing we notice from this table is that the titles of the columns are maybe not that useful: for example, Province/State, Country/Region, and Lat and Long instead of latitude and longitude.
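The loading steps just described can be sketched roughly as follows. This is only an illustration under stated assumptions: so that it runs offline, it writes a tiny stand-in file with made-up values instead of calling download, and the file contents and the path variable are hypothetical. Only CSV.File and DataFrame come from the narration.

```julia
using CSV, DataFrames

# In the video the file is fetched with `download(url, "covid_data.csv")`.
# Here we write a small stand-in file instead (made-up values; the column
# titles follow the ones mentioned in the transcript).
path = joinpath(mktempdir(), "covid_data.csv")
write(path, """
Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
,Afghanistan,33.0,65.0,0,0
Queensland,Australia,-27.5,153.0,0,1
,US,40.0,-100.0,1,1
""")

csv_data = CSV.File(path)    # the parsed file, as a CSV.File object
data = DataFrame(csv_data)   # convert it into a DataFrame

size(data)                   # (3, 6): three rows, six columns
eltype.(eachcol(data))       # each column has its own element type
```

Empty fields, such as the province of a country that is not subdivided, are read in as missing.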
The first thing that we would like to do is to turn those names into something more readable, so let’s do that with the rename function. data is my original data frame, and I’m going to create a new data frame called data_2 using rename. It takes a sequence of pairs. A pair in Julia is this object written with an equals sign followed by a greater-than sign, =>, which represents an arrow. So this means map 1 to the word province: in other words, take column one and rename it to province, take column two and rename it to country. Then I’m using the head function from DataFrames to look at just the first few rows of the table, and you can see that the names of the columns have indeed been changed correctly.
There’s an alternative: instead of using rename, we can use rename!. This exclamation mark is often pronounced “bang” in the context of computer programming, and in Julia it means that the function modifies its first argument. Since the first argument is data, what we’re doing is changing the original data frame. There are some circumstances where this is a good idea, especially if it’s a big data frame and you don’t want to copy everything, and there are others where you might want to retain exactly the original form of the data as you read it in, to be able to compare later. This corresponds to what we call out-of-place and in-place operations: out-of-place is when we copy all of the data over, and in-place is when we modify the existing data.
So now let’s extract some information. The first thing we want to do is get a list of all the countries, so we need to access the country column of this data frame, the column which has the name country. There are several ways of doing this in the package. One of them is data[:, "country"]. We can think of data as being sort of like a matrix.
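As a rough sketch of the difference between rename and rename! described above, here is a small example on a hypothetical two-row table (the values are made up). One assumption to flag: recent versions of DataFrames no longer provide head, so the sketch uses first(df, n) to show the first rows.

```julia
using DataFrames

# Hypothetical miniature table carrying the original column titles
data = DataFrame("Province/State" => [missing, "Queensland"],
                 "Country/Region" => ["Afghanistan", "Australia"],
                 "Lat"  => [33.0, -27.5],
                 "Long" => [65.0, 153.0])

# Out-of-place: returns a renamed copy, `data` itself is untouched
data_2 = rename(data, 1 => "province", 2 => "country",
                      3 => "latitude", 4 => "longitude")
names(data)     # still the original titles
names(data_2)   # ["province", "country", "latitude", "longitude"]

# In-place: the bang version modifies `data` itself
rename!(data, 1 => "province", 2 => "country",
              3 => "latitude", 4 => "longitude")
first(data, 2)  # look at the first two rows
```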
The data frame is sort of like a matrix, and we’re going to use slice notation, just like we did for matrices. The colon here means all of the rows of the data frame, and we’re choosing just the country column. Another way of writing that, which you’ll also see, is data[:, :country]. Here :country is what’s called a symbol, and that returns the same result, as we see here; I’ve called it all_countries2. And let’s do all_countries_3, which is data[:, 2]. That extracts the second column of the data frame and gives me the same result. So those are different options for extracting data. Of course, if we want only a certain set of rows, we can specify that too: for example, data[5:8, 2] chooses rows five through eight and column two, and gives me the countries in only those rows.
Now it turns out that in this data set some countries are divided into provinces, so when we have this list of countries there will be some repetitions; for example, Australia is repeated several times. So we’re going to use the Julia function unique to take that list of countries and return a new vector which lists each country only once. Now we can use a slider in Pluto to scroll through and look at all of those countries. Instead of a slider, we could also use the Select function, which gives us a drop-down box where we can choose which country we would like.
Now suppose that we are interested in the information for only one country to start with, for example the United States. We need to find out how that country’s name is represented in the data set. Take my home country, the United Kingdom: the official name is United Kingdom, but sometimes it’s also called Great Britain or England, depending on the context. So let’s look at the countries that start with the letter “U”. One way of doing this is with an array comprehension.
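The column-access and unique steps described above can be tried on a small made-up table. The table contents here are hypothetical, and the row range is 2:4 rather than 5:8 only because the stand-in table is short.

```julia
using DataFrames

# Hypothetical stand-in for the renamed data frame
data = DataFrame(province = [missing, "New South Wales", "Queensland", missing, missing],
                 country  = ["Afghanistan", "Australia", "Australia", "United Kingdom", "US"],
                 cases    = [0, 3, 2, 5, 7])

all_countries   = data[:, "country"]   # column by string name
all_countries2  = data[:, :country]    # column by symbol
all_countries_3 = data[:, 2]           # column by position

all_countries == all_countries2 == all_countries_3   # true

data[2:4, 2]                           # only rows 2 to 4 of column 2

countries = unique(all_countries)      # each country listed only once
```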
So we’ll use the startswith function, which takes a string. Does “david” start with the letter “d”? The answer is that it does, whereas if I ask whether “hello” starts with the letter “d”, the answer is that it does not. Let’s pick out just the countries that start with “U”. We do this by iterating through the set of all countries and checking whether each one starts with a “U”. We get back a vector of Booleans, false or true, and if we scroll through it we see that it tells us, for each row in the data table, whether that particular country starts with a “U” or not. Then we can use that to filter the data table: we want to extract only the rows for countries that start with “U”, and here they are. There are 17 rows corresponding to those countries, but, as we see, the United Kingdom, for example, has several different territories associated with it. And so the first piece of information is that the correct way to spell the United States in this data table is indeed “US”. It’s not “USA”; it’s not “United States of America”. We need to refer to it as “US”.
An alternative way to run this filtering operation is literally to use the word filter. This is a function in Julia, a higher-order function (there’s a separate video about higher-order functions), and we pass in an anonymous function to select just the United Kingdom in this case.
What happens if we want to extract just the row corresponding to the United States? One way of doing that is to use the findfirst function. Here I’ve used a different way of writing a comparison with “US”. What I’m doing is looking through all of the countries, which is a vector, and saying: please find me the first one where that country is equal to the string “US”. findfirst returns the index into the vector, that is, the number of the element where “US” is found for the first time, and in this case it’s row number 243. Now I have a single number.
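Here is a minimal sketch of the three approaches just described: a Boolean mask built with an array comprehension, filter with an anonymous function, and findfirst. The table is a hypothetical five-row stand-in, and the names U_countries, uk and US_row are illustrative, so the row number found here is 5 rather than 243.

```julia
using DataFrames

# Hypothetical stand-in table
data = DataFrame(country = ["Afghanistan", "United Kingdom", "United Kingdom", "Uruguay", "US"],
                 cases   = [1, 2, 3, 4, 5])
all_countries = data[:, :country]

startswith("david", "d")   # true
startswith("hello", "d")   # false

# Array comprehension: one Boolean per row of the table
U_countries = [startswith(country, "U") for country in all_countries]
data[U_countries, :]       # keep only the rows where the mask is true

# filter takes an anonymous function that is applied to each row
uk = filter(row -> row.country == "United Kingdom", data)

# ==("US") is a one-argument function that compares its input with "US"
US_row = findfirst(==("US"), all_countries)
```

Note that startswith is case-sensitive, so the letter passed in has to match the case used in the string.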
I can index with just that row number, and that gives me a row of the data frame back. Just be aware that if instead I index with a range that contains a single number, so the range from 243 up to 243, I actually get back a whole data frame; it’s just a data frame with a single row. You see that those two operations return different types. So if we want to extract a row, we really need a single row index.
Now let’s try to extract the data into a Julia vector. We can do it with just Vector of the data, but we need to remove the province, country, latitude and longitude. So we only want the elements from the fifth column onwards, and once we’ve done that we get a vector of just integers. It’s also, of course, possible to do many operations directly on data frames and to plot data frames directly.
Now let’s also look at how to process the dates. If we use the names function provided by DataFrames, that gives me a vector of strings, and those are the column names. So the column names are returned as strings, and that includes these dates, which are returned just as strings: 1/22/20 means January 22, 2020. So we have to interpret those column names in the right way. In particular, for the dates, we want to convert those strings into real Julia Date objects. The way we do that is, firstly, to extract the strings into a separate vector called date_strings; this is just the column names corresponding to the dates. And now we need to parse those: p-a-r-s-e. What does parse mean? It means take a representation of something as a string, interpret it, and build an actual Julia object from it. So one of these dates looks like this. What we’re going to do is use the Julia Dates library, which is a standard library, and which provides an object called DateFormat. We specify the format in which the date is written.
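The row-indexing, vector-extraction and column-name steps from earlier in this part can be sketched on a hypothetical two-row table (made-up values; the variable names are illustrative).

```julia
using DataFrames

# Hypothetical stand-in with four descriptive columns and three date columns
data = DataFrame("province" => [missing, missing],
                 "country" => ["United Kingdom", "US"],
                 "latitude" => [55.0, 40.0],
                 "longitude" => [-3.0, -100.0],
                 "1/22/20" => [0, 1],
                 "1/23/20" => [0, 1],
                 "1/24/20" => [0, 2])

US_row = findfirst(==("US"), data.country)

row = data[US_row, :]           # single index: a DataFrameRow
sub = data[US_row:US_row, :]    # one-element range: a DataFrame with one row

# Drop the first four columns and convert the rest of the row to a vector
US_data = Vector(data[US_row, 5:end])

# Column names come back as strings; keep only the date columns
date_strings = names(data)[5:end]
```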
So, in this case, we know that the month comes first, then the day, and then the year, and there’s a particular format for writing that; you can go and read in the documentation about why I have a capital letter Y here. So what happens when we parse this first date? Well, we see that we actually get the wrong year. Why? Because in the original data set all that’s written is 20. It’s implied that this is the year 2020, but it’s not written explicitly, and so the parsing gets that wrong. The way we fix that is by manually changing the date, and to do that we can just add this Year object, which also comes from the Dates package. So what we’re going to do is parse the vector of date strings element by element with the correct format, turning each one into a Date object. When we do that, we get a vector of Date objects that we can then pass into plots, as we saw in the other video.
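The date handling described above can be sketched with the Dates standard library alone. The three strings below are hypothetical examples in the same month/day/two-digit-year style, and the shift by Year(2000) assumes every date in the data falls in the 2000s.

```julia
using Dates

# Hypothetical column names in month/day/two-digit-year form
date_strings = ["1/22/20", "1/23/20", "12/31/20"]

format = DateFormat("m/d/Y")

# Parsing one string literally gives the year 20, not 2020
first_date = parse(Date, date_strings[1], format)   # 0020-01-22

# Parse element by element (note the dot), then shift every date by 2000 years
dates = parse.(Date, date_strings, format) .+ Year(2000)
```

The result is a vector of Date objects, which is more convenient than raw strings for date arithmetic or for labelling an axis.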