Transcript:
Hello, and welcome to another Ilabs tutorial. Today we’re going to be exploring how to create histograms using the ggplot2 package. Histograms are a really great way to eyeball the data: are certain values more frequent than others, and is the distribution symmetrical, skewed, or even bimodal? Today we’ll be introducing the basics of creating histograms using the geom_histogram object within ggplot2, and then we’ll be manipulating certain characteristics of the histogram, like splitting it into multiple panels, adding color and, of course, tidying up the graph to make it look pretty. Finally, we’ll also introduce how to produce density plots. Initially, there seems to be a lot going on when you produce a ggplot, but it’s quite simple when you break it down into its small parts, such as the ggplot object, the layers that you add to this object, and the global options that you can change. For more information on the specifics of these individual components, please see our other videos on creating box plots and scatter plots using the ggplot2 package.
OK, so let’s go ahead and move over to RStudio. Today we’re going to be using a data set from Ernest’s 2003 paper in Ecology, which looked at the life history characteristics of non-flying mammals. In particular, we’re going to plot histograms of the gestation period of different mammal groups. So download the data from the website given or from the link below, save this file into a directory on your computer, and then set the working directory of R to this directory. Now you can read in the data using the read.csv function, just using the data set’s file name without having to type in the directory a second time. Let’s have a look at the columns that we have in this new data frame. We have information about the species names.
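As a rough sketch of the read-in step just described: the file name, the column names, and the three example rows below are all hypothetical stand-ins (the transcript does not give them), and a tiny CSV is written to a temporary directory only so that the example runs on its own.

```r
# Hypothetical stand-in for the downloaded file: a tiny CSV written to a
# temporary directory. File name, column names and values are assumptions.
data_dir <- tempdir()
csv_path <- file.path(data_dir, "mammals_example.csv")
writeLines(c(
  "order,species,mass,gestation,litter.size",
  "Carnivora,Species A,5000,3.5,3",
  "Rodentia,Species B,30,0.7,5",
  "Artiodactyla,Species C,90000,-999,1"
), csv_path)

# Set the working directory once, then read the file by name only.
old_wd <- setwd(data_dir)
mydata <- read.csv("mammals_example.csv")

# Look at the columns of the new data frame.
names(mydata)
str(mydata)

# Restore the previous working directory.
setwd(old_wd)
```

With the real file, only the `setwd()` path and the name passed to `read.csv()` would change.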
We have information about mass, gestation period in months, and even litter size, which is the average number of offspring that the species generally produces. Now, I know from previous experience that this data set includes some strange values in some of the continuous variables. The authors used the convention of putting minus 999 in the data to fill in cells where there isn’t a value. These can be easily excised from our data by indexing rows where gestation is greater than zero, mass is greater than zero, and litter size is greater than zero. Now that we’ve performed this basic cleaning task, we can go ahead and run the first coarse histogram. Here you can see that we’ve got gestation along the x-axis and a count along the y-axis, and we can see that for all the mammals in our data set, we’ve got this skewed distribution.
Okay, so let’s go on and manipulate one of the simplest characteristics, which is the bin size. Here you can see we’ve made the bin size a little bit smaller than the default setting, and we can play with this to make it smaller or larger as we see fit. Next, we can go ahead and subset this plot and produce many panels, according to the different mammalian orders. To do this, we just add facet_grid and specify that we want the rows to be according to the mammal order, and that we’re not going to create panels according to the columns. From this, we can see that there’s a lot going on, and we can’t quite see the data very clearly, because certain orders have a lot of data, whereas other orders are more poorly sampled. To overcome this, we can focus in on the groups that are best sampled, so let’s work out which those are. Let’s use the xtabs function, which will give us the number of observations in each mammal order. From this, we might want to zoom in and focus on the ungulates, carnivores, and rodents.
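The cleaning, histogram, bin-size, faceting, and counting steps described so far might look roughly like this. It is a minimal sketch on simulated data: the column names (`order`, `mass`, `gestation`, `litter.size`), the order labels, and the bin width of 1 are assumptions, not values taken from the video.

```r
library(ggplot2)

# Simulated stand-in for the mammal data (all names and values are assumptions).
set.seed(1)
n <- 300
mydata <- data.frame(
  order = sample(c("Artiodactyla", "Carnivora", "Rodentia", "Primates"),
                 n, replace = TRUE, prob = c(0.3, 0.3, 0.3, 0.1)),
  mass = rlnorm(n, meanlog = 7, sdlog = 2),
  gestation = rlnorm(n, meanlog = 1, sdlog = 0.6),
  litter.size = round(runif(n, 1, 6))
)
# Mimic the -999 placeholder used for missing cells.
mydata$gestation[1:10] <- -999
mydata$mass[11:15] <- -999

# Keep only rows where all three continuous variables are positive.
mydata <- mydata[mydata$gestation > 0 &
                 mydata$mass > 0 &
                 mydata$litter.size > 0, ]

# First coarse histogram with the default binning.
p_coarse <- ggplot(mydata, aes(x = gestation)) +
  geom_histogram()

# Same plot with an explicit, smaller bin width (value chosen arbitrarily).
p_bins <- ggplot(mydata, aes(x = gestation)) +
  geom_histogram(binwidth = 1)

# One panel per mammal order in rows; "." means no column panels.
p_facets <- p_bins + facet_grid(order ~ .)

# Number of observations in each order.
order_counts <- xtabs(~ order, data = mydata)
print(order_counts)
```

Calling `print(p_coarse)`, `print(p_bins)`, or `print(p_facets)` draws each plot in turn.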
Okay, so to do this, let’s create a new data frame containing just those three orders of mammal. We can index our old data frame, again using the square brackets, and we’re going to specify rows that have order equal to ungulates or carnivores or rodents, so we’re using a conditional statement, and we’ll take all the columns. We’ve assigned this new data frame the name mydata.largeorders, because it’s where we have the largest samples, and we can now use the same ggplot code again, but this time pass it this subsetted data frame. Okay, and this is starting to look really interesting now. We can see for ungulates that there’s a seemingly symmetrical distribution, for carnivores there’s maybe even a bimodal distribution, and for rodents we’re potentially still getting this left-skewed distribution.
We can go further than this and try to investigate why carnivores have this bimodal distribution. To do this, we might want to add another variable to the facet to see if that variable can explain the bimodal distribution. Let’s see if other factors affect gestation period, so why don’t we look at the number of offspring, the litter size? If we’re interested in having this as a new set of panels in our facet, then we should really create a categorical variable of litter size. So why don’t we do this for species that have a litter size of one or a litter size of many? To do this, we can use the vectorized function ifelse. We call our data and say that if litter size is greater than one, we’ll return “many in litter”, and if it’s not greater than one, we’ll return “one in litter”. This is going to return some strings, and we’re going to make sure these strings are set as a factor. Okay, so let’s go on and view the samples that we have for this three-by-two grid, and we can see straight away that it’s a good job we did this.
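A rough sketch of the subsetting and the litter-size category, on made-up data. The column names, the order labels, and the name `litter.cat` are assumptions, and the sketch assumes that litters larger than one should get the label "many in litter".

```r
# Deterministic toy data (all names and values are assumptions).
mydata <- data.frame(
  order = rep(c("Artiodactyla", "Carnivora", "Rodentia", "Primates"), each = 50),
  gestation = rep(c(2, 4, 6, 8, 10), times = 40),
  litter.size = rep(c(1, 1, 2, 4, 6), times = 40)
)

# Index rows with a conditional statement joined by "|" (or); the empty
# slot after the comma keeps all columns.
mydata.largeorders <- mydata[mydata$order == "Artiodactyla" |
                             mydata$order == "Carnivora" |
                             mydata$order == "Rodentia", ]

# ifelse() is vectorized: it returns one string per row, which is then
# stored as a factor.
mydata.largeorders$litter.cat <- factor(
  ifelse(mydata.largeorders$litter.size > 1, "many in litter", "one in litter")
)

# Sample sizes in the three-by-two grid of order by litter category.
grid_counts <- xtabs(~ order + litter.cat, data = mydata.largeorders)
print(grid_counts)
```

A table like `grid_counts` is what shows whether any cell of the grid is too thinly sampled to plot.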
There are very few examples where rodents have a litter of one, so before we proceed we may want to drop rodents from our visualization and just concentrate on ungulates and carnivores, for litters of many or one. To do this, we’ll just take our data frame and exclude rodents from it. Okay, and now that we’ve done all this messy data handling, let’s jump back in and create a new histogram with ggplot. This time we’re going to add columns to the facet that are sorted by litter size. As you can see, we’ve got a two-by-two panel, and in the columns we have the categorical variable for number in litter, so either one in the litter or many in the litter. If we then zoom in on what’s happening in the carnivores, it’s really interesting: carnivores with a litter of many have a relatively short gestation period of maybe three or four months, whereas carnivores with just one in their litter generally seem to have a much longer gestation period of seven or eight months.
Okay, so this is telling a really interesting story, and we’ve got there by exploring our data with histograms. But if we want to go on and make these histograms publication quality, we might want to start adding some more specifics to the theme option. We can go in here and start changing the main grid lines, the panel background, and the size of the text that is displayed on our figure, and although it seems like a lot of code, it’s actually really quite simple. We could also go one step further and add some color to our different categories of data. To do this, I’ve created a vector called mycolor and given it four color values that I’m going to assign to each of our four categories: ungulates that have many in the litter, ungulates that have one in the litter, and carnivores that have many or one in the litter. At the very end of this, I added one small line of code: scale_fill_manual.
In scale_fill_manual, I specified that the values are equal to mycolor, so I’m passing in the vector that I created just before running the ggplot code. I quite like this addition of color, because the two red colors show that these data are both from carnivores, the two green colors show that these data are both from ungulates, and the different intensities of color reflect the number of offspring.
I’m now going to talk extremely quickly about an alternative to histograms, which is the density plot. Instead of giving a count or frequency, this scales the y-axis according to the proportion of the data, so that the area under each curve sums to one, and it also adds some smoothing to the figure. In this case, I’ve dropped the facets altogether and plotted everything together on one single panel. Whilst this looks quite pretty and is initially attractive, some people have doubts about this type of graph because it seems, in some way, to hide the real data underneath from the viewer.
Okay, so in conclusion, I’ve introduced you to the geom_histogram object that we can add to a ggplot. We’ve also explored how we can use the facet_grid option in ggplot. Here we can manipulate the argument within it, which is a kind of formula where the first term sets the panels arranged in rows and the second term sets the panels arranged in columns. Finally, to sharpen up the figures and make them a little bit more jazzy, we’ve added many small options within the theme function. Thanks for listening to this Ilabs tutorial video. If you found this useful, you might be interested in our free online Moodle course, or check out and subscribe to our YouTube channel, which has many interesting playlists about data handling, statistics, and modeling.
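The later steps of the tutorial (dropping rodents, the two-by-two facet, the theme tweaks, the manual fill colors, and the density plot) could be sketched as follows. The data are simulated, and the column names, order labels, specific colors, text sizes, and the helper column `group` are all assumptions rather than the values used in the video.

```r
library(ggplot2)

# Simulated stand-in for the three best-sampled orders (assumed names/values).
set.seed(3)
largeorders <- data.frame(
  order = rep(c("Artiodactyla", "Carnivora", "Rodentia"), each = 60),
  litter.cat = factor(rep(c("one in litter", "many in litter"), times = 90))
)
largeorders$gestation <- abs(rnorm(
  nrow(largeorders),
  mean = ifelse(largeorders$litter.cat == "one in litter", 7.5, 3.5)
))

# Exclude rodents, leaving a two-by-two design.
d <- largeorders[largeorders$order != "Rodentia", ]

# One fill category per order-by-litter combination (hypothetical helper column).
d$group <- interaction(d$order, d$litter.cat, drop = TRUE)

# Four colors, named so that each one is matched to its category explicitly.
mycolor <- c(
  "Artiodactyla.many in litter" = "darkgreen",
  "Artiodactyla.one in litter"  = "lightgreen",
  "Carnivora.many in litter"    = "darkred",
  "Carnivora.one in litter"     = "tomato"
)

# Rows by order, columns by litter category, plus theme and color options.
p_hist <- ggplot(d, aes(x = gestation, fill = group)) +
  geom_histogram(binwidth = 1) +
  facet_grid(order ~ litter.cat) +
  scale_fill_manual(values = mycolor) +
  theme(
    panel.grid.major = element_line(colour = "grey90"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white"),
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14),
    strip.text = element_text(size = 12)
  )

# Density alternative: no facets, all four groups on a single panel.
p_density <- ggplot(d, aes(x = gestation, fill = group)) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = mycolor)
```

`print(p_hist)` and `print(p_density)` draw the two figures. Naming the elements of `mycolor` is a design choice in this sketch: it avoids relying on the order of the factor levels when colors are assigned.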