Transcript:
Welcome in this video. I’m going to show you how you can create a frequency table like the one shown here using Python. Specifically, I’m using Jupiter Lab and Python 3 You can check my website at my Youtube channel, which has a lot of statistics all for free. A frequency table in itself is just a table showing all the values with some specific types of frequencies. This is for those who are interested shorthand code that that I’m actually will go over step by step, so this might be somewhat longer video. This is then what the result will be, and that’s what we’re going after in this video to actually see how all of this works. I will first need to load some data and I’ll be loading that as a pennis data frame, so I’ll need pandas. If you don’t have panels installed then use, I think it’s BIP install pandas because I need to read Csv file just so I can get my data in. I use the I get a warning here. That’s because it’s a huge file that has actually been converted somewhat note. It’s just a warning, so it’s not an error actually, and it’s just a huge data file with all kinds of information in there, but in particular. I’m going to use them. Married columns from Mar One and two simply get a quick look of the counter. So that’s all you’re interested in. You’re quickly done. We can use the value counts from pandas, and that will actually give me for each category nicely, the frequency count. So this is actually a panda series, the my frequency table, so if you really want to see that as a panel’s data frame, but me nice to keep these separate and keep the value separate, so I’m going to be using keys to store the different categories and the corresponding values for the frequencies, and then I can actually use PD for panda’s data frame and then make this into one of these with name of the column. That’s going to be my keys and then the frequencies that’s going to be the values, and that gives you a nice looking table with just the frequencies if you like you can reorder D. So if they are, for example in a numerical order, you might want to have them sort it. As such, you can simply use sort values, but it might be useful then to also drop the reset the index so that also these will then simply be again zero to four. So if I run this and then show the result again, then you know that notice that these are again nicely zero to four, but these are now in in alphabetical order, you might want to add percentages to it, which is simply this one each of these divided by the sum of these times a hundred. So that’s exactly this calculation. I select the column divided by and here you can see now and it sum and then multiplied by a hundred to get a percentage otherwise their proportions, so we can say that about 50% is married. We can also add cumulative percentages the in this case, it wouldn’t actually make much sense because cumulative means to add up. And then we can say something like, what is it? 66% Is married or less that doesn’t make much sense, but if there’s a logical order in the categories, it would and, for example. SPSS will always show cumulative percentages. There is also a comes some option in actually pandas, so we can simply use that, and it will add nicely the cumulative percentages What I’m actually doing here is I’m doing the cumulative sum over the frequencies and divide, then again by the total of frequencies you can also, actually in this case do to come some function over the percentages directly. I’d like to avoid that, usually if if I can because there might be some rounding in the percentages. And then you get accumulated rounding errors in your cumulative percentages all right, now, if we take a closer look at that original variable and we’ve Len, we can see how many records that were actually in there and there were 1974 But if I sum up my frequencies, I only have 1941 so it means that there are 33 records actually missing. They weren’t gone. That’s because they were set to nuns. And those are ignored Independence frequency lift accounts to actually add them in. We can use a little trick, which is to put everything as a type of string, and then the nuns will also be converted to strings. So if I do that, so I call this more adjusted and I have a new column in my data frame called marital Mar adjusted, and now if I run the frequency table that one, you notice that that num appears and that actually has 33 so I can then do it again. The exact same calculations as before, so I’m going to go over all of that step-by-step again, and then we can nicely get the results with those missing values in there. Now, one thing is, though, that we usually don’t want to have the percentages based on the ones, including the missing values, so we need to find a way to also have the percentages excluding the missing values. And those are then at least in. SPSS called valid percentages. So that’s? A little bit trickier to do. That’s why this video took a little bit longer. I’m actually going to pretend just to show you how it works. That also separated was a missing value. It’s not, but just so you get the hang of how this might work. If you have multiple labels for missing values, then this will work what I do is I first put them into a separate list, so all the ones that are missing values in this case, it’s none and separate it separate it, not really, but okay, and what I’m going to do Is I want to create an array with zeros and ones if a category is missing or not, it is a missing category. I’d like to store those as an array. I could use Python’s own arrays, but I usually often you’ll see numb by being used for arrays, So I’ll simply use import Numpy. S&P might have to do Pip install. If you’ve never used Numpy before, and then I can actually feel that array first gonna make a copy or actually a second reference to my frequency table. I’m gonna create an empty array for now. So that’s this line and then I’m gonna check for each eye So for each category in my frequency table, If it’s actually in this one, so that way, I can actually check if it’s missing or not, I’m gonna assume at first that it’s not missing and then I’m gonna loop through my missing categories, and if it’s actually in there, then I’m gonna reset it to zero so by default. I assume that the category is not missing. I’m gonna check if it appears in mind. Missing categories list that was up here, so we can clearly see that, and if it’s if it finds it in there, it adjusts it to a zero then. I’m gonna add it to my missing array. And that should give me a list of ones and zeros showing whether or not something is a missing category or not, so let’s run that and indeed. I get an out a nice list with the last two being a zero, which are the missing values. I’m gonna scroll a little bit up just so we can see that that indeed at the last two were the ones that were missing values. All right, going further down the line, we can now actually use that that new array, this one and simply multiply certain results with that, so if I’m doing the valid sum, which is the sum of all the frequencies of categories that are not missing, I can simply do the frequency frequency, my frequency table frequencies column and simply multiply that with that array of ones and rose, and then I will simply multiply the frequency of a missing category by zero, and then I can divide the those frequencies by the or all the frequencies by that nous sommes are the sum of all frequencies of non missing values and then multiply that again with my missing array so that I’m ignoring the ones that are missing anyway and multiplied by a hundred so that should give me a nice list of at the last two are now excluded and I have my nice valid percentages and in total there, eighteen hundred and sixty two that are actually counting almost there. I don’t like those zero because it could actually be that. I’ve category that is not a missing category but or missing value but has zero percent or extremely close to zero. So I’m gonna do another look and then actually replace it back with a none if it’s if the category was assigned zero, so let’s do that, and finally I can make my new two columns, and I simply add dos to my frequency table, and that gives me my final result like this where I nicely have nones for my missing categories because they were not relevant. I’ve my valid percentages, and as you can see, those are different as mentioned before. In this example, in particular, cumulative percents don’t make much sense so also accumulated valid. Percents Don’t have much actually addition here. It might be helpful to do all of this in a single function. So I’ve here basically the same code, all of it. I’ll try to drop this. You put their notebook? In the descriptions, you copy it, and that should give me in one. Go my frequency table each time, and that’s kind of it. Actually, if I have missing categories, I can just add them and then it actually has those two knots seems to work in the appendix of this file. These are the references you can actually find a little bit of additional info if you’re using SPSS data and converted that using my other explanation on how to convert SPSS data, I have some useful functions made for myself at least in there where everything can be done in one go, but that’s beyond the scope for this particular video so that was it. I hope this was helpful. If you have any questions, let me know in the comments below. I’m just getting used to fighting myself as well a little bit so curious. What you found out what you think of this video hope? It was helpful.