This video lecture will serve as a brief introduction to entropy as we use it in data science. We're borrowing the concept of entropy from physics, where it originated back in the 1860s; you can see here a picture of Rudolf Clausius, who came up with the second law of thermodynamics. What we're going to use it for in data science, though, is as a measure of disorder in a dataset, and then we'll use a related concept, information gain, to measure the decrease in disorder that we achieve when we partition our dataset based on some additional attribute. So let's take a look at the formulas we'll use for entropy and information gain. Entropy, which is always a calculation on a vector of categorical values, is a summation across each of the possible values that the vector can take on: entropy(D) = -sum over i of p_i * log2(p_i). For instance, you might have a vector of political leanings: Republican, Democrat, Independent. For each of those groups, you count how many times it occurs in the vector, so if there are 38 Republicans out of 100 observations, the probability for Republican (say, i = 1) would be 38/100. You then compute 38/100 times the log base 2 of 38/100, do the same for each value the vector D can take on, calculate the sum of all of those terms, and take the negative. The reason for the negative is that the log base 2 of a probability, which is of course a number between 0 and 1, is negative, and we want entropy to be a positive number; taking the negative means more disorder gives a larger positive number rather than a more negative one. For the entropy of a partition, then, all we do is partition the vector D
according to some other vector a. For each of those partitions we calculate the entropy, and then we take the weighted average of those entropies. We'll see an example of this in just a minute, but with the Republican/Democrat/Independent example, you might have men and women tracked in another vector a. We would partition D according to that other vector: we would calculate an entropy for the men and multiply it by the probability of picking a man, calculate the entropy for the women and multiply it by the probability of picking a woman, and then add those up, so we get the weighted average of the entropies across the different partitions. The information gain, then, is just the original entropy minus the entropy from the partition; that's the amount of information we gain when we create the partition. So let's take a look at a visual example of this so we can start to get a feel for what's going on. Here my target attribute is going to be star versus diamond: I'm interested in being able to predict, if I grab an object out of the box, whether I get a star or a diamond. You can see it's a rough split; I've got about half stars and half diamonds. As a matter of fact, I have 49 objects in the box, and 25 of them are stars and 24 of them are diamonds. You'll also notice that we have color codes here: there are some orange diamonds and some blue diamonds, and there are also some orange stars and some blue stars. So our partition might be on color. We're interested in predicting star versus diamond, and our partition will be on color, so we'll create an orange box and a blue box. It's very much easier to predict whether we'll draw a star or a diamond in each of these two boxes compared to the original total dataset, so there's a lot less disorder here.
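The three quantities just described can be sketched as R functions. This is only a sketch; the names `entropy`, `entropy_partition`, and `info_gain` are my own, not necessarily the names the programming assignment requires:

```r
# Entropy of a vector of categorical values, in bits:
# the negative sum of p_i * log2(p_i) over each value's relative frequency.
entropy <- function(d) {
  p <- table(d) / length(d)
  -sum(p * log2(p))
}

# Entropy of d after partitioning it by a second vector a:
# the weighted average of the entropy within each partition.
entropy_partition <- function(d, a) {
  weights <- table(a) / length(a)   # probability of landing in each partition
  parts <- split(d, a)              # d broken up by the values of a
  sum(weights * sapply(parts, entropy))
}

# Information gain: the original entropy minus the partitioned entropy.
info_gain <- function(d, a) {
  entropy(d) - entropy_partition(d, a)
}
```

For a perfectly predictive partition, such as `d <- c("R", "R", "D", "D")` split by `a <- c("m", "m", "w", "w")`, each partition is pure, so `info_gain(d, a)` is a full 1 bit.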
So we would expect a significant information gain when we partition on color. Let's see how these calculations shake out. Just to make a note here again: there are 25 orange objects, 21 of which are stars, and there are 24 blue objects, 3 of which are stars. We'll calculate the full-group entropy first. The entropy is the sum of each of the probabilities times its log base 2, and we take the negative of that at the very end. So let's plug those numbers in. For diamond, there were 24 diamonds out of 49, so we do 24/49 times the log base 2 of 24/49; then there were 25 stars, so we do 25/49 times the log base 2 of 25/49. Add those together, take the negative, and we get an entropy for the full group of about 0.9997. For the orange box, then, let's calculate the entropy just within that box. It's the same process: again we have diamonds and we have stars. There are 4 diamonds out of 25 objects in the orange box, so we have 4/25 times the log base 2 of 4/25, and there were 21 stars, so we have 21/25 times the log base 2 of 21/25. Add those together, take the negative, and the entropy within the orange box is 0.6343, which is quite a bit lower; there's a lot less disorder. Again, if I tell you this is the orange box, you reach in, choose an object at random, and predict it's likely to be a star, you have a much better chance of being correct in your prediction. The blue box works very similarly; it's the same formula. You've got 21 diamonds out of 24, so 21/24 times the log base 2 of 21/24, and you've got 3 stars out of 24, so 3/24 times the log base 2 of 3/24. Add those together, take the negative, and in the blue box the entropy is actually 0.5436, even less disorder, because we only have those 3 stars: one fewer minority object.
With one less overall object as well, we get an even slightly better chance of predicting correctly. For the combined entropy, let's first remind ourselves that the overall entropy was 0.9997; we'll need that in a minute. The combined entropy is going to be the weighted average: there were 25 orange objects out of 49, so we do 25/49 times the entropy of the orange group, and then 24/49 times the entropy of the blue group. Add those together and we get the weighted average of the entropies, which comes to about 0.5899. Our information gain, then, is the original entropy, 0.9997, minus the weighted average of the entropies across the partitions by color, 0.5899, so our information gain is about 0.4097. That's a substantial information gain, and again, just visually, to remind you what that means: by partitioning on color, if I then know which colored box I'm in, it's much easier for me to make an accurate prediction of what I'm likely to draw out. Whereas in the original box it was about 50/50, right? Twenty-five stars, twenty-four diamonds; pick one at random and I'm almost as likely to get a diamond as a star. So partitioning on color makes it much easier for us to predict whether it'll be a star or a diamond. And you can imagine this applied to a real-world scenario, such as jury selection, where for each potential juror we might know whether they're a man or a woman, their religious preference, and their occupation. What we could do ahead of time is a survey in the community where we gather all of those attributes, and we also gather whether each respondent would be leaning guilty or not guilty, and then we can make a prediction for each juror of whether they're likely to be leaning guilty or not guilty.
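The arithmetic above can be checked in a few lines of R. This is a sketch: the `entropy` helper is my own, built from the formula described in the lecture, and the variable names are made up for illustration:

```r
# Entropy in bits of a vector of categorical values.
entropy <- function(d) {
  p <- table(d) / length(d)
  -sum(p * log2(p))
}

# The box example: 49 objects total, split into an orange and a blue box.
full   <- rep(c("star", "diamond"), times = c(25, 24))
orange <- rep(c("star", "diamond"), times = c(21, 4))
blue   <- rep(c("star", "diamond"), times = c(3, 21))

h_full   <- entropy(full)      # ~0.9997, nearly the maximum of 1 bit
h_orange <- entropy(orange)    # ~0.6343
h_blue   <- entropy(blue)      # ~0.5436

# Weighted average of the partition entropies, then the information gain.
h_split <- (25 / 49) * h_orange + (24 / 49) * h_blue   # ~0.5899
h_full - h_split                                       # ~0.4097
```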
You can see how that might be useful for the lawyers during jury selection for trials. If they know that, say, conservative Christian mothers are much more likely to say guilty, and you're the defense attorney, you're going to try to remove that person from the jury. Okay, we do have a couple of notes to make here when we're calculating the entropy. Here's the formula again, but what would happen if one of the p_i times log base 2 of p_i terms involves a probability of zero? What is the log base 2 of 0? What happens if, for instance, you had a box that was all stars and no diamonds? Strictly speaking, you could just ignore that term: if there are no diamonds, don't include a diamond term in the calculation. But that's not as easy to do in R when you're building these functions, so what you can do instead is define 0 times the log base 2 of 0 to be equal to zero, and that will give the correct result. Your job for the assignment, then, is to implement these functions in R so that you can perform these calculations; see the programming assignment for further details.
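In R the troublesome term shows up as 0 * log2(0), which evaluates to 0 * -Inf = NaN. A guard like the one below (a sketch; the function name is my own) applies the "0 times log 0 equals 0" convention:

```r
# Entropy with the convention that 0 * log2(0) = 0, so categories with
# zero probability (e.g. unused factor levels) contribute nothing.
entropy <- function(d) {
  p <- table(d) / length(d)
  terms <- ifelse(p == 0, 0, p * log2(p))   # guard against 0 * -Inf = NaN
  -sum(terms)
}

# A box of all stars: without the guard, this would return NaN whenever
# "diamond" remains a possible (but absent) category.
all_stars <- factor(rep("star", 10), levels = c("star", "diamond"))
entropy(all_stars)   # 0: no disorder at all
```

Note that for a plain character vector, `table` only reports categories that actually occur, so the zero case arises mainly with factors that carry unused levels.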