Transcript:
[MUSIC] Quest! Hello, I’m Josh Starmer, and welcome to StatQuest. Today we’re going to talk about Gaussian Naive Bayes, and it’s going to be clearly explained.

Note: This StatQuest assumes that you are already familiar with the main ideas behind Multinomial Naive Bayes. If not, check out the Quest; the link is in the description below. This StatQuest also assumes that you are familiar with the log function, the normal (or Gaussian) distribution, and the difference between probability and likelihood. If not, check out the Quests; the links are in the description below.

Imagine we wanted to predict if someone would love the 1990 movie Troll 2 or not. So we collected data from people that love Troll 2 and from people that do not love Troll 2. We measured the amount of popcorn they ate each day, how much soda pop they drank, and how much candy they ate. The mean for popcorn for the people who love Troll 2 is 24, and the standard deviation is 4, and a Gaussian, or normal, distribution with mean equals 24 and standard deviation equals 4 looks like this. Likewise, the average amount of popcorn for people who do not love Troll 2 is 4, and the standard deviation is 2, and that corresponds to this Gaussian, or normal, distribution. Now we calculate the mean and standard deviation for soda pop for people that love Troll 2 and draw the corresponding Gaussian distribution. Then we do the same thing for the people that do not love Troll 2. Lastly, we draw the Gaussian distributions for candy. Gaussian Naive Bayes is named after the Gaussian distributions that represent the data in the training data set.

Now someone new shows up and says they eat 20 grams of popcorn, drink 500 milliliters of soda pop, and eat 25 grams of candy every day. Let’s use Gaussian Naive Bayes to decide if they love Troll 2 or not.
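As a rough sketch of the popcorn curves described above, the Python snippet below evaluates the height of each Gaussian curve at the new person's 20 grams of popcorn. The means and standard deviations are the ones stated in the transcript; the function name `gaussian_pdf` and the variable names are illustrative assumptions, and the soda pop and candy curves are left out because their parameters are not stated.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Height of the Gaussian (normal) curve with the given mean and
    standard deviation at the point x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


# Popcorn parameters stated in the transcript.
popcorn_loves = {"mean": 24.0, "sd": 4.0}
popcorn_does_not_love = {"mean": 4.0, "sd": 2.0}

# The new person reports eating 20 grams of popcorn per day.
new_person_popcorn = 20.0

height_loves = gaussian_pdf(new_person_popcorn, **popcorn_loves)
height_does_not_love = gaussian_pdf(new_person_popcorn, **popcorn_does_not_love)

print(f"curve height at 20 g, loves Troll 2:         {height_loves:.4f}")
print(f"curve height at 20 g, does not love Troll 2: {height_does_not_love:.3e}")
```

The first height rounds to 0.06, while the second is vanishingly small, because 20 grams sits eight standard deviations above the does-not-love mean of 4.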
The first thing we do is make an initial guess that they love Troll 2. This guess can be any probability that we want, but a common guess is estimated from the training data. For example, since 8 of the 16 people in the training data loved Troll 2, the initial guess will be 0.5, so we’ll put that up here so we don’t forget. Likewise, the initial guess for does not love Troll 2 is 0.5, so let’s put that here so we don’t forget.

Oh no, it’s the dreaded Terminology Alert! The initial guesses are called prior probabilities.

Now, the score for loves Troll 2 is the initial guess that the person loves Troll 2, times the likelihood that they eat 20 grams of popcorn given that they love Troll 2. Note: The likelihood is the y-axis coordinate on the curve that corresponds to the x-axis coordinate. And we multiply that by the likelihood that they drink 500 milliliters of soda pop given that they love Troll 2, times the likelihood that they eat 25 grams of candy given that they love Troll 2. The initial guess that someone loves Troll 2 is 0.5, the likelihood for popcorn is 0.06, the likelihood for soda pop is 0.004, and the likelihood for candy is a really, really small number.

Note: When we get really, really small numbers, it’s a good idea to take the log of everything to prevent something called underflow. The general idea of underflow is that every computer has a limit to how close a number can get to zero before it can no longer accurately keep track of that number. When a number gets smaller than that limit, we run into underflow problems and errors occur, so we use the log function to avoid underflow. Note: Any log will do, but the natural log, or log base e, is the most commonly used log in statistics and machine learning.

So we take the log of everything, and the log turns the multiplication into the sum of the individual logs. The log base e of 0.5 is negative 0.69, the log of 0.06 is negative 2.8, the log of 0.004 is negative 5.5, and the log of this really, really small number is negative 115. Now we just add this up and we get negative 124, so the log of the loves Troll 2 score is negative 124. BAM!

Now let’s calculate the score for not loving Troll 2. We start with the initial guess that someone does not love Troll 2, times the likelihood that they eat 20 grams of popcorn given that they do not love Troll 2, times the likelihood that they drink 500 milliliters of soda pop, times the likelihood that they eat 25 grams of candy. So let’s plug in the numbers and take the log of everything, and that turns the multiplication into the sum of logs. Now we just do the math, and we get negative 48. And since the score for does not love Troll 2 is greater than the score for loves Troll 2, we will classify this person as someone who does not love Troll 2. Double BAM!

Note: When we look at the raw data, it almost looks like we should have classified this person as someone who loves Troll 2. After all, they ate a lot more popcorn than the average person who doesn’t love Troll 2, and they drank as much soda as the average person who loves Troll 2. However, the big thing is that they ate a lot more candy than the people who loved Troll 2, and the logs of the likelihoods for candy are way different, and this difference is what made us classify the new person as someone who does not love Troll 2. In other words, candy can have a much larger say in whether or not someone loves Troll 2 than popcorn and soda pop, and this means we might only need candy to make classifications. We can use cross-validation to help us decide which things, popcorn, soda pop, and/or candy, help us make the best classifications.

Shameless self-promotion! If you don’t already know about cross-validation, check out the Quest. The link is in the description below. Triple BAM!

Oh no, it’s another shameless self-promotion! One awesome way to support StatQuest is to purchase the Gaussian Naive Bayes StatQuest Study Guide.
It has everything you need to study for an exam or job interview. It’s seven pages of total awesomeness. And while you’re there, check out the other StatQuest Study Guides. There’s something for everyone.

Hooray! We’ve made it to the end of another exciting StatQuest. If you like this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, consider contributing to my Patreon campaign, becoming a channel member, buying one or two of my original songs or a t-shirt or a hoodie, or just donate. The links are in the description below. Alright, until next time, Quest on!
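The log-score arithmetic from the classification walkthrough above can be sketched in Python roughly as follows. The prior, the popcorn and soda pop likelihoods, the candy log-likelihood of about -115, and the does-not-love score of -48 are the rounded values quoted in the video; the variable names are illustrative assumptions, and the underflow demonstration at the end uses a made-up tiny number purely to show the effect.

```python
import math

# Values quoted in the video for "loves Troll 2".
prior_loves = 0.5
likelihood_popcorn = 0.06
likelihood_soda = 0.004
# Only the log of the candy likelihood is quoted; the raw value is not.
log_likelihood_candy = -115.0

# Taking logs turns the product of prior and likelihoods into a sum.
log_score_loves = (
    math.log(prior_loves)
    + math.log(likelihood_popcorn)
    + math.log(likelihood_soda)
    + log_likelihood_candy
)

# The video quotes only the final log score for "does not love Troll 2".
log_score_does_not_love = -48.0

# The class with the larger (less negative) log score wins.
if log_score_loves > log_score_does_not_love:
    prediction = "loves Troll 2"
else:
    prediction = "does not love Troll 2"

print(f"log score, loves Troll 2:         {log_score_loves:.1f}")
print(f"log score, does not love Troll 2: {log_score_does_not_love:.1f}")
print(f"prediction: {prediction}")

# Underflow illustration with a hypothetical tiny likelihood:
# the raw product collapses to zero, but the sum of logs stays usable.
tiny = 1e-200
product = tiny * tiny                       # too small for a 64-bit float
log_sum = math.log(tiny) + math.log(tiny)   # an ordinary negative number
print(f"raw product: {product}, sum of logs: {log_sum:.1f}")
```

Because the quoted inputs are rounded, the first sum lands near -124 rather than on it exactly, which is enough to reproduce the comparison against -48.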