Transcript:
What’s going on, everybody? Welcome to my first pass through the Kaggle Data Science Bowl 2017 challenge. My aim here is just to show you my first pass through dealing with this data, seeing what the data is trying to model, and just seeing what happens. I think for a lot of people, jumping into a challenge right out of the gate, or into any of these competitions, can be really difficult because there’s so much you have to digest in a really short amount of time. Also, a lot of times the data is really big and it’s not pretty data. If you follow tutorials, even a lot of my tutorials, the people who make them try to keep the datasets as simple as possible so we don’t spend a whole lot of time cleaning and processing, because that’s not exciting. But when it comes to actually doing real-world stuff, that’s what you’ve got to do. So that’s what this tutorial is going to be focused on: getting through the data. Then we will apply it. We’re going to run a 3D convolutional neural network against this data and just see what happens.
So this is my real first pass. It’s going to be raw. There are going to be errors, probably issues in the code and all that. Feel free to correct me, feel free to improve anything you want. My goal is that anybody can follow along. We’re going to be covering a lot of concepts, a lot of topics here, and I can’t go into depth on all of them. We’re going to be using pandas, Matplotlib, OpenCV, TensorFlow, and probably a few more that I’m forgetting. If you don’t know those or you’re not familiar with them, that’s totally fine. You can ask for help. Also, I’ve got tutorial series on all the things I just listed, and for everything we’re going to cover here I do have tutorials already out. So if you’re confused, you can check those out: go to pythonprogramming.net and just search for them. Also, as we go along...
At least the first time we hit any of these libraries, I’ll try to bring it up one more time and tell you where you can find more information on them. So there we have it.
Now, if you’re new to Kaggle, you can go to kaggle.com, create an account, and head to competitions. The way these work is pretty much the same across the board. We’re going to be working on the Data Science Bowl 2017, but I’m going to use this other one as an example, because most of the challenges seem to be more like this. The Data Science Bowl has one little wrinkle that’s a little different, actually a couple. So first I’m going to reference this one, because it’s a little easier for me to illustrate how Kaggle typically works.
You can click on a challenge. These are the paid challenges right now, but they also have other competitions going on. They’re pretty much always running a digits one and always running Titanic; those just keep cycling through, basically. So not all of them are paid. The ones that actually have prizes are going to be more competitive, and they’re much more challenging to actually win. Sometimes they have other things going on too. I’ll point out everything with the Data Science Bowl, but first let’s just talk about how these are typically structured.
So you’ve got this Nature Conservancy Fisheries Monitoring challenge. We come down here, and I’m not going to read the whole thing to you, but basically the task is to identify the fish that these people in the boats are catching. That’s the task. We’ll come to the data last, but you can go to kernels. “Kernel” is basically a fancy word for script; these are just scripts that people have written. You can see what the language is. I think they just have R and Python.
I’ve never seen Java or C or anything like that here, but I’m not positive on that. It’s primarily Python. You can see that people have both scripts and IPython notebooks; both are acceptable as a kernel. So you can check those out for examples. Basically, what I’m going to be running you through here is going to be a kernel. So those are your kernels.
Then there’s the discussion board. This has great information. You would be doing yourself a huge disservice if you didn’t read the discussion board; there are a lot of really great things that people just openly share there that will help you out. Then you’ve got the leaderboard. This is how you can see how people are doing in the competition.
For how the competition is scored, we’ll go to overview and evaluation. A lot of them are scored on log loss. For example, this one has eight different classes of fish, but the one that we’re going to be doing works with cancer data, so it’s basically either cancerous or not. That’s a binary thing, but your actual model is not necessarily just all wrong or all right. Models work in degrees: the model basically gives a percentage for how strongly it thinks something is the case. It’s not just “yes, it is” or “no, it’s not.” The only time we get that is when we apply an argmax function. Right before we do the argmax to get the yes-or-no answer, we usually have a probability.
So with log loss, the person whose predictions most closely fit the actual true scenario scores best. That’s generally the most fair way of going about it. There are problems with it, but that’s also the way the actual Data Science Bowl is evaluated. Let’s also look at the prizes for this one.
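Before getting to the prizes, a quick aside to make the log loss idea a bit more concrete. This is only a minimal sketch of binary log loss in plain Python, not Kaggle’s actual scoring code. The function names `log_loss` and `argmax` are written here for illustration, and the labels and predictions are made-up numbers. It shows why a probability scores differently from a hard yes-or-no answer.

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary log loss. y_true holds 0/1 labels, y_pred holds P(label == 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip the probability so we never take log(0).
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

def argmax(values):
    """Index of the largest value, i.e. the hard class decision."""
    return max(range(len(values)), key=lambda i: values[i])

# Made-up labels and predictions, purely for illustration.
labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]   # right, and fairly sure about it
hedged = [0.6, 0.4, 0.6, 0.4]      # right, but not very sure
hard = [1.0, 0.0, 0.0, 0.0]        # yes/no answers, third one wrong

print("confident:", log_loss(labels, confident))
print("hedged:   ", log_loss(labels, hedged))
print("hard:     ", log_loss(labels, hard))

# argmax turns [P(no cancer), P(cancer)] into a single class index.
print("argmax of [0.3, 0.7]:", argmax([0.3, 0.7]))
```

The lower the score, the better. Being confidently wrong on a single example costs a lot, which is why you generally submit the probabilities rather than the argmax output.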
First place is fifty thousand, all the way down to fifth place, which gets ten thousand, and then you can see the timeline and all that.
OK, actually I want to show one more thing, the most important thing: the data. Most of the competitions are shaped like this. This one is a sample submission; we’ll talk about that last. We’ll start with train. The training data generally consists of your data and the associated labels. This is what you’re going to feed through a supervised machine learning algorithm in most cases. You say: here’s the data, here’s the class; here’s the data, here’s the class label. You fit on that, and then you get to test. Test is just the data, no classes. So you run your model over the test data, predict the classes, and write the predictions out to a CSV file that usually has rows like ID, prediction; ID, prediction; and so on. The sample submission is just an example to show you what your file should look like when you upload it.
Finally, when you’ve created this file, you submit your predictions and you’re immediately put on the leaderboard, so you can see how you did. In general, you get something like two or three submissions a day. You can’t do too many, because otherwise you could make a model that fits the leaderboard: you submit one prediction, change some things, submit a second prediction, and if you keep doing that you can just fit the test answers and cheat. So that’s the problem.
Now, coming back to competitions, let’s talk about the Data Science Bowl. This one is a little different. There’s a million dollars in prizes, and this one also has a different prize structure.
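Going back to the submission file for a second, here’s a rough sketch of writing ID and prediction rows out as CSV with Python’s standard `csv` module. The helper `write_submission`, the column names `id` and `cancer`, and the patient IDs are all assumptions for illustration; always check the competition’s own sample submission for the exact header it expects.

```python
import csv
import io

def write_submission(predictions, out, id_col="id", pred_col="cancer"):
    """Write (id, probability) pairs as CSV rows to a file-like object.

    The default column names are assumptions; match them to the
    competition's sample submission file.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([id_col, pred_col])
    for sample_id, prob in predictions:
        writer.writerow([sample_id, f"{prob:.4f}"])

# Made-up IDs and probabilities, purely for illustration.
preds = [("patient_a", 0.5), ("patient_b", 0.12), ("patient_c", 0.87)]

# Writing to an in-memory buffer here; pass an open file to save to disk.
buf = io.StringIO()
write_submission(preds, buf)
submission_text = buf.getvalue()
print(submission_text)
```

To save a real file, you would open it with `open(path, "w", newline="")` and pass that in place of the buffer.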
So basically, first place is five hundred thousand, all the way down to tenth place. You also get five thousand for the top three most highly voted kernels. I’m going to be submitting this as a kernel, so if you found it useful, please do give me your vote. There are also other kernels that you’ll definitely want to check out. Kernels and discussions: same thing I said before. Go to those and look through them. You’re going to find really useful information in both that people just share freely. Then there’s also ten thousand dollars for, basically, sharing stuff. Those are announced as time goes on. Right now it’s a social media thing, so if you use their hashtags, you’re automatically entered.
Okay, so that’s that. I’ll get to the data last. You can read the description to understand what’s going on, but basically we’re looking at CT scans, CAT scans, these low-dose CT scans, to see whether someone has a cancerous tumor or not. That data is CT scans, but there are many of them: the scans actually stack on top of each other and create a 3D rendering of whatever was scanned, in this case the chest cavity. So that’s the data that we have. We haven’t seen it yet, but we’ll get there.
Again, the evaluation is log loss. You can come here for the resources; there are really only two there. Then there’s the timeline, and this one’s a little different. There are two stages. In the first stage, you have testing data that we don’t yet have the answers to.
So you’ll create your first submission on the data that we have right now. Then, after the deadline, the answers to the validation set you have right now are released, and you get the final test set. You’ve got to compete in both stages to be eligible, and having two stages probably makes it even harder to cheat the actual leaderboard. Even if you do cheat the leaderboard, you would not win the competition, because you have to share how you did it. You would be found out and lose your number one position anyway. But it can be demoralizing to other people if someone is way ahead of everyone else, or if a bunch of people are cheating; it can make you think you can’t win.
So I think we’re ready to rumble. That’s my quick introduction to Kaggle. What I’m going to leave you with is getting the data. There’s a bunch of information here. Basically, you’ve got the actual data, which you need a password to open; the password is contained in this file. You need the sample images, you need stage 1 and the stage 1 labels, and this is the sample submission. What I found to be best was to download the torrent, because it downloaded so much quicker, though it might not for you. It is 67 gigabytes, so it’s a big file, and by the time you extract everything it’s about 140 gigabytes. That’s a very long download, so go ahead and get that started.
If you don’t have space for that data, that’s also totally fine. I’m going to do my best to structure this in such a way that you can still follow along: if you go to Kernels and click New Notebook right here, you should be able to follow along with me through this entire tutorial.
Your model is not going to be good if you do it this way, because you’re going to be working with a really small dataset, but you can still follow along and still learn a bunch. So if you don’t have space for it, or you don’t have a computer that’s going to be happy processing that much data, feel free to just follow along in this notebook here. You’ll only have to change like two things as we go. Well, more than two things, but like five things or something. Not much.
Okay, so that’s what this series is going to be all about. Get that data downloaded, and whenever you’re ready, I’ll see you in the next video.