Transcript:
Hello, guys, welcome to my YouTube channel. Today, in Session with Sumit, we are going to look at how we can load a huge file. It sometimes happens that you get a CSV file, or some other kind of file, that is really huge, and when you try to load it into Python, maybe using a Jupyter notebook, you get an error, and very frequently that error is a memory issue, right? So what we will do is look at how to load a sample of that file so that you can do a quick analysis. We'll use the normal pandas library and try to load the file without loading all of it into RAM. This is a kind of trick that I am going to teach you today, and it will make sure that we have loaded a random sample of the complete file so that we can do a quick analysis on top of it. So let's get started.
First of all, let me import pandas as pd. We just need the pandas library, and one more thing: we need a package known as random, so let's import random too. Now let me declare the file name. I have this file over here, telco_churn.csv, so let me just declare it as f, and let me run this as well. Now we will try to load the file using a built-in function of Python. Python provides a built-in function that can load a file as normal text, as a raw string, and that function is known as open. So let me write open and create an object for it: r is equal to open(f). I'm basically opening this file, telco_churn.csv, using the built-in open, and holding it in this object known as r. Let me run it. Now I will just try to read it, and once I read it, you can see that the complete file is loaded as a normal string.
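As a rough illustration of the open-and-read step described here, the sketch below writes a tiny stand-in CSV to a temporary folder and reads it back as a single string. The file name telco_churn_demo.csv and its two columns are made up for this example and are not the file from the video.

```python
import os
import tempfile

# Hypothetical stand-in for telco_churn.csv: a tiny CSV in a temp folder.
f = os.path.join(tempfile.mkdtemp(), "telco_churn_demo.csv")
with open(f, "w", newline="") as out:
    out.write("id,value\n")
    for i in range(5):
        out.write(f"{i},a\n")

# open() returns a file object; read() pulls the entire file into one string.
with open(f) as r:
    text = r.read()

print(type(text).__name__)   # str
print(text.splitlines()[0])  # id,value
```

Because read() holds the whole file in memory at once, it is fine for a quick look at a small file but is exactly the step to avoid on a very large one.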
The complete file is a string, and you can see how huge this file is, right? This is just sample data, but you can use any big file that you are trying to load. We don't actually need to do this, reading the whole thing. What we'll do instead is use open to find out how many rows this particular file has. So we are going to write a comprehension, not a list comprehension but a generator expression inside sum: num_lines is equal to sum of 1 for l in open(f). Now if you look at num_lines, you can see that we have about 7044 rows in this file, telco_churn.csv.
Since we only need a sample of this file, I basically want to load random rows from it. What we'll do is take 50 percent of the data, that is, 50 percent of the rows. So my size will be equal to num_lines integer-divided by 2. This size is about how many rows we will load into pandas using the pandas read_csv command, so 3522 rows. Now, just to make sure that we are not simply taking the first 3522 rows or the last 3000-odd rows, we will take a random subset. For that, we will create a random list of numbers between 1 and this number of lines. Let me just write it, and that will make clear what I'm trying to say. I'll call this list ids.
I'm using the random.sample function: random.sample, then a range from 1 to num_lines, and then how many numbers I need, which is size. What am I saying here? I'm saying: generate random numbers between 1 and this number of lines, and generate this many of them. So basically I want about 3000 random numbers between 1 and 7044. If I run this and check my ids, you can see that ids is a list of random numbers between 1 and 7044. Let's check the length of this list: it will be 3000-something, and it is 3522, exactly equal to size.
So what have we done until now? We opened the file and counted its rows into this variable num_lines, so we know that we have 7044 rows in the file. Then, since I wanted to load just half of the file, a random half of its rows, I did this integer division, and that is the number of rows I wanted to load. Then I created a random sample, basically a random list of numbers, and these numbers are row indexes, which I'm going to use when loading with pandas. Now that I have these ids, I can say df is equal to pd.read_csv, my file name is f, and then I write skiprows is equal to ids. What's the meaning of this? I'm basically telling pandas: read this file, but skip these rows.
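The sampling step can be sketched on its own, using the 7044 line count quoted in the video. One detail worth spelling out: range(1, num_lines) stops at 7043, and a list passed to skiprows is treated by pandas as 0-indexed line numbers, so starting the range at 1 is what keeps line 0, the header, from ever being skipped.

```python
import random

num_lines = 7044        # line count quoted in the video, header line included
size = num_lines // 2   # integer division -> 3522

# range(1, num_lines) covers 1..7043, so line 0 (the header) is never drawn.
ids = random.sample(range(1, num_lines), size)

print(len(ids))               # 3522
print(len(set(ids)) == size)  # True: random.sample draws without replacement
print(0 in ids)               # False
```

Since random.sample draws without replacement, no row number appears twice, which is why the length of ids matches size exactly.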
The rows that get skipped are the ones whose numbers are in ids. If I show you ids, let me just show you the first five of them, I'm telling pandas that while you are reading this file f, please don't read this row number, please don't read this row number, please don't read this row number, and so on. So remember that in this ids variable I have 3522 random row numbers, which will be skipped while pandas is loading the file. Now if I run this and check the shape of the DataFrame, it is 3521, because the top row is your header row, so that's why it's 3521 rows and 21 columns. And if you check the head, the top five rows, you have the data, but this data is randomly subsetted from your original data. So this df does not hold the complete CSV file; it holds a subset of it. This is basically how you can load a random subset of your file without loading it completely into pandas.
Okay, so that's it for this video. I hope you enjoyed it. Please don't forget to comment if you have any doubts, please like my video, and don't forget to subscribe to my channel. I'll be uploading more such trick videos and tutorials in the coming days, so stay tuned and keep watching my channel. Thank you.
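Putting the steps from the video together, a minimal end-to-end sketch might look like the following. It assumes pandas is installed; the helper name load_random_half, the seed parameter, and the synthetic demo.csv file are inventions for this example, standing in for telco_churn.csv.

```python
import os
import random
import tempfile

import pandas as pd


def load_random_half(path, seed=None):
    """Hypothetical helper: read a CSV while skipping a random half of its lines."""
    # Count lines lazily; the file is never held in memory as a whole.
    with open(path) as handle:
        num_lines = sum(1 for _ in handle)
    size = num_lines // 2
    rng = random.Random(seed)
    # Line 0 is the header, so only lines 1..num_lines-1 are candidates to skip.
    ids = rng.sample(range(1, num_lines), size)
    return pd.read_csv(path, skiprows=ids)


# Synthetic stand-in for telco_churn.csv: 1 header line + 99 data rows.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "w", newline="") as out:
    out.write("row,value\n")
    for i in range(99):
        out.write(f"{i},{i * 2}\n")

df = load_random_half(demo_path, seed=0)
print(df.shape)  # (49, 2): 100 lines, 50 skipped, 1 of the remaining 50 is the header
```

The arithmetic mirrors the video: 7044 lines with 3522 skipped leaves 3522 lines, one of which is the header, giving the 3521 rows seen in the shape. Passing a seed is optional and only makes the sample repeatable.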