Dask Dataframe To Pandas | Dask-pandas Dataframe Join

Matthew Rocklin

Dask-pandas Dataframe Join

Transcript:

Alright, so in this example we’re going to look at doing a join of two tables: one Dask DataFrame, which is quite large, and another one, a pandas DataFrame, which can be quite a bit smaller. This is a common-case problem that’s fairly easy computationally. We’re just going to walk through how to do it with pandas, then with Dask DataFrame on a single machine, then with Dask DataFrame across a cluster.

I don’t actually have a data set here, so I’ll make a fake data set. We’re going to have many different products that’ll have ratings and values. These products are going to be in a thousand different categories. The products have the unimaginative names of AAAA, AAAB, and so on all the way down to ZZZZ, and I’m going to group those into about a thousand different categories just completely randomly, so product AAAA is in category number 143 along with several other products. This is fake data. So we have our products and they’re in categories. This is going to be a fairly small table, and we’re going to use this table a lot. It’s a dimension table. It’s going to go throughout our entire data set. It fits in memory very easily. This is the table we’re going to join against.

We have much larger tables that are going to look like the following. They’re going to have our products, the same products as before, along with some rating and some value. I’m choosing sort of random numbers here. This is indicative of what we see in the wild, but obviously real data will be different, OK? So this function, fake_data, takes in a number like 5 and produces a data frame with that many rows. We’re going to use this both to make a large pandas data frame and to make several data frames for a Dask DataFrame.

So what we want to do is merge our fake data set with this table of categories. We’re doing that using the pandas merge function, which does an inner join, in this case on the product column.
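The notebook itself is not reproduced in the transcript, so the following is only a minimal sketch of the pandas steps described above. The column names (`product`, `category`, `rating`, `value`) follow the talk; the three-letter product names, the 1 to 5 rating scale, the value range, and the random seed are assumptions made to keep the example small.

```python
import itertools
import string

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, assumed for reproducibility

# Assumed product names: every 3-letter uppercase string (AAA ... ZZZ).
products = ["".join(p) for p in itertools.product(string.ascii_uppercase, repeat=3)]

# Small dimension table: each product is assigned one of 1,000 random categories.
categories = pd.DataFrame({
    "product": products,
    "category": rng.integers(0, 1000, size=len(products)),
})


def fake_data(n):
    """Return a DataFrame with n random rows of product, rating, and value."""
    return pd.DataFrame({
        "product": rng.choice(products, size=n),
        "rating": rng.integers(1, 6, size=n),  # assumed 1-5 rating scale
        "value": rng.random(n) * 100,          # assumed value range
    })


df = fake_data(10_000)

# Inner join on the shared "product" column adds a category to every row.
joined = pd.merge(df, categories, on="product", how="inner")
print(joined.head())
```

Because every product in the fake data also appears in the dimension table, the inner join keeps all rows and simply adds the `category` column.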
Again, this is all just straight pandas. People have been doing this for years. It’s really well appreciated. It’s a common-case problem. So now we want to do the same thing with Dask DataFrame. Normally we would load in our data through something like the read_csv command, pointing to HDFS or S3 or to some local files. I don’t have any data locally, so I’m going to just use our previous fake_data function and the dask.delayed function to create a data frame of many different files or partitions. So you can ignore all of this. It’s just making a slightly larger fake example. We have ten different partitions, each with ten thousand rows. It’s not very large; it’s just for playing around.

The object that we have here is a Dask DataFrame, and this is the same object you would have if you were loading data from HDFS or S3. I’m just making it in this fake way. That object looks a whole lot like a pandas data frame. It looks like what we just had, and it has the same API. So what we’re going to do is actually the exact same merge functionality we used before, but now instead of using it on the pandas data frame, we’re using it on the Dask DataFrame. We’re going to merge our Dask DataFrame df against the same pandas data frame we were playing with before, the smaller, fits-in-memory dimension data frame, and we again get this new category column added onto our data frame as a result. So again: exact same syntax, exact same workflow, works exactly the same, but now in parallel. When we called head, we just looked at a small sample of the data to see what’s going on.

Let’s go ahead and do some computation we might want to do on our full data set. Here we’re going to use the Dask DataFrame API, which is exactly like the pandas API, to group by the category and then get out the rating column.
This rating says whether a product is good or bad. Then we compute the mean per category, then we get the ten largest results, and then actually compute the answer. So here we go. That just happened in parallel with all of my cores on this one machine, and we find that category number 611 is particularly well rated. So here you go. You can imagine how this might be useful on your own data set.

Okay, so that was Dask DataFrame on a single machine. Now we’re going to do the exact same thing, but on a cluster of computers. I’ve set up a cluster elsewhere with twenty different workers. Each worker has 16 cores. We’re going to make a larger example of that same data frame: now instead of having ten partitions, it’s going to have a thousand. Then we’re going to go ahead and do our exact same computation. We’re going to do the join with this merge command, do a group-by, pull out the rating, etc.

I have here on the left the dashboard of our cluster. You can see our twenty different computers, and over here on the bottom we have the progress of our computation. We’re seeing the progress both in the notebook and over here on the left. This is showing us our computations as they’re occurring, and down here we’re seeing what each core of our cluster is doing over time. Right now they’re doing a lot of merging, and in maybe 10 or 20 seconds it’ll finish up.

I’m also going to look at the entirety of our computation here. Let’s go ahead and look at this full history. This is showing us all of the history of our computation, everything that’s happened so far. On the left we see these bands of purple, and that’s our computers actually making our fake data. This is the fake_data function. On the y-axis here we have a lot of different cores, around 320 different cores. On the x-axis
we have time. On the left we’re seeing it making a bunch of fake data, and then we’re seeing over here in green these merges. Let’s zoom in a little bit. The merges are taking something like 3,000 milliseconds, so it takes about three seconds to join any of the data frames. These red bands here are moving the dimension table between machines. Those are data transfer. And you’ll notice that after every set of merges there tends to be a little cluster (let’s see if I can zoom in here), a little cluster of group-by operations. We’ve zoomed in now a bit. These group-by operations are taking 200 milliseconds, or sometimes even just 20 milliseconds. So if we think about our computation holistically, the merging operations are taking 3 seconds, with the group-bys taking 200 milliseconds, so really we should be focusing on the merge to make things go faster in the future.

In particular, I suspect this computation is bottlenecked by the merge because we’re using object dtype. pandas handles text by using Python objects, and that’s a little bit slow, so we could maybe accelerate our computation in the future by using pandas categoricals. I suspect that would make our computation go roughly 10 times faster.

So that’s it. Again, we’ve seen us using the pandas merge functionality, both in parallel on a single machine and across a cluster.
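The talk only suggests categoricals as a possible speed-up and does not show the code. One way this might look in pandas, as a sketch under assumptions, is to give the `product` column in both tables the same categorical dtype before merging. The product names and table sizes below are made up for illustration, and no timing claim is implied.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Assumed small stand-ins for the tables used in the talk.
products = [a + b for a in "ABCDEFGHIJ" for b in "ABCDEFGHIJ"]
categories = pd.DataFrame({
    "product": products,
    "category": rng.integers(0, 1000, size=len(products)),
})
df = pd.DataFrame({
    "product": rng.choice(products, size=10_000),
    "rating": rng.integers(1, 6, size=10_000),
})

# One shared categorical dtype, so both sides use identical integer codes.
product_dtype = pd.CategoricalDtype(categories=products)
df_cat = df.astype({"product": product_dtype})
categories_cat = categories.astype({"product": product_dtype})

joined_obj = df.merge(categories, on="product")          # text keys
joined_cat = df_cat.merge(categories_cat, on="product")  # categorical keys

print(joined_cat.dtypes)
```

Sharing one `CategoricalDtype` matters here: pandas keeps the merged key categorical only when both sides have the same categories. Whether this yields the roughly tenfold gain the speaker suspects would need to be measured on real data.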
