Data Frame Partitioning: Exhaustive and Mutually Exclusive Partition

Motivation

You might want tighter control over your sample, or you might have a very large dataframe and want to repeat the same process on chunks of it.

This method works for a pandas dataframe as well as for a plain list.

import random

# df is assumed to be an existing pandas dataframe
RANDOM_SEED_VALUE = 42

# Re-run this seed command every time you call shuffle (sample)
random.seed(RANDOM_SEED_VALUE)

# Shuffle the index: sampling k = len(index) labels without replacement
# is a full shuffle of the index labels
indx = random.sample(
	list(df.index),
	k=len(df.index)
	)

# Slice the shuffled labels into 99 chunks of 5108 each and select the
# matching rows with .loc (label-based lookup, so it works for any index)
samplez = []
for i in range(99):
	idx = indx[i * 5108: (i + 1) * 5108]
	samplez.append(df.loc[idx])

Why 99 and why 5108? In my case I had df.shape[0] = 505692, and df.shape[0] / 99 = 5108.0, which is a whole number. I'm looking for an exhaustive, mutually exclusive partition of the dataframe, so I need the chunk size to be a whole number. This gives me 99 disjoint partitions covering the entire dataframe, with 5108 rows in each: every partition still has enough data, and because the division is exact there are no leftover rows that go unused.
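A quick way to sanity-check this arithmetic is to list the divisors of your row count and pick a partition count from that list. The snippet below is a small sketch; n_rows and candidate_counts are illustrative names, not part of the method above.

n_rows = 505692  # df.shape[0] in my case

# Every divisor of n_rows gives an exhaustive, mutually exclusive partition
# with a whole number of rows per chunk and no leftovers
candidate_counts = [d for d in range(2, 200) if n_rows % d == 0]
print(candidate_counts)  # 99 appears here; 505692 // 99 = 5108 rows each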

For a Spark dataframe: first, add a column to the dataframe using a UDF or a native Spark function that generates a random number between 0 and 1. Then, in a filter clause, select a range of values between 0 and 1 to grab the corresponding rows. This gives you the same kind of disjoint cover of the dataframe as above, but the partitions may not all have exactly the same number of rows. Remember that you can store your partitions, which are now dataframes, in a list, so you end up with a list of dataframes. A sketch of this approach follows.
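Here is a minimal PySpark sketch of that idea, assuming you already have a SparkSession and a Spark dataframe called sdf; the column name rand_key and the variable n_parts are illustrative choices, not fixed names.

from pyspark.sql import functions as F

n_parts = 10  # pick any number of partitions you like

# Add a column of uniform random numbers in [0, 1); the seed makes it reproducible
sdf = sdf.withColumn("rand_key", F.rand(seed=42))

# Slice [0, 1) into n_parts equal-width ranges and filter the rows in each range
partitions = []
for i in range(n_parts):
	lo, hi = i / n_parts, (i + 1) / n_parts
	partitions.append(
		sdf.filter((F.col("rand_key") >= lo) & (F.col("rand_key") < hi))
		)

# partitions is now a list of disjoint dataframes covering sdf, though the
# row counts per partition will only be approximately equal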
