Hello!

I want to discuss the operation GROUP BY on pandas

A group by operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

The operation GROUP BY is easy to do, however, you have to pay attention in case of having a NaN value on the index.

By default the parameter dropna is equal to True, which means, in case of having a NaN value on the index, it is going to delete this row. To avoid this behavior, you should use dropna=False. Take a look at the code below

Code sample

First, lets create some dummy dataframe.

# libs
import pandas as pd
import numpy as np

# data
df_1 = pd.DataFrame({
    "produto":['a','a','a','b','c',np.nan],
    "valor":  [  1,  1,  1,  1,  1,10    ],
})

Here, lets check our columns. How many nans my df has?

print("how many nans my df has?")
print(df_1.isna().sum().sort_values(),"\n")
how many nans my df has?
valor      0
produto    1
dtype: int64 

Ok. Lets check the sum of all values.

print("sum all values")
print(df_1['valor'].sum(),"\n")
sum all values: 15 

Now, lets create a groupby and sum the values.

df_2 = df_1.groupby(['produto'])['valor'].sum().reset_index()
print("groupby and sum")
print(df_2['valor'].sum(),"\n")
groupby and sum:5

Let’s create a groupby with dropna True.

df_2 = df_1.groupby(['produto'], dropna=True)['valor'].sum().reset_index()
print("groupby and sum dropna True")
print(df_2['valor'].sum(),"\n")
groupby and sum dropna True: 5

Let’s create a groupby with dropna False.

df_2 = df_1.groupby(['produto'], dropna=False)['valor'].sum().reset_index()
print("groupby and sum dropna False")
print(df_2['valor'].sum(),"\n")
groupby and sum dropna False: 15

Here the whole code

# libs
import pandas as pd
import numpy as np

# data
df_1 = pd.DataFrame({
    "produto":['a','a','a','b','c',np.nan],
    "valor":  [  1,  1,  1,  1,  1,10    ],
})

print("how many nans my df has?")
print(df_1.isna().sum().sort_values(),"\n")


print("sum all values")
print(df_1['valor'].sum(),"\n")


df_2 = df_1.groupby(['produto'])['valor'].sum().reset_index()
print("groupby and sum")
print(df_2['valor'].sum(),"\n")
    

df_2 = df_1.groupby(['produto'], dropna=True)['valor'].sum().reset_index()
print("groupby and sum dropna True")
print(df_2['valor'].sum(),"\n")


df_2 = df_1.groupby(['produto'], dropna=False)['valor'].sum().reset_index()
print("groupby and sum dropna False")
print(df_2['valor'].sum(),"\n")

Leave a Reply