Groupby and dropna

Hello!

I want to discuss the operation GROUP BY on pandas.

A group by operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

The operation GROUP BY is easy to do, however, you have to pay attention in case of having a NaN value on the index.

By default the parameter dropna is equal to True, which means, in case of having a NaN value on the index, it is going to delete this row. To avoid this behavior, you should use dropna=False. Take a look at the code below

Code sample

First, lets create some dummy dataframe.

# libs
import pandas as pd
import numpy as np

# data
df_1 = pd.DataFrame({
    "produto":['a','a','a','b','c',np.nan],
    "valor":  [  1,  1,  1,  1,  1,10    ],
})

Here, lets check our columns. How many nans my df has?

print("how many nans my df has?")
print(df_1.isna().sum().sort_values(),"\n")

how many nans my df has?
valor      0
produto    1
dtype: int64

Ok. Lets check the sum of all values.

print("sum all values")
print(df_1['valor'].sum(),"\n")

sum all values: 15

Now, lets create a groupby and sum the values.

df_2 = df_1.groupby(['produto'])['valor'].sum().reset_index()
print("groupby and sum")
print(df_2['valor'].sum(),"\n")

groupby and sum:5

Let’s create a groupby with dropna True.

df_2 = df_1.groupby(['produto'], dropna=True)['valor'].sum().reset_index()
print("groupby and sum dropna True")
print(df_2['valor'].sum(),"\n")

groupby and sum dropna True: 5

Let’s create a groupby with dropna False.

df_2 = df_1.groupby(['produto'], dropna=False)['valor'].sum().reset_index()
print("groupby and sum dropna False")
print(df_2['valor'].sum(),"\n")

groupby and sum dropna False: 15

Here the whole code

# libs
import pandas as pd
import numpy as np

# data
df_1 = pd.DataFrame({
    "produto":['a','a','a','b','c',np.nan],
    "valor":  [  1,  1,  1,  1,  1,10    ],
})

print("how many nans my df has?")
print(df_1.isna().sum().sort_values(),"\n")


print("sum all values")
print(df_1['valor'].sum(),"\n")


df_2 = df_1.groupby(['produto'])['valor'].sum().reset_index()
print("groupby and sum")
print(df_2['valor'].sum(),"\n")
    

df_2 = df_1.groupby(['produto'], dropna=True)['valor'].sum().reset_index()
print("groupby and sum dropna True")
print(df_2['valor'].sum(),"\n")


df_2 = df_1.groupby(['produto'], dropna=False)['valor'].sum().reset_index()
print("groupby and sum dropna False")
print(df_2['valor'].sum(),"\n")

Code sample

Here the whole code

Erick Mattoso

Trending Now

Recent Posts

Categories

Previous PostSyrian War with Folium and Time Stamped GeoJSON

Next PostAltair Library

Leave a Reply Cancel Reply

Groupby and dropna

Code sample

Here the whole code

Erick Mattoso

Trending Now

Recent Posts

Categories

Tags

Previous PostSyrian War with Folium and Time Stamped GeoJSON

Next PostAltair Library

Recommended For You

How to create a Alteryx Python Engine SDK Tool?

Syrian War with Folium and Time Stamped GeoJSON

Altair Library

Leave a Reply Cancel Reply