How to filter categorical data in Pandas

Published 2024-09-27 04:21:46


Here is a sample of the data:

    sex     age         race        
    Male    0.204082    Hispanic    
    Male    0.122449    African-American    
    Female  0.163265    African-American    
    Male    0.081633    African-American    
    Male    0.530612    African-American

And the value counts of the race column:

African-American    2968
Caucasian           1969
Hispanic             502
Other                294
Asian                 26
Native American       13
Name: race, dtype: int64

I basically want to drop the Native American and Asian rows from the dataset, so this is what I did:

df_train_val_scaled = df_train_val_scaled[df_train_val_scaled["race"] != "Native American" & df_train_val_scaled["race"] != "Asian"]

This raised the following error:

TypeError: Cannot perform 'rand_' with a dtyped [object] array and scalar of type [bool]

So I tried the following instead:

df_train_val_scaled = df_train_val_scaled[df_train_val_scaled["race"] not in ["Native American", "Asian"]]

But that raises an error as well:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Thanks for your help.


2 Answers

The trick is to use ~df['race'].isin(['a', 'b', 'c']), which checks element by element whether each value is (not) in the given list. Here is an example:

from io import StringIO as sio

data = sio("""
 sex     age         race        
    Male    0.204082    Hispanic    
    Male    0.122449    African-American    
    Female  0.163265    African-American    
    Male    0.081633    African-American    
    Male    0.530612    African-American
""")

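import pandas as pd

# Parse the whitespace-separated sample and store race as a categorical column
df = pd.read_csv(data, sep=r'\s+').astype({'race': 'category'})

# Keep only the rows whose race is NOT in the list
df_train_val_scaled = df[~df["race"].isin(["Native American", "Asian"])]
df_train_val_scaled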
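
As a side note on why the two attempts in the question fail: in Python, & binds more tightly than !=, so the first attempt evaluates "Native American" & df["race"] before any comparison happens, and the second attempt forces a whole Series into a single True/False value. The sketch below uses a small made-up frame (the rows are an assumption, not the asker's data) to show that wrapping each comparison in parentheses produces the same mask as ~isin:

```python
import pandas as pd

# Hypothetical sample rows, not the asker's dataset
df = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Male"],
    "race": ["Hispanic", "Asian", "Native American", "Caucasian"],
})

# Parenthesize each comparison so that & combines two boolean Series
mask_ops = (df["race"] != "Native American") & (df["race"] != "Asian")

# Equivalent mask built with isin and negation
mask_isin = ~df["race"].isin(["Native American", "Asian"])

filtered = df[mask_ops]
print(filtered)
```

With more than two values to exclude, the isin form tends to stay shorter and easier to read than a chain of parenthesized comparisons.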

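You can filter a DataFrame on the values of any column with the isin() function. It returns a boolean Series, and indexing the DataFrame with that Series returns a new DataFrame that contains only the rows where the Series is True.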

import pandas as pd

people = {
    'sex': ['Male', 'Male', 'Male', 'Female', 'Male'],
    'age': [0.204082, 0.163265, 0.204082, 0.214082, 0.204082],
    'race': ['Hispanic', 'African-American', 'Asian', 'Asian', 'Asian']
}

df = pd.DataFrame(people)

filter_ = ~df['race'].isin(['African-American', 'Asian'])

print(filter_)

# 0     True
# 1    False
# 2    False
# 3    False
# 4    False
# Name: race, dtype: bool

df_filtered = df[filter_]
print(df_filtered)

#     sex       age      race
# 0  Male  0.204082  Hispanic
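
If the column really has the category dtype, as in the first answer, the filtered-out labels may still be registered as categories after the rows are gone. A minimal sketch, assuming a made-up Series named race (not the asker's data), of dropping them with cat.remove_unused_categories():

```python
import pandas as pd

# Hypothetical categorical column, not the asker's dataset
race = pd.Series(
    ["Hispanic", "Asian", "Native American", "Caucasian"],
    dtype="category",
)

# Filter out the unwanted values with the same ~isin mask
kept = race[~race.isin(["Native American", "Asian"])]

# The removed labels are still listed among the categories
categories_before = list(kept.cat.categories)

# Drop the categories that no longer occur in the data
kept = kept.cat.remove_unused_categories()
categories_after = list(kept.cat.categories)

print(categories_before)
print(categories_after)
```

Whether this step matters depends on what happens next; it is mostly relevant when later code iterates over the categories or groups by them.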
