Strange behavior of pandas DataFrame.sum when a column contains string values

import pandas as pd

df1 = pd.DataFrame([[1,2,3],[4,5,'hey'],[7,8,9]])

df2 = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
df2.loc[1,2] = 'hey'

df3 = pd.DataFrame(index=range(3), columns=range(3))
for i in range(3):
    for j in range(3):
        if (i,j) != (1,2):
            df3.loc[i,j] = i*3 + j + 1
        else:
            df3.loc[i,j] = 'hey'

# df1, df2, df3 look the same as below
   0  1    2
0  1  2    3
1  4  5  hey
2  7  8    9

sumrow1 = df1.sum(axis=1)
sumrow2 = df2.sum(axis=1)
sumrow3 = df3.sum(axis=1)

#sumrow1
0     3
1     9
2    15
dtype: int64

#sumrow2
0     3
1     9
2    15
dtype: int64

#sumrow3
0    0.0
1    0.0
2    0.0
dtype: float64

df = pd.DataFrame([[1,2,3],[4,5,'hey'],[7,8,9]])
df_c = df.copy()
for col in df.select_dtypes(['object']).columns:
    df_c[col] = pd.to_numeric(df_c[col], errors='coerce')
df['sum'] = df_c.sum(axis=1)

#result
   0  1    2   sum
0  1  2    3   6.0
1  4  5  hey   9.0
2  7  8    9  24.0

2 Answers

User

Answer #1 · edited 2024-10-01 17:26:41

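Following your question and jpp's diagnosis: the DataFrames look identical, but they differ in their column dtypes (df3 stores every column as object, while df1 and df2 keep the first two columns as int64).

Here are some comparison methods that reveal the difference: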

>>> df1.equals(df3)
False # not so useful, doesn't tell you why they differ

What you really want is pandas.testing.assert_frame_equal(), which reports why the two frames differ.
pandas.testing.assert_frame_equal() takes a kitchen-sink list of useful parameters, so you can tailor the comparison to whatever you need:
check_dtype : bool, default True
    Whether to check the DataFrame dtype is identical.
check_index_type : bool / string {'equiv'}, default False
    Whether to check the Index class, dtype and inferred_type are identical.
check_column_type : bool / string {'equiv'}, default False
    Whether to check the columns class, dtype and inferred_type are identical.
check_frame_type : bool, default False
    Whether to check the DataFrame class is identical.
check_less_precise : bool or int, default False
    Specify comparison precision. Only used when check_exact is False. 5 digits (False) or 3 digits (True) after decimal points are compared. If int, then specify the digits to compare.
check_names : bool, default True
    Whether to check the Index names attribute.
by_blocks : bool, default False
    Specify how to compare internal data. If False, compare by columns. If True, compare by blocks.
check_exact : bool, default False
    Whether to compare number exactly.
check_datetimelike_compat : bool, default False
    Compare datetime-like which is comparable ignoring dtype.
check_categorical : bool, default True
    Whether to compare internal Categorical exactly.
check_like : bool, default False
    If true, ignore the order of rows & columns.
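As a rough illustration of how this comparison can be used, here is a minimal sketch. It rebuilds df1 and df3 from the question; the helper name frame_diff is hypothetical (it is not part of pandas) and simply captures the assertion message instead of letting the exception propagate.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Same visible values, built two different ways (as in the question).
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 'hey'], [7, 8, 9]])
df3 = pd.DataFrame(index=range(3), columns=range(3))
for i in range(3):
    for j in range(3):
        df3.loc[i, j] = 'hey' if (i, j) == (1, 2) else i * 3 + j + 1


def frame_diff(left, right, **kwargs):
    """Hypothetical helper: return the assertion message, or None if equal."""
    try:
        assert_frame_equal(left, right, **kwargs)
    except AssertionError as err:
        return str(err)
    return None


# Default settings compare dtypes too, so the int64/object mismatch is reported.
strict = frame_diff(df1, df3)
# Relaxing check_dtype shifts the comparison towards the values themselves.
relaxed = frame_diff(df1, df3, check_dtype=False)

print(strict)
print(relaxed)
```

The message held in strict names the first column whose dtype differs, which is exactly the information df1.equals(df3) leaves out.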

User
Answer #2 · edited 2024-10-01 17:26:41

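There are a few issues here:
The main one is that the way you constructed df3 leaves all three series with dtype object, whereas df1 and df2 have dtype=int for the first two series.
Data in a pandas DataFrame is organized and stored by series [column], so type casting happens per series. The logic for summing "across rows" and "across columns" is therefore necessarily different, and it is not guaranteed to be consistent for mixed types.
To understand what is happening with the first issue, you have to appreciate that pandas does not continually check for the most appropriate dtype after every operation. That would be prohibitively expensive.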
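To make that point concrete, here is a small sketch, assuming the same cell-by-cell construction of df3 as in the question. It shows that the columns stay object after assignment, and that a one-off re-inference can be requested explicitly with DataFrame.infer_objects() (my suggestion for illustration, not something the question used).

```python
import pandas as pd

# Build the frame cell by cell, starting from an empty (all-object) frame.
df3 = pd.DataFrame(index=range(3), columns=range(3))
for i in range(3):
    for j in range(3):
        df3.loc[i, j] = 'hey' if (i, j) == (1, 2) else i * 3 + j + 1

# dtypes are not re-inferred as values are assigned.
before = df3.dtypes

# An explicit, one-off re-inference returns a new frame; df3 itself is unchanged.
after = df3.infer_objects().dtypes

print(before)
print(after)
```

Columns that hold only integers come back as an integer dtype after infer_objects(), while the column containing 'hey' remains object.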
You can check the dtypes yourself:
print({'df1': df1.dtypes, 'df2': df2.dtypes, 'df3': df3.dtypes})

{'df1': 0     int64
1     int64
2    object
dtype: object, 'df2': 0     int64
1     int64
2    object
dtype: object, 'df3': 0    object
1    object
2    object
dtype: object}
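Given those dtypes, the 3, 9, 15 row sums for df1 are consistent with the object column being left out of the sum. The sketch below makes that exclusion explicit with the numeric_only parameter of DataFrame.sum. This is an assumption about what produced the output in the question, and the default behaviour on mixed-type frames may differ between pandas versions.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [4, 5, 'hey'], [7, 8, 9]])

# Column 2 has dtype object, so numeric_only=True excludes it;
# only columns 0 and 1 contribute to each row's sum.
row_sum = df1.sum(axis=1, numeric_only=True)

print(row_sum)
```

Passing numeric_only explicitly avoids relying on whatever the default happens to be in the installed pandas version.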
You can apply the conversion to df3 selectively, via an operation that checks whether any null values result from the conversion.
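One way such a selective conversion might look is sketched below. This is an illustration under my own assumptions rather than the answer's original code: each column is passed through pd.to_numeric with errors='coerce', and the converted column is kept only if it contains no nulls.

```python
import pandas as pd

# Rebuild df3 as in the question: every column starts out as object.
df3 = pd.DataFrame(index=range(3), columns=range(3))
for i in range(3):
    for j in range(3):
        df3.loc[i, j] = 'hey' if (i, j) == (1, 2) else i * 3 + j + 1

for col in df3.columns:
    converted = pd.to_numeric(df3[col], errors='coerce')
    # Keep the converted column only if it contains no nulls,
    # i.e. every value in it could be read as a number.
    if not converted.isna().any():
        df3[col] = converted

print(df3.dtypes)
```

After this loop the first two columns are numeric and the column containing 'hey' is left untouched as object, which matches the dtypes of df1 and df2.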
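You should then see consistent treatment. At this point it is worth discarding your original df3: nowhere is it documented that continual series type checking can or should be applied after each operation.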
To ignore non-numeric values when summing across rows or columns, you can force the conversion with pd.to_numeric and errors='coerce':
df = pd.DataFrame([[1,2,3],[4,5,'hey'],[7,8,9]])

col_sum = df.apply(pd.to_numeric, errors='coerce').sum()
row_sum = df.apply(pd.to_numeric, errors='coerce').sum(axis=1)

print(col_sum)

0    12.0
1    15.0
2    12.0
dtype: float64

print(row_sum)

0     6.0
1     9.0
2    24.0
dtype: float64
