How can I efficiently update a field in a DataFrame based on boolean operations over 4 other fields in the record?

import numpy as np
import pandas as pd

df = pd.read_csv(fileIn, header=0)
df['match_count'] = 0
df['exclude'] = False

# This for loop takes 300+ seconds to execute 100'000 times
for index, row in df.iterrows():
    matchCounter = 0
    if row['in_deeds'] > 0:
        matchCounter += 1
    if row['in_valuation'] > 0:
        matchCounter += 1
    if row['in_property'] > 0:
        matchCounter += 1
    if row['in_sg'] > 0:
        matchCounter += 1
    df.loc[index, 'match_count'] = matchCounter

# This for loop takes only 11.75 seconds
i = 0
for index, row in df.iterrows():
    if "EXCL" in row['stat_deeds'].upper():
        i = i + 1
        df.loc[index, 'exclude'] = True
    elif "EXCL" in row['stat_valuation'].upper():
        i = i + 1
        df.loc[index, 'exclude'] = True
    elif "EXCL" in row['stat_property'].upper():
        i = i + 1
        df.loc[index, 'exclude'] = True
    elif "EXCL" in row['stat_sg'].upper():
        i = i + 1
        df.loc[index, 'exclude'] = True

df = df.query('exclude == False')

2 Answers

User

#1 · Edited 2024-09-28 23:50:12

Update after the OP's comment:

df['match_count']=(df[['in_deeds','in_valuation','in_property','in_sg']]>0).astype(int).sum(axis=1)

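The following goes one step further and gives the running total of matches at each point (each row) by taking the cumulative sum of the match counts.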

df['match_count']=(df[['in_deeds','in_valuation','in_property','in_sg']]>0).astype(int).sum(axis=1).cumsum()

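Piece by piece:

We first check, for each row, whether the values in the specified columns are greater than zero. This returns a boolean True or False, which we convert to an integer with .astype(int):

(df[['in_deeds','in_valuation','in_property','in_sg']]>0).astype(int)

Then we sum the values in each row with .sum(axis=1).
This returns a column that tells us, for each row, how many of the conditions (>0) were met.

Next, we take the cumulative sum across rows to get the running total of matches at each row.

Finally, we create a new column in the original DataFrame df with df['match_count']= and assign the result to it.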
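The steps above can be traced on a tiny frame. This is only a sketch: the four rows of sample data are made up for illustration, and only the column names come from the question. It also shows the difference between the per-row count (what the question's first loop computes) and the running total from cumsum.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the question.
df = pd.DataFrame({
    "in_deeds":     [1, 0, 2, 0],
    "in_valuation": [0, 0, 1, 0],
    "in_property":  [3, 0, 1, 1],
    "in_sg":        [0, 0, 1, 0],
})
cols = ["in_deeds", "in_valuation", "in_property", "in_sg"]

is_match = df[cols] > 0         # boolean frame: one True/False per cell
per_row = is_match.sum(axis=1)  # True counts as 1, so this is matches per row
running = per_row.cumsum()      # running total of matches down the rows

df["match_count"] = per_row
print(df["match_count"].tolist())  # [2, 0, 4, 1]
print(running.tolist())            # [2, 2, 6, 7]
```

Summing the boolean frame directly gives the same counts as converting with .astype(int) first, since True is treated as 1 in the sum.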

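User

#2 · Edited 2024-09-28 23:50:12

I have run into similar problems in the past when iterating over DataFrames. iterrows seems like the right choice at first glance because it is easy to use, but that convenience comes at a cost. There is a helpful blog post that outlines the ways to iterate more efficiently in pandas.

The upshot: don't use iterrows. In general, you can use the index as the iterator and then access the rows of the DataFrame with df.loc or df.iloc, like so: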

for i in df.index:
  print(df.loc[i, :])

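Using `df.apply`

The apply method lets you apply a user-defined function to every column or every row of a DataFrame. Its use here may be a little unintuitive, but it is by far the fastest of the loop-style approaches: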

import numpy as np
import pandas as pd

def counter(row):

    if np.any(row[row > 0]):
        return np.sum(row[row > 0])
    else:
        return 0

N = 100000

df = pd.DataFrame({'A': np.random.randint(0, 2, N),
                   'B': np.random.randint(0, 2, N),
                   'C': np.random.randint(0, 2, N),
                   'D': np.random.randint(0, 2, N)})

df['match_count'] = df.apply(counter, axis=1, raw=True)

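Here the function is called on each row of the DataFrame (as specified by axis=1). np.any returns True if the boolean selection row[row > 0] is non-empty, in which case the selection is reduced with np.sum to get the final count. Note that np.sum adds up the selected values, so it equals the number of positive entries only because the sample data contains just 0 and 1; for arbitrary values, np.sum(row > 0) counts the matches instead. The raw keyword argument is set to True so that each row is passed as a plain NumPy array rather than a Series, which improves performance for reduction operations such as sum (see the docs).

This takes about 1.2 seconds to run on my machine.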
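A small sketch of the count-versus-sum distinction, using made-up data with values above 1. The helper name count_positive and the sample frame are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

def count_positive(row):
    # With raw=True, `row` is a plain NumPy array holding one row's values.
    return np.count_nonzero(row > 0)

# Hypothetical data with values above 1, where summing and counting differ.
df = pd.DataFrame({"A": [0, 2, 5], "B": [1, 0, 3], "C": [0, 0, 4], "D": [7, 0, 0]})

counts = df.apply(count_positive, axis=1, raw=True)            # positives per row
sums = df.apply(lambda r: r[r > 0].sum(), axis=1, raw=True)    # sum of positives per row
vectorized = (df > 0).sum(axis=1)                              # same counts, no apply

print(counts.tolist())      # [2, 1, 3]
print(sums.tolist())        # [8, 2, 12]
print(vectorized.tolist())  # [2, 1, 3]
```

On 0/1 data the first two results coincide, which is why the original counter works for the random test frame.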

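Edit

Gio's answer illustrates a principle that I think is good practice when working with pandas: if there are methods that operate directly on the DataFrame (e.g. sum, cumsum), try to use them, as they will generally be faster.

Where no such method exists, df.apply is useful for specifying a more complex operation to apply. Just a tip for the future!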
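The same principle seems to carry over to the second loop in the question, the one that sets the exclude flag. The following is a sketch only, assuming every status value is a string (as the question's .upper() calls imply); the sample rows are made up, and only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample rows; only the column names follow the question.
df = pd.DataFrame({
    "stat_deeds":     ["ok", "Excluded", "ok"],
    "stat_valuation": ["ok", "ok", "ok"],
    "stat_property":  ["ok", "ok", "excl-dup"],
    "stat_sg":        ["ok", "ok", "ok"],
})
stat_cols = ["stat_deeds", "stat_valuation", "stat_property", "stat_sg"]

# One boolean column per status column: does the upper-cased text contain "EXCL"?
# apply runs once per column here (axis=0 by default), not once per row.
has_excl = df[stat_cols].apply(
    lambda col: col.str.upper().str.contains("EXCL", regex=False)
)

df["exclude"] = has_excl.any(axis=1)  # True if any of the four columns matched
kept = df[~df["exclude"]]             # rows that are not excluded

print(df["exclude"].tolist())  # [False, True, True]
print(kept.index.tolist())     # [0]
```

The string work is done column by column, so the number of Python-level calls is four rather than one per row.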

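Edit II

The example above with apply assumes that every column in the DataFrame is used in the boolean selection. If only specific columns hold the numeric values needed for the counter, use Gio's suggestion inside the counter function. Because the columns are selected by label, each row must arrive as a Series, so call apply without raw=True: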

def counter(row):

    selection = row[['in_deeds', 'in_valuation', 'in_property', 'in_sg']] > 0

    if np.any(selection):
        return np.sum(selection)
    else:
        return 0
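An alternative sketch, under the assumption that only the four columns from the question matter: select the columns before calling apply, so that raw=True can still be used and the function receives a plain NumPy array. The sample frame and its "other" column are hypothetical.

```python
import numpy as np
import pandas as pd

cols = ["in_deeds", "in_valuation", "in_property", "in_sg"]

def counter(row):
    # `row` is a NumPy array containing only the four selected columns.
    return np.count_nonzero(row > 0)

# Hypothetical frame with an extra column that must not be counted.
df = pd.DataFrame({
    "in_deeds": [1, 0],
    "in_valuation": [0, 0],
    "in_property": [2, 0],
    "in_sg": [1, 0],
    "other": [9, 9],
})

df["match_count"] = df[cols].apply(counter, axis=1, raw=True)
print(df["match_count"].tolist())  # [3, 0]
```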
