For loop that compares the items of a list against all items in the lists of previous rows

Posted 2024-10-04 05:30:55


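I have a pandas DataFrame sorted by date, where each row holds a list of strings. For each list, I want to compare every one of its strings against all the strings in the lists of the previous rows. If a string is found in a previous row that also satisfies the condition dataframe['label'] == 1, I add 1 to the count and move on to the next string.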

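Currently, for a DataFrame of 18k rows, this involves too many ugly for loops. I was wondering whether anyone could help me speed up this function.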

import numpy as np
import pandas as pd

# count how many ngrams in a row were present in previous rows where the condition is met
def count_previous(df, ngram_col):
    out = np.empty(len(df[ngram_col]))
    # loop through every row
    for i in range(len(df[ngram_col])):
        count = 0
        # loop through every ngram in the list of strings in the current row
        current_ng_list = df[ngram_col][i]
        for ng in current_ng_list:
            # loop through all previous rows
            for j in range(i):
                # check if condition is met, if it is break and move on to next ngram
                if ng in df[ngram_col][j] and df['label'][j] == 1:
                    count += 1
                    break
        out[i] = count
    return out


data1 = {
    'Date': ['2019-07-01', '2019-07-01', '2019-07-03', '2019-09-03', '2019-08-02',
             '2019-08-02', '2019-09-17', '2019-08-02', '2019-10-01'],
    'ngram_list': [['ena dio', 'this is a test'], ['this is test'], ['dog cat'],
                   ['birds are awesome'], ['birds are awesome'], ['birds are awesome'],
                   ['dog cat', 'birds are awesome', 'this is a test'], ['ena dio'],
                   ['ena dio', 'this is a test']],
    'label': [1, 1, 0, 1, 1, 0, 1, 1, 0],
}
df1 = pd.DataFrame(data1)
df1 = df1.sort_values('Date', ascending=True).reset_index(drop=True)
df1['counts'] = count_previous(df1, 'ngram_list')


Expected output: 

         Date                                          ngram_list  label  counts
0  2019-07-01                       ['ena dio', 'this is a test']      1     0.0
1  2019-07-01                                    ['this is test']      1     0.0
2  2019-07-03                                         ['dog cat']      0     0.0
3  2019-08-02                               ['birds are awesome']      1     0.0
4  2019-08-02                               ['birds are awesome']      0     1.0
5  2019-08-02                                         ['ena dio']      1     1.0
6  2019-09-03                               ['birds are awesome']      1     1.0
7  2019-09-17  ['dog cat', 'birds are awesome', 'this is a test']      1     2.0
8  2019-10-01                       ['ena dio', 'this is a test']      0     2.0

1 answer
User
#1 · Posted 2024-10-04 05:30:55

I managed to write it with (almost) no for loops. It does need more memory, though, because you have to create extra columns.

The idea is to create a column that holds all the ngrams we have already seen with label 1. We put them in a set, so that we do not waste memory or time on duplicates.

def func(x):
    ngrams = x['ngram_list']
    already_seen = x['already_seen']
    seen_sum = sum([ngram in already_seen for ngram in ngrams])
    return seen_sum


df1 = pd.DataFrame(data1)
df1 = df1.sort_values('Date', ascending=True).reset_index(drop=True)
# if the label is 0, we don't really care about these ngrams, so we can drop them and fill with previously-seen ones,
# so that we have the continuity of lists in the column. It will come in handy later.
df1['addable'] = (
    df1[['ngram_list']]
        .where(df1['label'] == 1)
        .ffill()
)

# next, we want to get the info about all the previously-seen ngrams. To do so, we can just use `cumsum`
# (since adding lists concatenates them) and turn each result into a set.
df1['already_seen'] = (
    df1['addable']
        .shift()
        .dropna()
        .cumsum()
        .apply(lambda v: set(v))
)
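# rows with nothing seen before them (at least the first row) have NaN here and are dropped;
# their count would be 0
df1 = df1.dropna()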

# only thing left to do is to sum all the previously-seen ngrams for every row.
df1['counts'] = df1.apply(func, axis=1)

Edit: if this is still too slow, you can probably get it down to a single .apply, which should help.
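One way to cut the work down further is a single pass that carries a running set of label-1 ngrams. The following is only a sketch of that idea, not the answerer's code: the function name `count_previous_single_pass` and its parameters are made up for illustration, and it assumes the rows are already in date order. It uses plain Python lists so that it runs without pandas.

```python
def count_previous_single_pass(ngram_lists, labels):
    """For each row, count its ngrams already seen in an earlier row with label 1."""
    seen = set()  # ngrams from earlier rows whose label was 1
    counts = []
    for ngrams, label in zip(ngram_lists, labels):
        # Count against earlier rows only, so do this before updating `seen`.
        counts.append(sum(ng in seen for ng in ngrams))
        if label == 1:
            seen.update(ngrams)
    return counts


# The question's rows, written out in date order (hypothetical standalone copy).
ngram_lists = [
    ['ena dio', 'this is a test'], ['this is test'], ['dog cat'],
    ['birds are awesome'], ['birds are awesome'], ['ena dio'],
    ['birds are awesome'],
    ['dog cat', 'birds are awesome', 'this is a test'],
    ['ena dio', 'this is a test'],
]
labels = [1, 1, 0, 1, 0, 1, 1, 1, 0]

counts = count_previous_single_pass(ngram_lists, labels)
print(counts)  # [0, 0, 0, 0, 1, 1, 1, 2, 2]
```

With the question's frame, this could presumably be called as `df1['counts'] = count_previous_single_pass(df1['ngram_list'], df1['label'])` after sorting. Each row is visited once and set lookups are cheap, so the cost grows with the total number of ngrams rather than with the square of the row count. The counts come back as integers, not the floats produced by `np.empty`.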
