For loop that compares the items of a list against all items in the lists of previous rows

import numpy as np
import pandas as pd

# count how many ngrams in a row were present in previous rows where the condition is met
def count_previous(df, ngram_col):
    out = np.empty(len(df[ngram_col]))
    # loop through every row
    for i in range(len(df[ngram_col])):
        count = 0
        # loop through every ngram in the list of strings in the current row
        current_ng_list = df[ngram_col][i]
        for ng in current_ng_list:
            # loop through all previous rows
            for j in range(i):
                # check if the condition is met; if it is, break and move on to the next ngram
                if ng in df[ngram_col][j] and df['label'][j] == 1:
                    count += 1
                    break
        out[i] = count
    return out

data1 = {'Date': ['2019-07-01', '2019-07-01', '2019-07-03', '2019-09-03', '2019-08-02',
                  '2019-08-02', '2019-09-17', '2019-08-02', '2019-10-01'],
         'ngram_list': [['ena dio', 'this is a test'], ['this is test'], ['dog cat'],
                        ['birds are awesome'], ['birds are awesome'], ['birds are awesome'],
                        ['dog cat', 'birds are awesome', 'this is a test'], ['ena dio'],
                        ['ena dio', 'this is a test']],
         'label': [1, 1, 0, 1, 1, 0, 1, 1, 0]}
df1 = pd.DataFrame(data1)
df1 = df1.sort_values('Date', ascending=True).reset_index(drop=True)
df1['counts'] = count_previous(df1, 'ngram_list')

Expected output:

   Date        ngram_list                                            label  counts
0  2019-07-01  ['ena dio', 'this is a test']                         1      0.0
1  2019-07-01  ['this is test']                                      1      0.0
2  2019-07-03  ['dog cat']                                           0      0.0
3  2019-08-02  ['birds are awesome']                                 1      0.0
4  2019-08-02  ['birds are awesome']                                 0      1.0
5  2019-08-02  ['ena dio']                                           1      1.0
6  2019-09-03  ['birds are awesome']                                 1      1.0
7  2019-09-17  ['dog cat', 'birds are awesome', 'this is a test']    1      2.0
8  2019-10-01  ['ena dio', 'this is a test']                         0      2.0

1 answer

User

#1 · Posted 2024-10-04 05:30:55

I managed to write it with (almost) no for loops. It does need more memory, though, because you have to create additional columns.

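The idea is to create a column that holds every ngram we have already seen in a row with label 1. We keep them in a set, so we can be sure we don't waste any memory or time on duplicates.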
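Stripped of pandas, the same idea can be sketched in plain Python: walk the rows once, keep one running set of ngrams from label-1 rows, and count each row against that set before updating it. The function name `count_previously_seen` and its two list arguments are hypothetical, chosen only for this illustration; the data is the question's sample, written out in date order.

```python
def count_previously_seen(ngram_lists, labels):
    """For each row, count its ngrams that occur in an earlier row with label 1."""
    seen = set()  # ngrams from earlier rows whose label is 1
    counts = []
    for ngrams, label in zip(ngram_lists, labels):
        # count first, then update, so a row is never compared with itself
        counts.append(sum(ng in seen for ng in ngrams))
        if label == 1:
            seen.update(ngrams)
    return counts


# rows from the question, already sorted by date
ngram_lists = [
    ['ena dio', 'this is a test'],
    ['this is test'],
    ['dog cat'],
    ['birds are awesome'],
    ['birds are awesome'],
    ['ena dio'],
    ['birds are awesome'],
    ['dog cat', 'birds are awesome', 'this is a test'],
    ['ena dio', 'this is a test'],
]
labels = [1, 1, 0, 1, 0, 1, 1, 1, 0]

print(count_previously_seen(ngram_lists, labels))
# [0, 0, 0, 0, 1, 1, 1, 2, 2]
```

Because set membership is a constant-time lookup on average, this replaces the innermost loop over all previous rows with a single check per ngram.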

def func(x):
    ngrams = x['ngram_list']
    already_seen = x['already_seen']
    seen_sum = sum([ngram in already_seen for ngram in ngrams])
    return seen_sum


df1 = pd.DataFrame(data1)
df1 = df1.sort_values('Date', ascending=True).reset_index(drop=True)
# if the label is 0, we don't really care about these ngrams, so we can drop them and fill with previously-seen ones,
# so that we have the continuity of lists in the column. It will come in handy later.
df1['addable'] = (
    df1[['ngram_list']]
        .where(df1['label'] == 1)
        .ffill()
)

# next, we want to get the info about all the previously-seen ngrams. To do so, we can just use `cumsum`
# (since adding list concatenates them) and turn them into a set.
df1['already_seen'] = (
    df1['addable']
        .shift()
        .dropna()
        .cumsum()
        .apply(lambda v: set(v))
)
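# the first row has no earlier rows, so its `already_seen` is NaN;
# note that dropna() therefore removes that row from the result
df1 = df1.dropna()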

# only thing left to do is to sum all the previously-seen ngrams for every row.
df1['counts'] = df1.apply(func, axis=1)

Edit: if this is still too slow, you could probably get it down to a single `.apply`, which should help.
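One way the single-`.apply` variant might look is sketched below. This is not the answerer's code: it swaps `where`/`ffill`/`cumsum` for `itertools.accumulate` to build the "seen before this row" sets, and the names `add_counts`, `contributed`, and `seen_before` are hypothetical. It assumes Python 3.8+ (for the `initial` argument) and a frame already sorted by date. Unlike the `dropna()` step above, it keeps the first row and gives it a count of 0.

```python
from itertools import accumulate

import pandas as pd


def add_counts(df, ngram_col="ngram_list", label_col="label"):
    """Return a copy of df with 'already_seen' and 'counts' columns added."""
    # a row contributes its ngrams only when its label is 1
    contributed = [
        set(ngrams) if label == 1 else set()
        for ngrams, label in zip(df[ngram_col], df[label_col])
    ]
    # running union seeded with an empty set; dropping the last element
    # leaves, for row i, the ngrams contributed by rows 0..i-1 only
    seen_before = list(
        accumulate(contributed, lambda acc, s: acc | s, initial=set())
    )[:-1]

    out = df.copy()
    out["already_seen"] = pd.Series(seen_before, index=df.index, dtype=object)
    # the only .apply: count the row's ngrams found in its already_seen set
    out["counts"] = out.apply(
        lambda row: sum(ng in row["already_seen"] for ng in row[ngram_col]),
        axis=1,
    )
    return out


# the question's sample rows, written out in date order
demo = pd.DataFrame({
    "ngram_list": [
        ["ena dio", "this is a test"], ["this is test"], ["dog cat"],
        ["birds are awesome"], ["birds are awesome"], ["ena dio"],
        ["birds are awesome"],
        ["dog cat", "birds are awesome", "this is a test"],
        ["ena dio", "this is a test"],
    ],
    "label": [1, 1, 0, 1, 0, 1, 1, 1, 0],
})
result = add_counts(demo)
print(result["counts"].tolist())
```

As in the original answer, storing one set per row trades memory for speed; if memory is the tighter constraint, a single running set updated row by row avoids the extra column altogether.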
