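I have a pandas DataFrame sorted by date, where each row holds a list of strings. For each list, I want to compare every one of its strings against the lists in all previous rows. If a string is found in a previous row and that row satisfies dataframe['label'] == 1, I add 1 to the count and move on to the next string.
At the moment this takes far too many ugly for loops for an 18k-row DataFrame. I was wondering whether anyone could help me speed this function up.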
import numpy as np
import pandas as pd

# count how many ngrams in a row were present in previous rows where the condition is met
def count_previous(df, ngram_col):
    out = np.empty(len(df[ngram_col]))
    # loop through every row
    for i in range(len(df[ngram_col])):
        count = 0
        # loop through every ngram in the list of strings in the current row
        current_ng_list = df[ngram_col][i]
        for ng in current_ng_list:
            # loop through all previous rows
            for j in range(i):
                # if the condition is met, count the ngram once and move on to the next ngram
                if ng in df[ngram_col][j] and df['label'][j] == 1:
                    count += 1
                    break
        out[i] = count
    return out
data1 = {'Date': ['2019-07-01', '2019-07-01', '2019-07-03', '2019-09-03', '2019-08-02',
                  '2019-08-02', '2019-09-17', '2019-08-02', '2019-10-01'],
         'ngram_list': [['ena dio', 'this is a test'], ['this is test'], ['dog cat'],
                        ['birds are awesome'], ['birds are awesome'], ['birds are awesome'],
                        ['dog cat', 'birds are awesome', 'this is a test'], ['ena dio'],
                        ['ena dio', 'this is a test']],
         'label': [1, 1, 0, 1, 1, 0, 1, 1, 0]}
df1 = pd.DataFrame(data1)
df1 = df1.sort_values('Date', ascending=True).reset_index(drop=True)
df1['counts'] = count_previous(df1, 'ngram_list')
Expected output:
         Date  ngram_list                                          label  counts
0  2019-07-01  ['ena dio', 'this is a test']                           1     0.0
1  2019-07-01  ['this is test']                                        1     0.0
2  2019-07-03  ['dog cat']                                             0     0.0
3  2019-08-02  ['birds are awesome']                                   1     0.0
4  2019-08-02  ['birds are awesome']                                   0     1.0
5  2019-08-02  ['ena dio']                                             1     1.0
6  2019-09-03  ['birds are awesome']                                   1     1.0
7  2019-09-17  ['dog cat', 'birds are awesome', 'this is a test']      1     2.0
8  2019-10-01  ['ena dio', 'this is a test']                           0     2.0
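I managed to write it with (almost) no for loops. It does need more memory, though, because you have to create extra columns. The idea is to create a column that holds every ngram we have already seen with label 1. We keep them in a set, so we can be sure not to waste any memory or time on duplicates.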
Edit: if this is still too slow, you can probably get it down to a single .apply, which should help.
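One way to read that suggestion is a single pass that keeps one running set and creates no extra column. The sketch below rests on that assumption: it uses a plain loop in place of .apply because the per-row step has to update shared state, and the name count_previous_single_pass is hypothetical.

```python
import pandas as pd


def count_previous_single_pass(df, ngram_col, label_col="label"):
    """Sketch: one pass over the rows with a single running set of label-1 ngrams."""
    seen = set()   # ngrams from earlier rows whose label is 1
    counts = []
    for ngrams, label in zip(df[ngram_col], df[label_col]):
        # Count first, so a row never matches against its own ngrams.
        counts.append(sum(ng in seen for ng in ngrams))
        # Only rows with label == 1 contribute to what later rows can match.
        if label == 1:
            seen.update(ngrams)
    return pd.Series(counts, index=df.index)
```

With this version, df1['counts'] = count_previous_single_pass(df1, 'ngram_list') does one set lookup per ngram, and memory stays proportional to the number of distinct label-1 ngrams.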