I have a DataFrame that looks like this:
import pandas as pd

data = {'speaker': ['Adam', 'Ben', 'Clair'],
        'speech': ['Thank you very much and good afternoon.',
                   'Let me clarify that because I want to make sure we have got everything right',
                   'By now you should have some good rest']}
df = pd.DataFrame(data)
I want to count the words in the speech column, but only the words that appear in a predefined list. For example, the list is:
wordlist = ['much', 'good','right']
I want to create a new column that shows how many times these three words occur in each row. So my expected output is:
  speaker                                             speech  words
0    Adam            Thank you very much and good afternoon.      2
1     Ben  Let me clarify that because I want to make sur...      1
2   Clair              By now you should have some good rest      1
I tried:
df['total'] = 0
for word in df['speech'].str.split():
    if word in wordlist:
        df['total'] += 1
But after running it, the total column is always zero. I would like to know what is wrong with my code.
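On why the loop in the question never counts anything: iterating over `df['speech'].str.split()` yields one list of tokens per row, not one word at a time, so `word in wordlist` compares a whole list against the strings in `wordlist` and is always False. Even if it matched, `df['total'] += 1` would add 1 to every row at once. A minimal sketch of a corrected loop, assuming plain whitespace splitting is good enough (the name `totals` is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'speaker': ['Adam', 'Ben', 'Clair'],
    'speech': ['Thank you very much and good afternoon.',
               'Let me clarify that because I want to make sure we have got everything right',
               'By now you should have some good rest'],
})
wordlist = ['much', 'good', 'right']

# Each item produced by iterating the split Series is a list of tokens for one row.
first_item = next(iter(df['speech'].str.split()))

# Corrected loop: walk the rows, then the tokens inside each row,
# and keep one counter per row instead of incrementing the whole column.
totals = []
for tokens in df['speech'].str.split():
    totals.append(sum(1 for token in tokens if token in wordlist))
df['total'] = totals
```

Note that whitespace splitting leaves punctuation attached to tokens (for example `'afternoon.'`), so a listed word followed by a period or comma would be missed with this approach.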
You can use the following vectorized approach:
where:
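A minimal sketch of one vectorized way to do this, which is not necessarily the code this answer had in mind. It assumes `Series.str.count` with a single regex built from `wordlist`; the variable name `pattern` and the use of `\b` word boundaries are my own choices for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'speaker': ['Adam', 'Ben', 'Clair'],
    'speech': ['Thank you very much and good afternoon.',
               'Let me clarify that because I want to make sure we have got everything right',
               'By now you should have some good rest'],
})
wordlist = ['much', 'good', 'right']

# Join the words into one alternation; re.escape guards against regex
# metacharacters, and \b stops 'good' from matching inside 'goodness'.
pattern = r'\b(?:' + '|'.join(map(re.escape, wordlist)) + r')\b'

# str.count returns the number of non-overlapping matches in each row.
df['words'] = df['speech'].str.count(pattern)
```

The match is case-sensitive as written; passing `flags=re.IGNORECASE` to `str.count` would also count capitalised occurrences such as "Good".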
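If you have a very large word list and a large DataFrame to search, here is a faster (runtime-wise) solution.
My guess is that this is because it takes advantage of a dictionary, which takes O(N) to build and O(1) per lookup. Performance-wise, the regex search is slower.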
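The dictionary idea described above could be sketched roughly like this. It is an assumption about the general shape rather than the answer's actual code: the names `lookup` and `count_listed_words` are hypothetical, and tokenising with `re.findall(r'\w+', ...)` is my choice so that trailing punctuation does not block a match.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'speaker': ['Adam', 'Ben', 'Clair'],
    'speech': ['Thank you very much and good afternoon.',
               'Let me clarify that because I want to make sure we have got everything right',
               'By now you should have some good rest'],
})
wordlist = ['much', 'good', 'right']

# Build the dictionary once (O(N)); each membership test is O(1) on average.
lookup = dict.fromkeys(wordlist, True)

def count_listed_words(text):
    # Split the text into word tokens and count those present in the lookup.
    return sum(1 for token in re.findall(r'\w+', text) if token in lookup)

df['words'] = df['speech'].apply(count_listed_words)
```

A `set(wordlist)` would give the same average O(1) lookups. Whether this actually beats the regex version depends on the sizes involved, so it is worth timing both on your own data.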