Preprocessing text strings with NLTK


I have a DataFrame A containing docid (document ID), title (article title), lineid (line ID, i.e. the paragraph's position), text, and tokencount (word count including whitespace):

  docid   title  lineid                                         text        tokencount
0     0     A        0   shopping and orders have become more com...                66
1     0     A        1  people wrote to the postal service online...                67
2     0     A        2   text updates really from the U.S. Postal...                43
...

Based on A, I want to create a new DataFrame containing title, lineid, count, and query, where:

query is a text string of one or more words, such as "data analysis", "text message", or "shopping and orders";

count is the per-line count of each word in query.

The new DataFrame should look like this:

title  lemma   count   lineid
  A    "data"    0        0
  A    "data"    1        1
  A    "data"    4        2
  A    "shop"    2        0
  A    "shop"    1        1
  A    "shop"    2        2
  B    "data"    4        0
  B    "data"    0        1
  B    "data"    2        2
  B    "shop"    9        0
  B    "shop"    3        1
  B    "shop"    1        2
...

How can I write a function that generates this new DataFrame?


I have already created a new DataFrame df with a count column from A:

df = A[['title','lineid']].copy()  # copy to avoid SettingWithCopyWarning on the next line
df['count'] = 0
df.set_index(['title','lineid'], inplace=True)

I have also created a function that counts how many of the query words occur in a string:

from collections import Counter

def occurrence_counter(target_string, query):
    # query is expected to be an iterable of words (e.g. a list); returns the
    # total number of times any of those words appears in target_string
    data = dict(Counter(target_string.split()))
    count = 0
    for key in query:
        if key in data:
            count += data[key]
    return count
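
For example, the function could be applied row by row to fill the count column; a minimal sketch, assuming df keeps the same row order as A and using a hypothetical query string:

# hypothetical usage: count the combined occurrences of the query words per row
query_words = "data analysis".split()
df['count'] = A['text'].apply(lambda t: occurrence_counter(t, query_words)).values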

But how do I use these to build a function that generates the new DataFrame?


2 Answers

If I understand correctly, you can do this with built-in pandas functions: str.count to count the queries and melt to reshape into the final column structure.

Given the sample df:

df = pd.DataFrame({'docid': {0: 0, 1: 0, 2: 0}, 'title': {0: 'A', 1: 'A', 2: 'A'}, 'lineid': {0: 0, 1: 1, 2: 2}, 'text': {0: 'shopping and orders have become more com...',  1: 'people wrote to the postal service online...',  2: 'text updates really from the U.S. Postal...'}, 'tokencount': {0: 66, 1: 67, 2: 43}})

#   docid  title  lineid                                          text
# 0     0      A       0   shopping and orders have become more com...
# 1     0      A       1  people wrote to the postal service online...
# 2     0      A       2   text updates really from the U.S. Postal...

First, count each query with str.count:

queries = ['order', 'shop', 'text']
df = df.assign(**{f'query_{query}': df.text.str.count(query) for query in queries})

#   docid  title  lineid                                          text  tokencount  query_order  query_shop  query_text
# 0     0      A       0   shopping and orders have become more com...          66            1           1           0
# 1     0      A       1  people wrote to the postal service online...          67            0           0           0
# 2     0      A       2   text updates really from the U.S. Postal...          43            0           0           1
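
As a side note, str.count treats each query as a regular expression and also counts substring matches (which is why "shop" matches "shopping" above). If only whole-word matches are wanted, a word-boundary pattern is one option; a minimal sketch (the name df_whole and the use of re.escape are assumptions):

import re

# alternative counting step: match whole words only, escaping regex metacharacters
df_whole = df.assign(**{f'query_{query}': df.text.str.count(rf'\b{re.escape(query)}\b')
                        for query in queries})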

Then melt into the final column structure:

df.melt(
    id_vars=['title', 'lineid'],
    value_vars=[f'query_{query}' for query in queries],
    var_name='lemma',
    value_name='count',
).replace(r'^query_', '', regex=True)

#   title  lineid  lemma  count
# 0     A       0  order      1
# 1     A       1  order      0
# 2     A       2  order      0
# 3     A       0   shop      1
# 4     A       1   shop      0
# 5     A       2   shop      0
# 6     A       0   text      0
# 7     A       1   text      0
# 8     A       2   text      1
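
Since the question asks for a function, the two steps can be wrapped together; a minimal sketch (the name query_counts and splitting the query string on whitespace are assumptions):

def query_counts(frame, query):
    # split the query string into words, e.g. "shopping orders" -> ['shopping', 'orders']
    words = query.split()
    # count each word per row, then reshape to title / lineid / lemma / count
    counted = frame.assign(**{f'query_{w}': frame.text.str.count(w) for w in words})
    return counted.melt(
        id_vars=['title', 'lineid'],
        value_vars=[f'query_{w}' for w in words],
        var_name='lemma',
        value_name='count',
    ).replace(r'^query_', '', regex=True)

# e.g. query_counts(df, 'order shop')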

This will handle your scenario:

import pandas as pd
from collections import Counter

query = "data analysis"
wordlist = query.split(" ")
#print(wordlist)

# row-wise frequency count: map each text to a Counter of its tokens
df['text_new'] = df.text.str.split().apply(lambda x: Counter(x))

rows = []
# iterate row by row and record one entry per (row, query word)
for index, row in df.iterrows():
    for word in wordlist:
        rows.append({
            'title': row['title'],
            'lemma': word,
            'count': row['text_new'][word],  # Counter returns 0 for missing words
            'lineid': row['lineid'],
        })

output = pd.DataFrame(rows)
#print(output)
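
Since the question title mentions NLTK and the target column is called lemma, a lemmatization step could also be added before counting, so that for example "orders" is counted under "order"; a minimal sketch using WordNetLemmatizer (assumes the WordNet data has been downloaded and that query words are lemmatized the same way):

import nltk
from nltk.stem import WordNetLemmatizer
from collections import Counter

# nltk.download('wordnet')  # one-time download of the WordNet data
lemmatizer = WordNetLemmatizer()

def lemma_counter(text):
    # lowercase, split on whitespace, lemmatize each token, then count
    return Counter(lemmatizer.lemmatize(tok) for tok in text.lower().split())

df['lemma_counts'] = df.text.apply(lemma_counter)
# a query word is then looked up by its lemma, e.g.
# df['lemma_counts'].apply(lambda c: c[lemmatizer.lemmatize('orders')])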
