Preprocessing text strings with NLTK


I have a DataFrame A containing docid (document ID), title (article title), lineid (line ID, i.e. the paragraph's position), text, and tokencount (word count including whitespace):

  docid   title  lineid                                         text        tokencount
0     0     A        0   shopping and orders have become more com...                66
1     0     A        1  people wrote to the postal service online...                67
2     0     A        2   text updates really from the U.S. Postal...                43
...

Based on A, I want to create a new DataFrame containing title, lineid, count, and query, where:

query is a text string of one or more words, such as "data analysis", "text message", or "shopping and orders";

count is the per-line count of each word in query.

The new DataFrame should look like this:

title  lemma   count   lineid
  A    "data"    0        0
  A    "data"    1        1
  A    "data"    4        2
  A    "shop"    2        0
  A    "shop"    1        1
  A    "shop"    2        2
  B    "data"    4        0
  B    "data"    0        1
  B    "data"    2        2
  B    "shop"    9        0
  B    "shop"    3        1
  B    "shop"    1        2
...

How can I write a function that generates this new DataFrame?


I have already created a new DataFrame df with a count column from A:

df = A[['title','lineid']].copy()  # copy to avoid SettingWithCopyWarning on the next line
df['count'] = 0
df.set_index(['title','lineid'], inplace=True)

I have also created a function that counts how many of the query words occur in a string:

from collections import Counter

def occurrence_counter(target_string, query):
    # query is expected to be an iterable of words (e.g. a list); returns the
    # total number of times any of those words appears in target_string
    data = dict(Counter(target_string.split()))
    count = 0
    for key in query:
        if key in data:
            count += data[key]
    return count
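
For example, the function could be applied row by row to fill the count column; a minimal sketch, assuming df keeps the same row order as A and using a hypothetical query string:

# hypothetical usage: count the combined occurrences of the query words per row
query_words = "data analysis".split()
df['count'] = A['text'].apply(lambda t: occurrence_counter(t, query_words)).values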

But how do I use these to build a function that generates the new DataFrame?


2 Answers

If I understand correctly, you can do this with built-in pandas functions: str.count to count the queries and melt to reshape into the final column structure.

Given the sample df:

df = pd.DataFrame({'docid': {0: 0, 1: 0, 2: 0}, 'title': {0: 'A', 1: 'A', 2: 'A'}, 'lineid': {0: 0, 1: 1, 2: 2}, 'text': {0: 'shopping and orders have become more com...',  1: 'people wrote to the postal service online...',  2: 'text updates really from the U.S. Postal...'}, 'tokencount': {0: 66, 1: 67, 2: 43}})

#   docid  title  lineid                                          text
# 0     0      A       0   shopping and orders have become more com...
# 1     0      A       1  people wrote to the postal service online...
# 2     0      A       2   text updates really from the U.S. Postal...

First, count each query with str.count:

queries = ['order', 'shop', 'text']
df = df.assign(**{f'query_{query}': df.text.str.count(query) for query in queries})

#   docid  title  lineid                                          text  tokencount  query_order  query_shop  query_text
# 0     0      A       0   shopping and orders have become more com...          66            1           1           0
# 1     0      A       1  people wrote to the postal service online...          67            0           0           0
# 2     0      A       2   text updates really from the U.S. Postal...          43            0           0           1
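
As a side note, str.count treats each query as a regular expression and also counts substring matches (which is why "shop" matches "shopping" above). If only whole-word matches are wanted, a word-boundary pattern is one option; a minimal sketch (the name df_whole and the use of re.escape are assumptions):

import re

# alternative counting step: match whole words only, escaping regex metacharacters
df_whole = df.assign(**{f'query_{query}': df.text.str.count(rf'\b{re.escape(query)}\b')
                        for query in queries})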

Then melt into the final column structure:

df.melt(
    id_vars=['title', 'lineid'],
    value_vars=[f'query_{query}' for query in queries],
    var_name='lemma',
    value_name='count',
).replace(r'^query_', '', regex=True)

#   title  lineid  lemma  count
# 0     A       0  order      1
# 1     A       1  order      0
# 2     A       2  order      0
# 3     A       0   shop      1
# 4     A       1   shop      0
# 5     A       2   shop      0
# 6     A       0   text      0
# 7     A       1   text      0
# 8     A       2   text      1
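
Since the question asks for a function, the two steps can be wrapped together; a minimal sketch (the name query_counts and splitting the query string on whitespace are assumptions):

def query_counts(frame, query):
    # split the query string into words, e.g. "shopping orders" -> ['shopping', 'orders']
    words = query.split()
    # count each word per row, then reshape to title / lineid / lemma / count
    counted = frame.assign(**{f'query_{w}': frame.text.str.count(w) for w in words})
    return counted.melt(
        id_vars=['title', 'lineid'],
        value_vars=[f'query_{w}' for w in words],
        var_name='lemma',
        value_name='count',
    ).replace(r'^query_', '', regex=True)

# e.g. query_counts(df, 'order shop')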

This will handle your scenario:

import pandas as pd
from collections import Counter

query = "data analysis"
wordlist = query.split(" ")
#print(wordlist)

# row-wise frequency count: map each text to a Counter of its tokens
df['text_new'] = df.text.str.split().apply(lambda x: Counter(x))

rows = []
# iterate row by row and record one entry per (row, query word)
for index, row in df.iterrows():
    for word in wordlist:
        rows.append({
            'title': row['title'],
            'lemma': word,
            'count': row['text_new'][word],  # Counter returns 0 for missing words
            'lineid': row['lineid'],
        })

output = pd.DataFrame(rows)
#print(output)
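
Since the question title mentions NLTK and the target column is called lemma, a lemmatization step could also be added before counting, so that for example "orders" is counted under "order"; a minimal sketch using WordNetLemmatizer (assumes the WordNet data has been downloaded and that query words are lemmatized the same way):

import nltk
from nltk.stem import WordNetLemmatizer
from collections import Counter

# nltk.download('wordnet')  # one-time download of the WordNet data
lemmatizer = WordNetLemmatizer()

def lemma_counter(text):
    # lowercase, split on whitespace, lemmatize each token, then count
    return Counter(lemmatizer.lemmatize(tok) for tok in text.lower().split())

df['lemma_counts'] = df.text.apply(lemma_counter)
# a query word is then looked up by its lemma, e.g.
# df['lemma_counts'].apply(lambda c: c[lemmatizer.lemmatize('orders')])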
