Preprocessing text strings with NLTK



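I have a dataframe <code>A</code> that contains <code>docid</code> (document ID), <code>title</code> (article title), <code>lineid</code> (line ID, i.e. the paragraph's position), <code>text</code>, and <code>tokencount</code> (word count, including whitespace):

<pre><code>   docid title  lineid                                          text  tokencount
0      0     A       0   shopping and orders have become more com...          66
1      0     A       1  people wrote to the postal service online...          67
2      0     A       2   text updates really from the U.S. Postal...          43
...
</code></pre>

From it, I want to create a new dataframe that contains <code>title</code>, <code>lineid</code>, <code>count</code>, and <code>query</code>.

<code>query</code> is a text string of one or more words, such as "data analysis", "text message", or "shopping and orders".

<code>count</code> is the count of each word in <code>query</code>.

The new dataframe should look like this:

<pre><code>title  lemma   count  lineid
A      "data"  0      0
A      "data"  1      1
A      "data"  4      2
A      "shop"  2      0
A      "shop"  1      1
A      "shop"  2      2
B      "data"  4      0
B      "data"  0      1
B      "data"  2      2
B      "shop"  9      0
B      "shop"  3      1
B      "shop"  1      2
...
</code></pre>

How can I write a function that generates this new dataframe?

<hr/>

I have already created a new dataframe <code>df</code> from <code>A</code>, with a <code>count</code> column:

<pre><code># .copy() avoids pandas' SettingWithCopyWarning when adding the new column
df = A[['title', 'lineid']].copy()
df['count'] = 0
df.set_index(['title', 'lineid'], inplace=True)
</code></pre>

I have also written a function that counts the query words:

<pre><code>from collections import Counter

def occurrence_counter(target_string, query):
    # Tally every whitespace-separated token in the target string
    data = dict(Counter(target_string.split()))
    count = 0
    # query must be an iterable of words (e.g. a list);
    # a plain string would be iterated character by character
    for key in query:
        if key in data:
            count += data[key]
    return count
</code></pre>

But how do I use these in a function that generates the new dataframe?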


1 answer

Anonymous, 1 day ago


If I understand correctly, you can do this with built-in pandas functions: <a href="https://pandas.pydata.org/docs/reference/api/pandas.Series.str.count.html" rel="nofollow noreferrer"><code>str.count()</code></a> to count the <code>queries</code>, and <a href="https://pandas.pydata.org/docs/reference/api/pandas.melt.html" rel="nofollow noreferrer"><code>melt()</code></a> to reshape the result into the final column structure.

Given the sample <code>df</code>:

<pre class="lang-py prettyprint-override"><code>import pandas as pd

df = pd.DataFrame({
    'docid': {0: 0, 1: 0, 2: 0},
    'title': {0: 'A', 1: 'A', 2: 'A'},
    'lineid': {0: 0, 1: 1, 2: 2},
    'text': {
        0: 'shopping and orders have become more com...',
        1: 'people wrote to the postal service online...',
        2: 'text updates really from the U.S. Postal...',
    },
    'tokencount': {0: 66, 1: 67, 2: 43},
})

#    docid title  lineid                                          text  tokencount
# 0      0     A       0   shopping and orders have become more com...          66
# 1      0     A       1  people wrote to the postal service online...          67
# 2      0     A       2   text updates really from the U.S. Postal...          43
</code></pre>

First, use <code>assign()</code> to add one <code>str.count()</code> column per query:

<pre class="lang-py prettyprint-override"><code>queries = ['order', 'shop', 'text']
df = df.assign(**{f'query_{query}': df.text.str.count(query) for query in queries})

#    docid title  lineid                                          text  tokencount  query_order  query_shop  query_text
# 0      0     A       0   shopping and orders have become more com...          66            1           1           0
# 1      0     A       1  people wrote to the postal service online...          67            0           0           0
# 2      0     A       2   text updates really from the U.S. Postal...          43            0           0           1
</code></pre>

Then <a href="https://pandas.pydata.org/docs/reference/api/pandas.melt.html" rel="nofollow noreferrer"><code>melt()</code></a> into the final column structure, stripping the <code>query_</code> prefix from the lemma names:

<pre class="lang-py prettyprint-override"><code>df.melt(
    id_vars=['title', 'lineid'],
    value_vars=[f'query_{query}' for query in queries],
    var_name='lemma',
    value_name='count',
).replace(r'^query_', '', regex=True)

#   title  lineid  lemma  count
# 0     A       0  order      1
# 1     A       1  order      0
# 2     A       2  order      0
# 3     A       0   shop      1
# 4     A       1   shop      0
# 5     A       2   shop      0
# 6     A       0   text      0
# 7     A       1   text      0
# 8     A       2   text      1
</code></pre>
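Since the question asks for a function, the two steps above could be wrapped up roughly as follows. This is only a sketch: the name <code>count_queries</code> and its <code>text_col</code> parameter are made up for illustration, and it assumes the frame has <code>title</code> and <code>lineid</code> columns and that no query is itself named <code>title</code> or <code>lineid</code>. Because <code>str.count()</code> interprets its pattern as a regular expression, the sketch passes each query through <code>re.escape</code> so that strings such as "U.S." are matched literally. Like the steps above, it counts substring matches, so "shop" also matches "shopping".

```python
import re

import pandas as pd


def count_queries(frame, queries, text_col="text"):
    """Return one row per (title, lineid, query) with the match count.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # One column of counts per query; re.escape makes each query match
    # literally, because str.count treats its pattern as a regex.
    counts = pd.DataFrame(
        {q: frame[text_col].str.count(re.escape(q)) for q in queries},
        index=frame.index,
    )
    wide = pd.concat([frame[["title", "lineid"]], counts], axis=1)
    # Reshape from one column per query to one row per query.
    return wide.melt(
        id_vars=["title", "lineid"],
        value_vars=list(counts.columns),
        var_name="lemma",
        value_name="count",
    )


sample = pd.DataFrame({
    "title": ["A", "A", "A"],
    "lineid": [0, 1, 2],
    "text": [
        "shopping and orders have become more com...",
        "people wrote to the postal service online...",
        "text updates really from the U.S. Postal...",
    ],
})
result = count_queries(sample, ["order", "shop", "text"])
print(result)
```

Naming the count columns after the queries directly avoids the <code>query_</code> prefix and the later <code>replace()</code> step, at the cost of the column-name restriction mentioned above.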
