<p>I'm not sure this is the most efficient way, but it is flexible enough. Basically, iterate over the dataframe, split each text cell into sentences, and create a new row per sentence while carrying the category along:</p>
<pre><code>import pandas as pd

test = """This is a sentence. This is another sentence.
This is a third sentence. We want a separate row for each sentence."""
df = pd.DataFrame({'docs': test, 'category': 'winterland'}, index=[0])

# One new row per sentence; strip whitespace and drop empty fragments
df_new = pd.concat([pd.DataFrame({'doc': doc.strip(), 'category': row['category']}, index=[0])
                    for _, row in df.iterrows()
                    for doc in row['docs'].split('.') if doc.strip() != ''])
</code></pre>
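<p>For larger frames, a more vectorized sketch of the same idea uses pandas' <code>Series.str.split</code> plus <code>DataFrame.explode</code> (assuming pandas 0.25 or later, where <code>explode</code> was added); the column names here mirror the snippet above:</p>

```python
import pandas as pd

test = """This is a sentence. This is another sentence.
This is a third sentence. We want a separate row for each sentence."""
df = pd.DataFrame({'docs': [test], 'category': ['winterland']})

# Split each document on '.' into a list, then explode to one sentence per row
df_new = (df.assign(doc=df['docs'].str.split('.'))
            .explode('doc')
            .drop(columns='docs'))
df_new['doc'] = df_new['doc'].str.strip()
df_new = df_new[df_new['doc'] != ''].reset_index(drop=True)
print(df_new)
```

This avoids the Python-level loop over rows, which matters once the dataframe has more than a handful of documents.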
<p><code>df_new</code> should have the output you want. Note that splitting on <code>'.'</code> is naive; you could use NLTK's <code>sent_tokenize</code> here, or for more advanced sentence boundary detection, <a href="https://spacy.io/" rel="nofollow noreferrer">spaCy's</a> <code>sents</code>. spaCy has many amazing features and is great for NLP projects.</p>
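<p>To make the limitation of the naive split concrete without pulling in NLTK or spaCy, here is a small stdlib-only sketch: a regex that splits after sentence-ending punctuation followed by whitespace. The example text, including the "Dr. Smith" abbreviation, is made up for illustration:</p>

```python
import re

text = "This is a sentence. Is this another? Yes! And Dr. Smith shows the limits."

# Split at whitespace that follows '.', '!' or '?' (lookbehind keeps the punctuation)
sentences = re.split(r'(?<=[.!?])\s+', text)
print(sentences)
```

The regex wrongly breaks after "Dr.", which is exactly the kind of case that trained tokenizers like <code>sent_tokenize</code> or spaCy's <code>sents</code> handle for you.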