How to split a document in a pandas DataFrame and create a row for each sentence

2 Answers

User

#1 · Edited 2024-09-27 07:28:58

Another approach is to split on the period ('.').

So, using the same test string as datawrestler:

test = """This is a sentence. This is another sentence. This is a third sentence. We want a separate row for each sentence."""

We can split the text into a list and feed it into a DataFrame, like so:

import pandas as pd

df = pd.DataFrame({'docs': test.split('.'), 'category': 'winterland'})

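The only difference in the result is that you get an empty row at the bottom, because the text ends with a period. You can filter it out afterwards if needed, or you can use a list comprehension when creating the DataFrame to exclude the empty string, like this: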

pd.DataFrame({'docs': [sentence for sentence in test.split('.') if sentence != ''], 'category': 'winterland'})
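
If the text already sits in a DataFrame column, the same split can be done without building a list by hand. The following is only a minimal sketch, assuming a pandas version that provides DataFrame.explode (0.25 or later); the name out is made up for illustration. It also strips the leading space that a plain split('.') leaves on every sentence after the first.

```python
import pandas as pd

test = ("This is a sentence. This is another sentence. "
        "This is a third sentence. We want a separate row for each sentence.")

# One document per row, with its category.
df = pd.DataFrame({'docs': [test], 'category': ['winterland']})

# Turn each document into a list of pieces, then give every piece its own row.
# explode() repeats the other columns (here: category) for each piece.
out = df.assign(docs=df['docs'].str.split('.')).explode('docs')

# Trim surrounding whitespace and drop the empty piece left after the final period.
out['docs'] = out['docs'].str.strip()
out = out[out['docs'] != ''].reset_index(drop=True)

print(out)
```

Note that splitting on '.' drops the period itself and will also break on abbreviations or decimals, so this only suits simple text.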

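User

#2 · Edited 2024-09-27 07:28:58

There are probably slicker ways to do this, but this works. Basically, iterate over the DataFrame, split the text cell into sentences, and create a new row for each sentence while preserving its category: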

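import pandas as pd

test = """This is a sentence. This is another sentence.
          This is a third sentence. We want a separate row for each sentence."""

df = pd.DataFrame({'docs': test, 'category': 'winterland'}, index=[0])

# Build one single-row frame per non-empty sentence, carrying the category along,
# then concatenate them; ignore_index=True renumbers the rows 0..n-1.
df_new = pd.concat([pd.DataFrame({'doc': doc.strip(), 'category': row['category']}, index=[0])
                    for _, row in df.iterrows()
                    for doc in row['docs'].split('.') if doc.strip()],
                   ignore_index=True)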

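df_new should now have the output you want. You could use NLTK's sent_tokenize here instead of split('.'), or, for more advanced sentence boundary detection, spaCy's sents. spaCy has many great features and is well suited to NLP projects.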
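
For the spaCy route, here is a rough sketch rather than a definitive recipe. It assumes spaCy v3 (where add_pipe takes a component name) and uses a blank English pipeline with the rule-based "sentencizer" component, so no trained model has to be downloaded; a full model would give better boundaries. The helper split_sentences is a hypothetical name introduced only for this example.

```python
import pandas as pd
import spacy

# Assumption: spaCy v3. A blank pipeline plus the rule-based sentencizer
# splits on sentence-final punctuation and needs no model download.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")


def split_sentences(text):
    """Hypothetical helper: return the sentences of `text` as stripped strings."""
    return [sent.text.strip() for sent in nlp(text).sents if sent.text.strip()]


df = pd.DataFrame({'docs': ["This is a sentence. This is another sentence."],
                   'category': ['winterland']})

# Replace each document with its list of sentences, then one row per sentence.
df_new = (df.assign(docs=df['docs'].apply(split_sentences))
            .explode('docs')
            .reset_index(drop=True))

print(df_new)
```

Unlike split('.'), this keeps the closing punctuation on each sentence and also handles '!' and '?'.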
