Extracting sentences containing specific words with pandas



I have an Excel file with a text column. All I need to do is extract, for each row, the sentences in the text column that contain specific words.

I tried defining a function:

import pandas as pd
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

#################Reading in excel file#####################

str_df = pd.read_excel("C:\\Users\\HP\\Desktop\\context.xlsx")

################# Defining a function #####################

def sentence_finder(text, word):
    sentences = sent_tokenize(text)
    return [sent for sent in sentences if word in word_tokenize(sent)]
################# Finding Context ##########################
str_df['context'] = str_df['text'].apply(sentence_finder, args=('snakes',))

################# Output file #################################
str_df.to_excel("C:\\Users\\HP\\Desktop\\context_result.xlsx")
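For a single word this works; here is a quick check of sentence_finder on a made-up sample string (the sample text below is just illustrative):

sample = "Anaconda is venomous. Amazon is a big forest."
print(sentence_finder(sample, 'venomous'))
# -> ['Anaconda is venomous.']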

But can someone help me when I am looking for sentences containing any of several specific words, e.g. snakes, venomous, anaconda? A sentence should contain at least one of those words. I can't get nltk.tokenize to work with multiple words.

The words to search for: words = ['snakes', 'venomous', 'anaconda']

Input Excel file:

(sample data not rendered in the original post)

Desired output:

A column named Context appended next to the text column above. The context column should look like this:

 1.  [Snakes are venomous.] [Anaconda is venomous.]
 2.  [Anaconda lives in Amazon.] [It is venomous.]
 3.  [Snakes,snakes,snakes everywhere!] [The least I expect is an anaconda.Because it is venomous.]
 4.  NULL

Thanks in advance.


1 Answer

Here's how to do it:

In [1]: df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                       if any(True for w in word_tokenize(sent) 
                                               if w.lower() in searched_words)])

0    [Snakes are venomous., Anaconda is venomous.]
1    [Anaconda lives in Amazon.Amazon is a big forest., It is venomous.]
2    [Snakes,snakes,snakes everywhere!, !The least I expect is an anaconda.Because it is venomous.]
3    []
Name: text, dtype: object
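The snippet above assumes the tokenizers are imported and that searched_words already holds the lowercased search terms, along these lines:

from nltk.tokenize import sent_tokenize, word_tokenize

# the comparison uses w.lower(), so keep the search terms lowercase
searched_words = [w.lower() for w in ['snakes', 'venomous', 'anaconda']]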

You can see there are a couple of problems, because sent_tokenize is not doing its job properly due to the punctuation.


Update: handling plurals.

Here is the updated df:

(updated sample data not rendered in the original post)

We can use a stemmer (Wikipedia), for example the PorterStemmer.

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

First, let's stem and lowercase the searched words:

searched_words = ['snakes','Venomous','anacondas']
searched_words = [stemmer.stem(w.lower()) for w in searched_words]
searched_words

> ['snake', 'venom', 'anaconda']

Now we can improve the code above to include stemming:

print(df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                           if any(True for w in word_tokenize(sent) 
                                     if stemmer.stem(w.lower()) in searched_words)]))

0    [Snakes are venomous., Anaconda is venomous.]
1    [Anaconda lives in Amazon., It is venomous.]
2    [Snakes,snakes,snakes everywhere!, The least I expect is an anaconda., Because it is venomous.]
3    []
4    [I have snakes]
Name: text, dtype: object
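To plug this back into the workflow from the question, you could assign the result to a context column and write it out (a sketch reusing str_df and the output path from the question):

# build the context column with the stemmed matching and save it
str_df['context'] = str_df['text'].apply(
    lambda text: [sent for sent in sent_tokenize(text)
                  if any(stemmer.stem(w.lower()) in searched_words
                         for w in word_tokenize(sent))])
str_df.to_excel("C:\\Users\\HP\\Desktop\\context_result.xlsx")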

If substring matching is enough for you, make sure the searched words are singular, not plural:

 print(df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                           if any([(w2.lower() in w.lower()) for w in word_tokenize(sent)
                                   for w2 in searched_words])
                                ])
 )

By the way, I would probably write a function with a regular for loop here; this lambda with a list comprehension is getting out of hand.
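A minimal sketch of such a refactoring (the function name and signature here are just illustrative):

def find_context(text, searched_words, stemmer):
    # return the sentences in text that contain at least one stemmed search word
    matches = []
    for sent in sent_tokenize(text):
        stems = {stemmer.stem(w.lower()) for w in word_tokenize(sent)}
        if stems & set(searched_words):
            matches.append(sent)
    return matches

df['context'] = df['text'].apply(find_context, args=(searched_words, stemmer))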
