Extracting sentences containing specific words with pandas



I have an Excel file with a text column. All I need to do is extract, for each row, the sentences in the text column that contain specific words.

I tried defining a function:

import pandas as pd
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

#################Reading in excel file#####################

str_df = pd.read_excel("C:\\Users\\HP\\Desktop\\context.xlsx")

################# Defining a function #####################

def sentence_finder(text, word):
    sentences = sent_tokenize(text)
    return [sent for sent in sentences if word in word_tokenize(sent)]
################# Finding Context ##########################
str_df['context'] = str_df['text'].apply(sentence_finder, args=('snakes',))

################# Output file #################################
str_df.to_excel("C:\\Users\\HP\\Desktop\\context_result.xlsx")
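For a single word this works; here is a quick check of sentence_finder on a made-up sample string (the sample text below is just illustrative):

sample = "Anaconda is venomous. Amazon is a big forest."
print(sentence_finder(sample, 'venomous'))
# -> ['Anaconda is venomous.']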

But can someone help me when I am looking for sentences containing any of several specific words, e.g. snakes, venomous, anaconda? A sentence should contain at least one of those words. I can't get nltk.tokenize to work with multiple words.

The words to search for: words = ['snakes', 'venomous', 'anaconda']

Input Excel file:

(sample data not rendered in the original post)

Desired output:

A column named Context appended next to the text column above. The context column should look like this:

 1.  [Snakes are venomous.] [Anaconda is venomous.]
 2.  [Anaconda lives in Amazon.] [It is venomous.]
 3.  [Snakes,snakes,snakes everywhere!] [The least I expect is an anaconda.Because it is venomous.]
 4.  NULL

Thanks in advance.


1 Answer

Here's how to do it:

In [1]: df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                       if any(True for w in word_tokenize(sent) 
                                               if w.lower() in searched_words)])

0    [Snakes are venomous., Anaconda is venomous.]
1    [Anaconda lives in Amazon.Amazon is a big forest., It is venomous.]
2    [Snakes,snakes,snakes everywhere!, !The least I expect is an anaconda.Because it is venomous.]
3    []
Name: text, dtype: object
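The snippet above assumes the tokenizers are imported and that searched_words already holds the lowercased search terms, along these lines:

from nltk.tokenize import sent_tokenize, word_tokenize

# the comparison uses w.lower(), so keep the search terms lowercase
searched_words = [w.lower() for w in ['snakes', 'venomous', 'anaconda']]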

You can see there are a couple of problems, because sent_tokenize is not doing its job properly due to the punctuation.


Update: handling plurals.

Here is the updated df:

(updated sample data not rendered in the original post)

We can use a stemmer (Wikipedia), for example the PorterStemmer.

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

First, let's stem and lowercase the searched words:

searched_words = ['snakes','Venomous','anacondas']
searched_words = [stemmer.stem(w.lower()) for w in searched_words]
searched_words

> ['snake', 'venom', 'anaconda']

Now we can improve the code above to include stemming:

print(df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                           if any(True for w in word_tokenize(sent) 
                                     if stemmer.stem(w.lower()) in searched_words)]))

0    [Snakes are venomous., Anaconda is venomous.]
1    [Anaconda lives in Amazon., It is venomous.]
2    [Snakes,snakes,snakes everywhere!, The least I expect is an anaconda., Because it is venomous.]
3    []
4    [I have snakes]
Name: text, dtype: object
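To plug this back into the workflow from the question, you could assign the result to a context column and write it out (a sketch reusing str_df and the output path from the question):

# build the context column with the stemmed matching and save it
str_df['context'] = str_df['text'].apply(
    lambda text: [sent for sent in sent_tokenize(text)
                  if any(stemmer.stem(w.lower()) in searched_words
                         for w in word_tokenize(sent))])
str_df.to_excel("C:\\Users\\HP\\Desktop\\context_result.xlsx")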

If substring matching is enough for you, make sure the searched words are singular, not plural:

 print(df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                           if any([(w2.lower() in w.lower()) for w in word_tokenize(sent)
                                   for w2 in searched_words])
                                ])
 )

By the way, I would probably write a function with a regular for loop here; this lambda with a list comprehension is getting out of hand.
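A minimal sketch of such a refactoring (the function name and signature here are just illustrative):

def find_context(text, searched_words, stemmer):
    # return the sentences in text that contain at least one stemmed search word
    matches = []
    for sent in sent_tokenize(text):
        stems = {stemmer.stem(w.lower()) for w in word_tokenize(sent)}
        if stems & set(searched_words):
            matches.append(sent)
    return matches

df['context'] = df['text'].apply(find_context, args=(searched_words, stemmer))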
