Extract all matching keywords from a list of words and create a new DataFrame

Posted 2024-09-30 07:26:56


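I want to extract all matching keywords from the opinions column: whenever a word in an opinion matches a word in the keyword list, every match (including repeated words) should be printed in a new column. The current code extracts only the first matching word and does not include repeats.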

import pandas as pd

df = pd.DataFrame({
    'opinions':[
        "I think the movie is fantastic. Shame it's so short!",
        "How did they make it?",
        "I had a fantastic time at the cinema last night!",
        "I really disliked the cast",
        "the film was sad and boring",
        "Absolutely loved the movie! Can't wait to see part 2",
    ]
})

keywords = ['movie', 'great', 'fantastic', 'loved']

query = '|'.join(keywords)
df['word'] = df['opinions'].str.extract( '({})'.format(query) )

print(df)

Current output: [image not available]


2 answers

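If you only want to match whole words, you need word-boundary markers; otherwise a keyword will also match when it is merely the prefix (or suffix) of a longer word. For example: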

import pandas as pd

df = pd.DataFrame({
    'opinions':[
        "I think the movie is fantastic. Shame it's so short!",
        "How did they make it?",
        "I had a fantastic time at the cinema last night!",
        "I really disliked the cast",
        "the film was sad and boring",
        "Absolutely loved the movie! Can't wait to see part 2",
        "He has greatness within"
    ]
})

keywords = ['movie', 'great', 'fantastic', 'loved']

query = '|'.join(keywords)
df['word'] = df['opinions'].str.findall(r'\b({})\b'.format(query))

print(df)

Output

                                            opinions                word
0  I think the movie is fantastic. Shame it's so ...  [movie, fantastic]
1                              How did they make it?                  []
2   I had a fantastic time at the cinema last night!         [fantastic]
3                         I really disliked the cast                  []
4                        the film was sad and boring                  []
5  Absolutely loved the movie! Can't wait to see ...      [loved, movie]
6                            He has greatness within                  []

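In the example above, greatness is not matched because of the word boundaries (\b): the keyword great is immediately followed by another word character, so the closing \b fails.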
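The effect of the word boundary can be checked in isolation with the standard re module. The snippet below is a minimal sketch, not part of the original answer; wrapping each keyword in re.escape is an added precaution (an assumption on my part) for keyword lists that might contain regex metacharacters.

```python
import re

keywords = ['movie', 'great', 'fantastic', 'loved']

# re.escape is an extra safeguard that the original answer does not use;
# it is a no-op for these purely alphabetic keywords.
alternation = '|'.join(re.escape(k) for k in keywords)

unbounded = re.compile('({})'.format(alternation))
bounded = re.compile(r'\b({})\b'.format(alternation))

text = "He has greatness within"

print(unbounded.findall(text))  # ['great']  (prefix of "greatness")
print(bounded.findall(text))    # []         (no whole-word match)
```

The same compiled pattern string can be passed to str.findall on the pandas column.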

A note on performance

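If you are looking for an efficient solution for large data, a union (alternation) regex is not a good approach (see here). I suggest using a library such as trrex: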

import pandas as pd
import trrex as tx

df = pd.DataFrame({
    'opinions': [
        "I think the movie is fantastic. Shame it's so short!",
        "How did they make it?",
        "I had a fantastic time at the cinema last night!",
        "I really disliked the cast",
        "the film was sad and boring",
        "Absolutely loved the movie! Can't wait to see part 2",
        "He has greatness within"
    ]
})

keywords = ['movie', 'great', 'fantastic', 'loved']
query = tx.make(keywords, left=r"\b(", right=r")\b")

df['word'] = df['opinions'].str.findall(r'{}'.format(query))

print(df)

Output (using trrex)

                                            opinions                word
0  I think the movie is fantastic. Shame it's so ...  [movie, fantastic]
1                              How did they make it?                  []
2   I had a fantastic time at the cinema last night!         [fantastic]
3                         I really disliked the cast                  []
4                        the film was sad and boring                  []
5  Absolutely loved the movie! Can't wait to see ...      [loved, movie]
6                            He has greatness within                  []
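For intuition about why a trie-based pattern can help, the sketch below builds a prefix-sharing regex by hand. It is a simplified illustration only and is not trrex's actual implementation; the helper names build_trie and trie_to_pattern, and the extra keywords movies and greatness, are hypothetical and chosen just to show shared prefixes being factored out.

```python
import re

def build_trie(words):
    """Store the words in a nested dict; the key '' marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[''] = {}
    return trie

def trie_to_pattern(node):
    """Turn a trie node into a regex fragment with common prefixes factored out."""
    is_word_end = '' in node
    alternatives = [
        re.escape(ch) + trie_to_pattern(node[ch])
        for ch in sorted(k for k in node if k != '')
    ]
    if not alternatives:
        return ''
    if len(alternatives) == 1 and not is_word_end:
        return alternatives[0]
    body = '(?:' + '|'.join(alternatives) + ')'
    # If a word may end here, the remaining suffix is optional.
    return body + '?' if is_word_end else body

# Hypothetical keyword list with shared prefixes.
keywords = ['movie', 'movies', 'great', 'greatness']
pattern = trie_to_pattern(build_trie(keywords))
print(pattern)  # (?:great(?:ness)?|movie(?:s)?)

query = r'\b(' + pattern + r')\b'
text = "the movies had greatness but this movie is just great"
print(re.findall(query, text))  # ['movies', 'greatness', 'movie', 'great']
```

With a plain union, the engine may try each alternative from scratch at every position; factoring out shared prefixes lets it abandon many keywords at once, which is presumably where most of the gain for large keyword lists comes from.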

For a performance comparison, see the figure: [image not available]

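For a set of 25K words, trrex is 300 times faster than the union regex. The experiment shown in the figure can be reproduced with the accompanying gist.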

Disclaimer: I am the author of trrex.

You should replace extract with Series.str.findall:

Find all occurrences of pattern or regular expression in the Series/Index.

Equivalent to applying re.findall() to all the elements in the Series/Index.

print(df)
                                                opinions                word
    0  I think the movie is fantastic. Shame it's so ...  [movie, fantastic]
    1                              How did they make it?                  []
    2   I had a fantastic time at the cinema last night!         [fantastic]
    3                         I really disliked the cast                  []
    4                        the film was sad and boring                  []
    5  Absolutely loved the movie! Can't wait to see ...      [loved, movie]
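The replacement suggested in this answer can be sketched as follows. The two-row DataFrame is a made-up sample (the second sentence is an assumption added to show repeated matches), and expand=False is used only so that extract returns a Series that is easy to compare.

```python
import pandas as pd

# Hypothetical sample; the second row repeats a keyword on purpose.
df = pd.DataFrame({
    'opinions': [
        "I think the movie is fantastic. Shame it's so short!",
        "A great movie about a great dog",
    ]
})

keywords = ['movie', 'great', 'fantastic', 'loved']
query = '|'.join(keywords)

# str.extract keeps only the first match in each row.
first_only = df['opinions'].str.extract('({})'.format(query), expand=False)

# str.findall returns every match per row, repeats included.
df['word'] = df['opinions'].str.findall(query)

print(first_only.tolist())  # ['movie', 'great']
print(df['word'].tolist())  # [['movie', 'fantastic'], ['great', 'movie', 'great']]
```

Each cell of the new column is a Python list, so rows without any match hold an empty list rather than NaN.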
