Extract all matching keywords from a word list and create a new DataFrame

import pandas as pd

df = pd.DataFrame({
    'opinions': [
        "I think the movie is fantastic. Shame it's so short!",
        "How did they make it?",
        "I had a fantastic time at the cinema last night!",
        "I really disliked the cast",
        "the film was sad and boring",
        "Absolutely loved the movie! Can't wait to see part 2",
    ]
})

keywords = ['movie', 'great', 'fantastic', 'loved']

query = '|'.join(keywords)
df['word'] = df['opinions'].str.extract('({})'.format(query))

print(df)

2 Answers

User

Answer 1 · edited 2024-09-30 07:26:56

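If you only want to match whole words, you need word-boundary markers; otherwise prefixes (and suffixes) of longer words will match as well. For example: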

import pandas as pd

df = pd.DataFrame({
    'opinions':[
        "I think the movie is fantastic. Shame it's so short!",
        "How did they make it?",
        "I had a fantastic time at the cinema last night!",
        "I really disliked the cast",
        "the film was sad and boring",
        "Absolutely loved the movie! Can't wait to see part 2",
        "He has greatness within"
    ]
})

keywords = ['movie', 'great', 'fantastic', 'loved']

query = '|'.join(keywords)
df['word'] = df['opinions'].str.findall(r'\b({})\b'.format(query))

print(df)

Output

                                            opinions                word
0  I think the movie is fantastic. Shame it's so ...  [movie, fantastic]
1                              How did they make it?                  []
2   I had a fantastic time at the cinema last night!         [fantastic]
3                         I really disliked the cast                  []
4                        the film was sad and boring                  []
5  Absolutely loved the movie! Can't wait to see ...      [loved, movie]
6                            He has greatness within                  []

In the example above, greatness does not match because of the word boundaries (\b).
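One caveat the example above does not cover: joining raw keywords with `|` treats regex metacharacters such as `.` or `+` as operators rather than literal text. The following is a minimal sketch, using the standard `re` module instead of pandas (`Series.str.findall` applies `re.findall` to each element), that escapes every keyword first. The helper name `build_pattern` is hypothetical and not part of the original answer.

```python
import re

def build_pattern(keywords):
    """Build a whole-word alternation pattern from literal keywords."""
    # Escape each keyword so characters like '.' or '+' are matched literally.
    escaped = [re.escape(k) for k in keywords]
    # A non-capturing group keeps findall returning the full match.
    return r'\b(?:{})\b'.format('|'.join(escaped))

pattern = build_pattern(['movie', 'great', 'fantastic', 'loved'])

print(re.findall(pattern, "I think the movie is fantastic."))  # ['movie', 'fantastic']
print(re.findall(pattern, "He has greatness within"))          # []
```

The same pattern string can be passed to `df['opinions'].str.findall(...)`. Note that `\b` only behaves as a whole-word test when each keyword begins and ends with a word character.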

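A note on performance

If you are looking for an efficient solution for large data, a regex union is not a good approach (see here). I suggest using a library such as trrex.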

import pandas as pd
import trrex as tx

df = pd.DataFrame({
    'opinions': [
        "I think the movie is fantastic. Shame it's so short!",
        "How did they make it?",
        "I had a fantastic time at the cinema last night!",
        "I really disliked the cast",
        "the film was sad and boring",
        "Absolutely loved the movie! Can't wait to see part 2",
        "He has greatness within"
    ]
})

keywords = ['movie', 'great', 'fantastic', 'loved']
query = tx.make(keywords, left=r"\b(", right=r")\b")

df['word'] = df['opinions'].str.findall(r'{}'.format(query))

print(df)

Output (using trrex)

                                            opinions                word
0  I think the movie is fantastic. Shame it's so ...  [movie, fantastic]
1                              How did they make it?                  []
2   I had a fantastic time at the cinema last night!         [fantastic]
3                         I really disliked the cast                  []
4                        the film was sad and boring                  []
5  Absolutely loved the movie! Can't wait to see ...      [loved, movie]
6                            He has greatness within                  []

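For a performance comparison, see the figure below:

For a set of 25K words, trrex is about 300 times faster than the union regex. The experiment in the figure above can be reproduced with the following gist.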
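To give an intuition for why this can help, here is a rough sketch of a trie-based pattern builder. It assumes that trrex builds its pattern from a trie (prefix tree), as its name suggests; this is not trrex's actual implementation, and the function name `trie_pattern` is hypothetical. Shared prefixes are factored out, so the regex engine does not have to retry each keyword from scratch at every position.

```python
import re

def trie_pattern(words):
    """Build a regex that shares common prefixes between words (a sketch)."""
    # Build a nested-dict trie; the empty-string key marks the end of a word.
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[''] = {}

    def to_regex(node):
        # Each branch is an escaped character followed by its subtree's regex.
        branches = [re.escape(ch) + to_regex(child)
                    for ch, child in sorted(node.items()) if ch != '']
        if not branches:
            return ''
        if len(branches) == 1 and '' not in node:
            return branches[0]
        group = '(?:' + '|'.join(branches) + ')'
        # If a word can end at this node, whatever follows is optional.
        return group + '?' if '' in node else group

    return to_regex(trie)

core = trie_pattern(['great', 'greatness', 'movie'])
print(core)  # (?:great(?:ness)?|movie)

pattern = r'\b' + core + r'\b'
print(re.findall(pattern, "He has greatness within the great movie"))
# ['greatness', 'great', 'movie']
```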

Disclaimer: I am the author of trrex.

User

Answer 2 · edited 2024-09-30 07:26:56

You should replace extract with Series.str.findall:

Find all occurrences of pattern or regular expression in the Series/Index.

Equivalent to applying re.findall() to all the elements in the Series/Index.

print(df)
                                                opinions                word
    0  I think the movie is fantastic. Shame it's so ...  [movie, fantastic]
    1                              How did they make it?                  []
    2   I had a fantastic time at the cinema last night!         [fantastic]
    3                         I really disliked the cast                  []
    4                        the film was sad and boring                  []
    5  Absolutely loved the movie! Can't wait to see ...      [loved, movie]
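The difference between the two methods can be illustrated without pandas. `str.extract` keeps only the first match in each row, while `str.findall` is described in the documentation quoted above as equivalent to `re.findall`. In this minimal sketch with the standard `re` module, `re.search` stands in for the first-match behaviour; it is an illustration, not the code from the original answer.

```python
import re

keywords = ['movie', 'great', 'fantastic', 'loved']
pattern = '({})'.format('|'.join(keywords))
text = "I think the movie is fantastic. Shame it's so short!"

# First match only, which is what a single-group extract would keep.
first = re.search(pattern, text).group(1)
# Every match in the string, which is what findall returns.
all_matches = re.findall(pattern, text)

print(first)        # movie
print(all_matches)  # ['movie', 'fantastic']
```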
