Pandas: matching column contents against keywords (containing spaces and parentheses)

Posted 2024-09-28 19:27:32

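A column in my DataFrame contains keywords that I want to match.

I want to check whether each row of that column contains any of the keywords and, if so, print the ones that were found.

This is what I tried: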

import pandas as pd
import re

Keywords = [

"Caden(S, A)",
"Caden(a",
"Caden(.A))",
"Caden.Q",
"Caden.K",
"Caden"
]

data = {'People' : ["Caden(S, A) Charlotte.A, Caden.K;", "Emily.P Ethan.B; Caden(a", "Grayson.Q, Lily; Caden(.A))", "Mason, Emily.Q Noah.B; Caden.Q - Riley.P"]}

df = pd.DataFrame(data)

pat = '|'.join(r"\b{}\b".format(x) for x in Keywords)

df["found"] = df['People'].str.findall(pat).str.join('; ')

print(df["found"])

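It returns NaN. I suspect the problem lies in the spaces and parentheses in the keywords.

What is the correct way to get the desired output below? Thanks.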

Caden(S, A); Caden.K
Caden(a
Caden(.A))
Caden.Q

2 Answers

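Hey, I don't know whether this solution is optimal, but it works. I simply replaced '.' with '8', '(' with '6' and ')' with '9'. I don't know why str.findall ignores these characters.

This is a bijection between {'8', '6', '9'} and {'.', '(', ')'}: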

# Replace the characters '(', ')' and '.' with the digits 6, 9 and 8
def encode(s):
    return s.replace('(', '6').replace(')', '9').replace('.', '8')

Keywords = [encode(k) for k in Keywords]
# Assign the whole column at once instead of using chained indexing (df['People'][i] = ...)
df['People'] = df['People'].map(encode)

Then apply your function:

pat = '|'.join(r"\b{}\b".format(x) for x in Keywords)
df["found"] = df['People'].str.findall(pat).str.join('; ')

The final step maps the digits back to {'.', '(', ')'}:

# Map the digits 6, 9 and 8 back to '(', ')' and '.'
def decode(s):
    return s.replace('6', '(').replace('9', ')').replace('8', '.')

df['found'] = df['found'].map(decode)
df['People'] = df['People'].map(decode)
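A likely explanation for why the original pattern misbehaves: the parentheses and dots in the keywords are regex metacharacters. Left unescaped, the parentheses become capture groups, and findall then returns tuples of group contents instead of the matched text. The pandas documentation for str.join states that lists containing non-string items produce NaN, which would account for the NaN in the question. The sketch below uses only the standard re module and the question's keywords; the variable names naive, compiled, matches and escaped_matches are introduced here for illustration.

```python
import re

Keywords = ["Caden(S, A)", "Caden(a", "Caden(.A))", "Caden.Q", "Caden.K", "Caden"]
text = "Caden(S, A) Charlotte.A, Caden.K;"

# The question's pattern: keywords are inserted without escaping.
naive = '|'.join(r"\b{}\b".format(x) for x in Keywords)
compiled = re.compile(naive)

# The unescaped parentheses happen to balance, so the pattern compiles,
# but they are now capture groups rather than literal characters.
print(compiled.groups)

# With capture groups present, findall returns tuples of group contents.
matches = compiled.findall(text)
print(matches)

# Escaping a keyword makes '(' , ')' and ' ' literal again.
escaped_matches = re.findall(re.escape("Caden(S, A)"), text)
print(escaped_matches)
```

Escaping the keywords avoids having to substitute placeholder digits, which could collide with digits that already occur in the data.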

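Since you do not need to find every keyword, only the longest one when several overlap, you can use a regex with the findall method.

The key points are: first, sort the keywords by length in descending order, so that a longer keyword is tried before any shorter keyword it starts with (the keywords also contain spaces); second, escape the values, because they contain special characters; third, replace \b with the unambiguous word boundaries (?<!\w) and (?!\w), since \b is context-dependent.

Use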

pat = r'(?<!\w)(?:{})(?!\w)'.format('|'.join(map(re.escape, sorted(Keywords,key=len,reverse=True))))
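Plugged back into the question's DataFrame, the pattern could be used roughly as follows. This is a minimal sketch that assumes the same Keywords list and People column as in the question; because the alternation sits in a non-capturing group, str.findall yields plain strings that str.join can concatenate.

```python
import re
import pandas as pd

Keywords = ["Caden(S, A)", "Caden(a", "Caden(.A))", "Caden.Q", "Caden.K", "Caden"]
data = {"People": ["Caden(S, A) Charlotte.A, Caden.K;",
                   "Emily.P Ethan.B; Caden(a",
                   "Grayson.Q, Lily; Caden(.A))",
                   "Mason, Emily.Q Noah.B; Caden.Q - Riley.P"]}
df = pd.DataFrame(data)

# Longest keywords first, each one escaped, wrapped in lookaround-based boundaries.
pat = r'(?<!\w)(?:{})(?!\w)'.format(
    '|'.join(map(re.escape, sorted(Keywords, key=len, reverse=True))))

# findall returns a list of matched strings per row; join them with '; '.
df["found"] = df["People"].str.findall(pat).str.join("; ")
print(df["found"])
```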

Online Python test:

import re
Keywords = ["Caden(S, A)", "Caden(a","Caden(.A))", "Caden.Q", "Caden.K", "Caden"]
rx = r'(?<!\w)(?:{})(?!\w)'.format('|'.join(map(re.escape, sorted(Keywords,key=len,reverse=True))))
# => (?<!\w)(?:Caden\(S,\ A\)|Caden\(\.A\)\)|Caden\(a|Caden\.Q|Caden\.K|Caden)(?!\w)
strs = ["Caden(S, A) Charlotte.A, Caden.K;", "Emily.P Ethan.B; Caden(a", "Grayson.Q, Lily; Caden(.A))", "Mason, Emily.Q Noah.B; Caden.Q - Riley.P"]
for s in strs:
    print(re.findall(rx, s))

Output:

['Caden(S, A)', 'Caden.K']
['Caden(a']
['Caden(.A))']
['Caden.Q']
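The remark that \b is context-dependent can be checked with a small experiment. \b matches only where a word character sits on exactly one side, so it fails after a keyword ending in ")" when a space follows. The sketch below uses a single escaped keyword for clarity; the names with_b and with_lookarounds are illustrative.

```python
import re

text = "Caden(S, A) Charlotte.A"

# After ")" comes a space: both are non-word characters, so \b cannot match there.
with_b = re.findall(r"\bCaden\(S, A\)\b", text)

# (?<!\w) and (?!\w) only require that no word character is adjacent.
with_lookarounds = re.findall(r"(?<!\w)Caden\(S, A\)(?!\w)", text)

print(with_b)            # []
print(with_lookarounds)  # ['Caden(S, A)']
```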
