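I want to compare two different dataframes whose rows are in a different order, ignoring special characters and whitespace. If the values are the same, replace them.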

| B |
| :---: |
| abc |
| xyz |
| bnm |

| non |
| :-----: |
| ghj |

regex = r"[a-zA-Z]"
if sorted(re.split(regex, A["A"], re.MULTILINE | re.IGNORECASE)) == sorted(re.split(regex, B["B"], re.MULTILINE | re.IGNORECASE)):
    B["B"] = B["B"].replace(A["A"])
else:
    non.append(B["B"])

2 Answers

User

Answer 1 · edited 2024-09-27 18:01:36

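Common string cleaning in pandas can be done through the str accessor. You can chain the cleaning steps and then either (1) perform an inner join or (2) use .isin() to select the rows you want. Both are shown here for demonstration; .isin() is the more concise syntax.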

Data

import pandas as pd
import io

A = pd.read_csv(io.StringIO("""
A
abc
xyz
bnm
"""), sep=r"\s{2,}", engine='python')

B = pd.read_csv(io.StringIO("""
B
ABc
ghj
X_yz
B+NM
"""), sep=r"\s{2,}", engine='python')

Solution

B["B"] = B["B"].str.replace(r"[^A-Za-z]", "", regex=True)\
               .str.lower()\
               .str.strip()  # harmless safeguard; the regex above already removes whitespace

# method 1: join
B_matched = B.merge(A, how="inner", left_on="B", right_on="A")[["B"]]
# method 2: isin
B_non = B[~B["B"].isin(B_matched["B"])].rename(columns={"B": "non"})
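Since .isin() is described as the more concise option, here is a minimal sketch that uses it for both selections and skips the merge entirely. It assumes A is already clean (as in the sample data) and builds the frames directly rather than through read_csv; the names cleaned and in_a are illustrative and not part of the original answer.

```python
import pandas as pd

A = pd.DataFrame({"A": ["abc", "xyz", "bnm"]})
B = pd.DataFrame({"B": ["ABc", "ghj", "X_yz", "B+NM"]})

# Normalize B: drop every non-letter, then lowercase.
cleaned = B["B"].str.replace(r"[^A-Za-z]", "", regex=True).str.lower()

# One boolean mask drives both selections.
in_a = cleaned.isin(A["A"])
B_matched = cleaned[in_a].to_frame("B")
B_non = cleaned[~in_a].to_frame("non")
```

Unlike the merge, this keeps B's original row labels on the matched rows, which may or may not be what you want.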

Output

print(B_matched)
     B
0  abc
1  xyz
2  bnm

print(B_non)
   non
1  ghj
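The question also asks to replace matching values with those from A. One possible way to do that without overwriting B is to map a normalized key back to A's spelling. This is only a sketch: normalize is a hypothetical helper, and lookup, replacement and B_replaced are illustrative names that do not appear in the original answer.

```python
import pandas as pd


def normalize(s: pd.Series) -> pd.Series:
    """Hypothetical helper: keep letters only, lowercased."""
    return s.str.replace(r"[^A-Za-z]", "", regex=True).str.lower()


A = pd.DataFrame({"A": ["abc", "xyz", "bnm"]})
B = pd.DataFrame({"B": ["ABc", "ghj", "X_yz", "B+NM"]})

# Lookup from normalized key to the value as it is spelled in A.
lookup = pd.Series(A["A"].to_numpy(), index=normalize(A["A"]))
lookup = lookup[~lookup.index.duplicated()]  # keep the first value per key

# Missing wherever A has no counterpart for the normalized B value.
replacement = normalize(B["B"]).map(lookup)

B_replaced = replacement.dropna().to_frame("B")
non = B.loc[replacement.isna(), ["B"]].rename(columns={"B": "non"})
```

Here B itself is left untouched, so the unmatched rows in non keep their original spelling.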

User

Answer 2 · edited 2024-09-27 18:01:36

You can use replace to strip the special characters, and str.contains to check for matches:

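import re

# remove everything that is not a letter
B.B = B.B.replace(r'[^a-zA-Z]', '', regex=True)
# True when any value in A contains b, ignoring case
B['match'] = B.B.apply(lambda b: A.A.str.contains(b, flags=re.IGNORECASE).any())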

#      B  match
# 0  ABc   True
# 1  ghj  False
# 2  Xyz   True
# 3  BNM   True

Then use boolean indexing with B.match and ~B.match:

matched = B[B.match][['B']]  # new name, so B keeps its match column for the next step

#      B
# 0  ABc
# 2  Xyz
# 3  BNM

non = B[~B.match][['B']]

#      B
# 1  ghj
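Note that str.contains is a substring test, so a short cleaned value can match a longer entry in A. The sketch below contrasts it with str.fullmatch, which requires the whole value to match. The extra row "b-c" is hypothetical and added only to show the difference; cleaned, by_contains and by_fullmatch are illustrative names.

```python
import re
import pandas as pd

A = pd.DataFrame({"A": ["abc", "xyz", "bnm"]})
# "b-c" is a hypothetical extra row, not part of the original data.
B = pd.DataFrame({"B": ["ABc", "ghj", "X_yz", "B+NM", "b-c"]})

cleaned = B["B"].replace(r"[^a-zA-Z]", "", regex=True)

# Substring test: "bc" is found inside "abc", so the extra row counts as a match.
by_contains = cleaned.apply(
    lambda b: A["A"].str.contains(b, flags=re.IGNORECASE).any()
)

# Whole-string test: the pattern must cover the entire value in A.
by_fullmatch = cleaned.apply(
    lambda b: A["A"].str.fullmatch(b, flags=re.IGNORECASE).any()
)
```

If only exact matches (after cleaning) should count, the fullmatch variant is probably the safer choice.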
