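<p>Common string cleaning in <code>pandas</code> can be done through the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/series.html#api-series-str" rel="nofollow noreferrer">str accessor</a>. You can chain the cleaning steps in a single expression and then either (1) perform an inner join or (2) use <code>.isin()</code> to select the desired rows. Both are shown here for demonstration; <code>.isin()</code> is the more concise syntax.</p>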
<h2>Data</h2>
<pre><code>import pandas as pd
import io
A = pd.read_csv(io.StringIO("""
A
abc
xyz
bnm
"""), sep=r"\s{2,}", engine='python')
B = pd.read_csv(io.StringIO("""
B
ABc
ghj
X_yz
B+NM
"""), sep=r"\s{2,}", engine='python')
</code></pre>
<h2>Solution</h2>
<pre><code>B["B"] = B["B"].str.replace(r"[^A-Za-z]", "", regex=True)\
.str.lower()\
.str.strip()  # redundant here (the regex already drops whitespace); useful if your pattern keeps spaces
# method 1: join
B_matched = B.merge(A, how="inner", left_on="B", right_on="A")[["B"]]
# method 2: isin
B_non = B[~B["B"].isin(B_matched["B"])].rename(columns={"B": "non"})
</code></pre>
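<p>Since <code>.isin()</code> is described as the more concise option, here is a minimal sketch of using it on its own, without the intermediate merge. It assumes the same sample values as above, built with <code>pd.DataFrame</code> instead of <code>read_csv</code>; the names <code>cleaned</code>, <code>mask</code>, <code>matched</code> and <code>unmatched</code> are illustrative and not part of the original answer.</p>

```python
import pandas as pd

# Same sample values as in the Data section (constructed directly for brevity).
A = pd.DataFrame({"A": ["abc", "xyz", "bnm"]})
B = pd.DataFrame({"B": ["ABc", "ghj", "X_yz", "B+NM"]})

# Normalize B: drop every non-letter character, then lowercase.
cleaned = B["B"].str.replace(r"[^A-Za-z]", "", regex=True).str.lower()

# Boolean mask: True where the cleaned value also appears in A.
mask = cleaned.isin(A["A"])

matched = cleaned[mask].tolist()     # values present in A
unmatched = cleaned[~mask].tolist()  # values absent from A

print(matched)
print(unmatched)
```

<p>One mask yields both groups, so no join is needed when you only want to filter rows rather than pull in columns from <code>A</code>.</p>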
<h2>Output</h2>
<pre><code>print(B_matched)
     B
0  abc
1  xyz
2  bnm
print(B_non)
   non
1  ghj
</code></pre>
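<p>If you prefer the join route, a possible variation (not in the original answer) is a left merge with <code>indicator=True</code>, which labels each row as matched or not in a single pass. This is a sketch that assumes <code>B</code> has already been cleaned; the names <code>merged</code>, <code>both</code> and <code>left_only</code> are illustrative.</p>

```python
import pandas as pd

A = pd.DataFrame({"A": ["abc", "xyz", "bnm"]})
# Assumption: B has already gone through the cleaning step.
B = pd.DataFrame({"B": ["abc", "ghj", "xyz", "bnm"]})

# A left merge keeps every row of B; the added "_merge" column
# records whether a matching key was found in A.
merged = B.merge(A, how="left", left_on="B", right_on="A", indicator=True)

both = merged.loc[merged["_merge"] == "both", "B"].tolist()
left_only = merged.loc[merged["_merge"] == "left_only", "B"].tolist()

print(both)
print(left_only)
```

<p>This avoids running a second filter against the merge result, at the cost of carrying the extra <code>A</code> and <code>_merge</code> columns until you drop them.</p>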