Pandas - 找到所有特殊列包含某部分文本的行

2 投票
1 回答
70 浏览
提问于 2025-04-14 16:40

我遇到了一个问题,想要在一个数据表(DataFrame)中找到那些有两列包含某个字符串的一部分的行。

这两列的值都是字符串类型。
我想要的功能其实是和 str.containsisin() 相反,因为我想用列的值来做子字符串的匹配。
这个字符串不容易被清晰地拆分,因为“城市名”、“区名”和“街道名”这三个值可能会有空格。

你能帮我吗?

s = "Bad Testcity Teststr."
df_res = df.loc[(s.find(df['CITY'] != -1) & (s.find(df['DISTRICT'] != -1) & (s.find(df['STREET'] != -1)]`

这个示例应该返回 TRUE。

<bound method DataFrame.info of             ZIP              CITY               STREET NUMBER   NUMBER_SFX         DISTRICT   ONKZ ASB      ADSL      VDSL   VDSL_SV         VPSZ OUTDOOR
ID
4025217   12345  Bad Testcity          Teststr.          6              NaN  Bad Testcity  12345   2  +017.696  +102.784       NaN  49/12345/30       O
4025219   12345  Bad Testcity          Teststr.          7              NaN  Bad Testcity  12345   2  +017.696  +102.784       NaN  49/12345/30       O
4025242   12345  Bad Testcity          Teststr.          8              NaN  Bad Testcity  12345   2  +017.696  +102.784  +185.824  49/12345/30       O
4025244   12345  Bad Testcity          Teststr.         10              NaN  Bad Testcity  12345   2  +017.696  +102.784       NaN  49/12345/30       O
4025245   12345  Bad Testcity          Teststr.         11              NaN  Bad Testcity  12345   2  +017.696  +051.392       NaN  49/12345/30       O
...         ...              ...                   ...        ...              ...              ...    ...  ..       ...       ...       ...          ...     ...

[1569530 rows x 13 columns]>

1 个回答

0

假设输入是这样的:

           ZIP          CITY    STREET  NUMBER  NUMBER_SFX      DISTRICT   ONKZ  ASB    ADSL     VDSL  VDSL_SV         VPSZ OUTDOOR
ID                                                                                                                                 
4025217  12345  Bad Testcity  Teststr.       6         NaN  Bad Testcity  12345    2  17.696  102.784      NaN  49/12345/30       O
4025219  12345  Bad Testcity  Teststr.       7         NaN  Bad Testcity  12345    2  17.696  102.784      NaN  49/12345/30       O
4025242  12345  Bad Testcity  Teststr.       8         NaN  Bad Testcity  12345    2  17.696  102.784  185.824  49/12345/30       O
4025244  12345  Bad Testcity  Teststr.      10         NaN  Bad Testcity  12345    2  17.696  102.784      NaN  49/12345/30       O
4025245  12345  Bad Testcity  Teststr.      11         NaN  Bad Testcity  12345    2  17.696   51.392      NaN  49/12345/30       O

你可以把这些列用空格连接起来,然后在结果上使用 str.contains 方法:

s = "Bad Testcity Teststr."

df_res = df.loc[(df['CITY']+' '+df['DISTRICT']+' '+df['STREET']).str.contains(s)]

还有一种效率较低的替代方法(不过有时候也挺有用的):

df_res = df.loc[df[['CITY', 'DISTRICT', 'STREET']]
                .apply(' '.join, axis=1)
                .str.contains(s)]

输出(这里没有变化):

           ZIP          CITY    STREET  NUMBER  NUMBER_SFX      DISTRICT   ONKZ  ASB    ADSL     VDSL  VDSL_SV         VPSZ OUTDOOR
ID                                                                                                                                 
4025217  12345  Bad Testcity  Teststr.       6         NaN  Bad Testcity  12345    2  17.696  102.784      NaN  49/12345/30       O
4025219  12345  Bad Testcity  Teststr.       7         NaN  Bad Testcity  12345    2  17.696  102.784      NaN  49/12345/30       O
4025242  12345  Bad Testcity  Teststr.       8         NaN  Bad Testcity  12345    2  17.696  102.784  185.824  49/12345/30       O
4025244  12345  Bad Testcity  Teststr.      10         NaN  Bad Testcity  12345    2  17.696  102.784      NaN  49/12345/30       O
4025245  12345  Bad Testcity  Teststr.      11         NaN  Bad Testcity  12345    2  17.696   51.392      NaN  49/12345/30       O

撰写回答