Pandas - find all rows where specific columns contain part of a text
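I am trying to find the rows of a DataFrame in which certain columns each contain part of a given string.
The values in these columns are all strings.
What I need is effectively the opposite of str.contains or isin(), because I want to use the column values as the substrings to look for in the string.
The string cannot be split cleanly, because the city, district, and street names may each contain spaces.
Can you help me?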
s = "Bad Testcity Teststr."
# My attempt (does not work: str.find expects a str, not a Series)
df_res = df.loc[(s.find(df['CITY']) != -1) & (s.find(df['DISTRICT']) != -1) & (s.find(df['STREET']) != -1)]
This example should return True.
<bound method DataFrame.info of ZIP CITY STREET NUMBER NUMBER_SFX DISTRICT ONKZ ASB ADSL VDSL VDSL_SV VPSZ OUTDOOR
ID
4025217 12345 Bad Testcity Teststr. 6 NaN Bad Testcity 12345 2 +017.696 +102.784 NaN 49/12345/30 O
4025219 12345 Bad Testcity Teststr. 7 NaN Bad Testcity 12345 2 +017.696 +102.784 NaN 49/12345/30 O
4025242 12345 Bad Testcity Teststr. 8 NaN Bad Testcity 12345 2 +017.696 +102.784 +185.824 49/12345/30 O
4025244 12345 Bad Testcity Teststr. 10 NaN Bad Testcity 12345 2 +017.696 +102.784 NaN 49/12345/30 O
4025245 12345 Bad Testcity Teststr. 11 NaN Bad Testcity 12345 2 +017.696 +051.392 NaN 49/12345/30 O
... ... ... ... ... ... ... ... .. ... ... ... ... ...
[1569530 rows x 13 columns]>
1 Answer
Assuming this input:
ZIP CITY STREET NUMBER NUMBER_SFX DISTRICT ONKZ ASB ADSL VDSL VDSL_SV VPSZ OUTDOOR
ID
4025217 12345 Bad Testcity Teststr. 6 NaN Bad Testcity 12345 2 17.696 102.784 NaN 49/12345/30 O
4025219 12345 Bad Testcity Teststr. 7 NaN Bad Testcity 12345 2 17.696 102.784 NaN 49/12345/30 O
4025242 12345 Bad Testcity Teststr. 8 NaN Bad Testcity 12345 2 17.696 102.784 185.824 49/12345/30 O
4025244 12345 Bad Testcity Teststr. 10 NaN Bad Testcity 12345 2 17.696 102.784 NaN 49/12345/30 O
4025245 12345 Bad Testcity Teststr. 11 NaN Bad Testcity 12345 2 17.696 51.392 NaN 49/12345/30 O
You can join the columns with a space and then use str.contains on the result:
s = "Bad Testcity Teststr."
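# regex=False matches s literally, so the "." in "Teststr." is not treated as a wildcard
df_res = df.loc[(df['CITY']+' '+df['DISTRICT']+' '+df['STREET']).str.contains(s, regex=False)]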
A less efficient alternative (which can still be useful at times):
df_res = df.loc[df[['CITY', 'DISTRICT', 'STREET']]
                .apply(' '.join, axis=1)
                .str.contains(s, regex=False)]
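One detail worth checking with the concatenation approach: str.contains treats its pattern as a regular expression by default, so the "." in "Teststr." would match any character. The sketch below, on a small made-up DataFrame (the rows and the TeststrX value are assumptions for illustration only), shows the difference that regex=False makes.

```python
import pandas as pd

# Hypothetical toy data; only the three text columns matter here.
df = pd.DataFrame(
    {
        "CITY": ["Bad Testcity", "Bad Testcity", "Othertown"],
        "DISTRICT": ["Bad Testcity", "Bad Testcity", "Othertown"],
        "STREET": ["Teststr.", "TeststrX", "Teststr."],
    },
    index=pd.Index([4025217, 4025219, 4025242], name="ID"),
)

s = "Bad Testcity Teststr."

# Join the three columns with single spaces, row by row.
combined = df["CITY"] + " " + df["DISTRICT"] + " " + df["STREET"]

# Literal match: only rows whose joined text contains s exactly.
mask = combined.str.contains(s, regex=False)

# Default (regex) match: "." is a wildcard, so "TeststrX" matches as well.
mask_regex = combined.str.contains(s)

df_res = df.loc[mask]
```

With this data, mask selects only the first row, while mask_regex also selects the second one. If the columns can contain missing values, passing na=False to str.contains is one way to keep the mask strictly boolean.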
Output (unchanged here, since every row matches):
ZIP CITY STREET NUMBER NUMBER_SFX DISTRICT ONKZ ASB ADSL VDSL VDSL_SV VPSZ OUTDOOR
ID
4025217 12345 Bad Testcity Teststr. 6 NaN Bad Testcity 12345 2 17.696 102.784 NaN 49/12345/30 O
4025219 12345 Bad Testcity Teststr. 7 NaN Bad Testcity 12345 2 17.696 102.784 NaN 49/12345/30 O
4025242 12345 Bad Testcity Teststr. 8 NaN Bad Testcity 12345 2 17.696 102.784 185.824 49/12345/30 O
4025244 12345 Bad Testcity Teststr. 10 NaN Bad Testcity 12345 2 17.696 102.784 NaN 49/12345/30 O
4025245 12345 Bad Testcity Teststr. 11 NaN Bad Testcity 12345 2 17.696 51.392 NaN 49/12345/30 O
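The question also describes the reverse direction: checking that each column value occurs somewhere inside s, regardless of order. A minimal sketch of that check, assuming a small made-up DataFrame and plain Python substring tests (this is an illustration, not part of the answer above):

```python
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame(
    {
        "CITY": ["Bad Testcity", "Bad Testcity", "Bad Testcity"],
        "DISTRICT": ["Bad Testcity", "Bad Testcity", None],
        "STREET": ["Teststr.", "Otherstr.", "Teststr."],
    }
)

s = "Bad Testcity Teststr."
cols = ["CITY", "DISTRICT", "STREET"]

# For each row, require every column value to be a string that occurs
# literally inside s. Missing values (None/NaN) fail the isinstance check.
mask = pd.Series(
    [
        all(isinstance(v, str) and v in s for v in values)
        for values in zip(*(df[c] for c in cols))
    ],
    index=df.index,
)

df_res = df.loc[mask]
```

Here only the first row is kept: the second row's street is not part of s, and the third row has a missing district. Note that an empty string counts as a substring of any text, so empty values would pass this check.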