How do I extract UK postcodes from an address column in a DataFrame?

   col1  col2
0  1303  674 Yellow Gardens,Tunbridge Wells, Kent TN5 4NP
1  1205  154 Coller Crescent Runcorn,Cheshire WP6 4TY
2  1504  122 Uphill Road,Rayleigh, Essex SF6 9VT
3  1678  67 Lampoon Crescent,Billericay, Essex, CM52 0QY
4  1897  32 Dovelane,Benfleet, Essex, PT7 6WA
5  1654  46, The Clewter,Great Durham, Essex, CD7 9HE

df["postcodes"] = df["address"].str.extract(r'^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$')
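One possible reason this attempt does not yield a single postcode column is that the pattern contains several capturing groups, and `str.extract` returns one column per group. A minimal sketch with exactly one capturing group follows. The simplified pattern `POSTCODE_RE` and the helper `extract_postcode` are assumptions for illustration, not the full official postcode grammar, and plain `re` is used so the snippet runs without pandas.

```python
import re

# Simplified UK postcode pattern (an assumption, not the full official grammar):
# outward code = 1-2 letters, a digit, then an optional digit or letter;
# inward code  = a digit followed by two letters.
# There is exactly one capturing group, so each address yields one value.
POSTCODE_RE = re.compile(
    r"\b([A-Za-z]{1,2}[0-9][A-Za-z0-9]? ?[0-9][A-Za-z]{2})\s*$"
)


def extract_postcode(address):
    """Return the postcode at the end of `address`, or None if none is found."""
    match = POSTCODE_RE.search(address)
    return match.group(1).upper() if match else None


addresses = [
    "674 Yellow Gardens,Tunbridge Wells, Kent TN5 4NP",
    "67 Lampoon Crescent,Billericay, Essex, CM52 0QY",
    "46, The Clewter,Great Durham, Essex, CD7 9HE",
]
postcodes = [extract_postcode(a) for a in addresses]
```

With pandas, the same pattern could probably be used as `df["address"].str.extract(POSTCODE_RE.pattern, expand=False)`, which returns a Series when the pattern has a single group.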

rhs = (df1.address
       .apply(lambda x: df2[df2.Postcode.str.find(x).ge(0)]['location'])
       .bfill(axis=1)
       .iloc[:, 0])

(pd.concat([df1.app_nbr, rhs], axis=1, ignore_index=True)
 .rename(columns={0: 'app_nbr', 1: 'location'}))

col1 col2 col3 0 1303 674 Yellow Gardens,Tunbridge Wells, Kent TN5 4NP TN5 4NP 1 1205 154 Coller Crescent Runcorn,Cheshire WP6 4TY WP6 4TY 2 1504 122 Uphill Road,Rayleigh, Essex SF6 9VT SF6 9VT 3 1678 67 Lampoon Crescent,Billericay, Essex, CM52 0QY CM52 0QY 4 1897 32 Dovelane,Benfleet, Essex, PT7 6WA PT7 6WA 5 1654 46, The Clewter,Great Durham, Essex, CD7 9HE CD7 9HE

   col1  col2                                              col3 (coords)
0  1303  674 Yellow Gardens,Tunbridge Wells, Kent TN5 4NP  50.00, 1.00
1  1205  154 Coller Crescent Runcorn,Cheshire WP6 4TY      51.23, 1.05
2  1504  122 Uphill Road,Rayleigh, Essex SF6 9VT           54.65, 1.07
3  1678  67 Lampoon Crescent,Billericay, Essex, CM52 0QY   51.23, 0.95
4  1897  32 Dovelane,Benfleet, Essex, PT7 6WA              54.6, 2.23
5  1654  46, The Clewter,Great Durham, Essex, CD7 9HE      49.25, 1.23

3 answers

User

#1 · edited 2024-09-30 20:28:02

If you always need the last two values, use split() to turn the string into a list and take the last two items of that list.

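address = "Yellow Gardens,Tunbridge Wells, Kent TN5 4NP"

address_list = address.split()

# Negative indices pick the last two tokens; indexing with len(address_list) would raise IndexError
zip_code = address_list[-2] + " " + address_list[-1]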
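The split approach above can be wrapped in a small helper and applied to every address. This is a minimal sketch: `last_two_tokens` is a hypothetical name, and it assumes every address ends with a two-part postcode separated by a space.

```python
def last_two_tokens(address):
    """Join the last two whitespace-separated tokens of `address`."""
    tokens = address.split()
    if len(tokens) < 2:
        return None  # not enough tokens to form a two-part postcode
    return " ".join(tokens[-2:])


addresses = [
    "154 Coller Crescent Runcorn,Cheshire WP6 4TY",
    "122 Uphill Road,Rayleigh, Essex SF6 9VT",
    "32 Dovelane,Benfleet, Essex, PT7 6WA",
]
postcodes = [last_two_tokens(a) for a in addresses]
```

On a DataFrame, an equivalent expression would be `df["address"].str.split().str[-2:].str.join(" ")`. Either form returns the wrong value for an address that does not end in a postcode, so it only suits data that is known to be regular.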

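User

#2 · edited 2024-09-30 20:28:02

I don't know how irregular your data is or how much tolerance you have for tinkering, but fairly messy address data sometimes calls for some lateral thinking. Consider using the Google Maps API: throw the addresses at it and get clean data back, with all of Google's smarts behind it. For 1.7 million addresses you will have to pay a bit, since the free daily quota is very small.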
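As a rough sketch of this idea, the snippet below builds a request URL for the Google Geocoding API and pulls the postcode and coordinates out of a decoded response. The helper names, the `YOUR_API_KEY` placeholder, and the trimmed sample response are assumptions for illustration. The HTTP request itself is left out, so check the current API documentation and pricing before relying on it.

```python
from urllib.parse import urlencode

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address, api_key):
    """Build the request URL for one address (no request is sent here)."""
    return GEOCODE_URL + "?" + urlencode({"address": address, "key": api_key})


def parse_geocode_response(payload):
    """Return (postcode, lat, lng) from a decoded JSON response, or None."""
    results = payload.get("results") or []
    if not results:
        return None
    first = results[0]
    postcode = None
    for component in first.get("address_components", []):
        if "postal_code" in component.get("types", []):
            postcode = component.get("long_name")
            break
    location = first.get("geometry", {}).get("location", {})
    return postcode, location.get("lat"), location.get("lng")


# Hand-written sample shaped like a geocoding response (not real API output).
sample = {
    "status": "OK",
    "results": [
        {
            "address_components": [
                {"long_name": "Kent", "types": ["administrative_area_level_2"]},
                {"long_name": "TN5 4NP", "types": ["postal_code"]},
            ],
            "geometry": {"location": {"lat": 50.0, "lng": 1.0}},
        }
    ],
}

url = build_geocode_url("674 Yellow Gardens,Tunbridge Wells, Kent TN5 4NP", "YOUR_API_KEY")
parsed = parse_geocode_response(sample)
```

Keeping the parsing separate from the network call makes it easier to cache responses, which matters when each lookup is billed.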

User

#3 · edited 2024-09-30 20:28:02

Try pypostal: https://github.com/openvenues/pypostal

It is an open-source library for parsing addresses.

In [1]: from postal.parser import parse_address

In [2]: parse_address("Coller Crescent Runcorn,Cheshire WP6 4TY")
Out[2]:
[('coller crescent', 'road'),
 ('runcorn', 'city'),
 ('cheshire', 'state_district'),
 ('wp6 4ty', 'postcode')]

In [3]: parse_address("Yellow Gardens,Tunbridge Wells, Kent TN5 4NP")
Out[3]:
[('yellow gardens', 'road'),
 ('tunbridge wells', 'city'),
 ('kent', 'state_district'),
 ('tn5 4np', 'postcode')]

I also think it will handle real-world data better.
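To pull just the postcode out of the parser output, the labelled pairs can be filtered as sketched below. `postcode_from_parsed` is a hypothetical helper, and the sample list is copied from the `Out[2]` result above so the snippet runs without libpostal installed.

```python
def postcode_from_parsed(parsed):
    """Return the upper-cased postcode from (value, label) pairs, or None."""
    for value, label in parsed:
        if label == "postcode":
            return value.upper()
    return None


# Copied from the Out[2] result shown above.
parsed = [
    ("coller crescent", "road"),
    ("runcorn", "city"),
    ("cheshire", "state_district"),
    ("wp6 4ty", "postcode"),
]
postcode = postcode_from_parsed(parsed)
```

With pypostal installed, this could be applied to a column as `df["address"].apply(lambda a: postcode_from_parsed(parse_address(a)))`.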
