Extract only the city and pincode from addresses in a column in Python

Posted 2024-09-28 21:02:29


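How can I extract only the pin code and the city from the addresses in a specific column, and assign them to two new columns, one for the city and one for the pincode? This works fine with regex in Python, but is there a faster approach? 10,000 rows take more than 6 minutes.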

Example address: 87 F/F Place Opp. C-2, Uttam Nagar NA Delhi 110059 Delhi

import re

pincoderegex = re.compile(r'([\w]*)[\s]([\d]{6})')
pincoderegex.search(ref).group()  # output: 'Delhi 110059'
pincoderegex.search(data_rnr['BORROWER ADDRESS'][80]).groups()[1]  # output: '700105'
data_rnr['BORROWER CITY_NAME']='default value'
data_rnr['BORROWER CITY_PINCODE']='default value'
for i in range(0,len(data_rnr['BORROWER ADDRESS'])):
    try:
        data_rnr['BORROWER CITY_NAME'][i]=pincoderegex.search(data_rnr['BORROWER ADDRESS'][i]).groups()[0]
        data_rnr['BORROWER CITY_PINCODE'][i]=pincoderegex.search(data_rnr['BORROWER ADDRESS'][i]).groups()[1]
    except TypeError:
        print('TypeError')
    except NameError:
        print('NameError')
    except AttributeError:
        print('AttributeError')
    except:
        pass

The output should be added to the new DataFrame columns data_rnr['BORROWER CITY_NAME'] and data_rnr['BORROWER CITY_PINCODE'].


2 answers

Building on the best pattern from @Olivier Hao's answer, \s([\w]+)\s([\d]{6}), you can get a faster one-liner using only Pandas:

pd.concat([data_rnr, data_rnr['BORROWER ADDRESS'].str.extract(r'\s(?P<BORROWER_CITY_NAME>[\w]+)\s(?P<BORROWER_CITY_PINCODE>[\d]{6})')], axis=1)

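Note that I named the groups directly in the regex pattern to create the new columns. Group names must be valid Python identifiers, so the column names use underscores instead of spaces.

Apart from that, the only difference from your code is that the newly created columns have no default value: rows where the pattern is not found contain NaN instead.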
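If the exact column names from the question (with spaces) are needed, one option is to extract with short group names and rename afterwards. The following is only a minimal sketch: the names `addresses`, `city` and `pincode` are placeholders I made up for illustration, and the two-row sample is not the real data.

```python
import pandas as pd

# Hypothetical two-row sample; the real data lives in data_rnr
addresses = pd.DataFrame(
    {
        "BORROWER ADDRESS": [
            "87 F/F Place Opp. C-2, Uttam Nagar NA Delhi 110059 Delhi",
            "Wrong stuff",
        ]
    }
)

# Group names (placeholders) become the column names of the extracted frame
pattern = r"\s(?P<city>\w+)\s(?P<pincode>\d{6})"
extracted = addresses["BORROWER ADDRESS"].str.extract(pattern)

# Rename to the column names used in the question, which contain spaces
extracted = extracted.rename(
    columns={"city": "BORROWER CITY_NAME", "pincode": "BORROWER CITY_PINCODE"}
)

# join aligns on the index, so unmatched rows keep NaN in the new columns
result = addresses.join(extracted)
print(result)
```

Note that the pincode comes back as a string, not a number, which is usually what you want for codes that may start with a zero.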

I used the following data sample:

import re
import pandas as pd

data = [
    "87 F/F Place Opp. C-2, Uttam Nagar NA Delhi 110059 Delhi",
    "87 F/F Place Opp. C-2, Uttam Nagar NA Paris 930000 Paris",
    "87 F/F Place Opp. C-2, Uttam Nagar NA Somewhere 115800 Somewhere",
    "Wrong stuff",
    "87 F/F Place Opp. C-2, Uttam Nagar NA Bombay 148444 Bombay",
]

With your code, after changing the pattern and removing the prints (which cost a lot of computation time), I get the following result:

def regex():
    data_rnr = pd.DataFrame(data, columns=["BORROWER ADDRESS"])
    pincoderegex=re.compile(r'\s([\w]+)\s([\d]{6})')
    data_rnr['BORROWER CITY_NAME']='default value'
    data_rnr['BORROWER CITY_PINCODE']='default value'
    for i in range(0,len(data_rnr['BORROWER ADDRESS'])):
        try:
            data_rnr['BORROWER CITY_NAME'][i]=pincoderegex.search(data_rnr['BORROWER ADDRESS'][i]).groups()[0]
            data_rnr['BORROWER CITY_PINCODE'][i]=pincoderegex.search(data_rnr['BORROWER ADDRESS'][i]).groups()[1]
        except:
            pass
    return data_rnr
%timeit regex()

2.1 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

                                    BORROWER ADDRESS BORROWER CITY_NAME BORROWER CITY_PINCODE
0  87 F/F Place Opp. C-2, Uttam Nagar NA Delhi 11...              Delhi                110059
1  87 F/F Place Opp. C-2, Uttam Nagar NA Paris 93...              Paris                930000
2  87 F/F Place Opp. C-2, Uttam Nagar NA Somewher...          Somewhere                115800
3                                        Wrong stuff      default value         default value
4  87 F/F Place Opp. C-2, Uttam Nagar NA Bombay 1...             Bombay                148444
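The loop above calls search twice per row and writes each cell through chained indexing. If you want to stay with plain `re` rather than the pandas string methods, a variant that searches once per row and assigns each column in a single step might look like the sketch below. This is not the code I timed: `extract_city_pincode` and the three-row sample are hypothetical, and I make no claim about how much faster it is on your data.

```python
import re
import pandas as pd

PINCODE_RE = re.compile(r"\s(\w+)\s(\d{6})")


def extract_city_pincode(address, default="default value"):
    """Return (city, pincode) for one address, or defaults if nothing matches."""
    if not isinstance(address, str):  # e.g. None or NaN in the column
        return default, default
    match = PINCODE_RE.search(address)
    return match.groups() if match else (default, default)


# Hypothetical sample standing in for data_rnr
df = pd.DataFrame(
    {
        "BORROWER ADDRESS": [
            "87 F/F Place Opp. C-2, Uttam Nagar NA Delhi 110059 Delhi",
            "Wrong stuff",
            None,
        ]
    }
)

# One regex search per row, then one assignment per column
pairs = [extract_city_pincode(a) for a in df["BORROWER ADDRESS"]]
df["BORROWER CITY_NAME"] = [city for city, _ in pairs]
df["BORROWER CITY_PINCODE"] = [pin for _, pin in pairs]
print(df)
```

Checking the match object explicitly also replaces the bare except, so unrelated errors are no longer silently swallowed.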

With the one-liner I get this result:

def pandasExtract():
    data_rnr = pd.DataFrame(data, columns=["BORROWER ADDRESS"])
    return pd.concat([data_rnr, data_rnr['BORROWER ADDRESS'].str.extract(r'\s(?P<BORROWER_CITY_NAME>[\w]+)\s(?P<BORROWER_CITY_PINCODE>[\d]{6})')], axis=1)
%timeit pandasExtract()

1.1 ms ± 6.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

                                    BORROWER ADDRESS BORROWER_CITY_NAME BORROWER_CITY_PINCODE
0  87 F/F Place Opp. C-2, Uttam Nagar NA Delhi 11...              Delhi                110059
1  87 F/F Place Opp. C-2, Uttam Nagar NA Paris 93...              Paris                930000
2  87 F/F Place Opp. C-2, Uttam Nagar NA Somewher...          Somewhere                115800
3                                        Wrong stuff                NaN                   NaN
4  87 F/F Place Opp. C-2, Uttam Nagar NA Bombay 1...             Bombay                148444

However, if you absolutely want to fill the NaN values, it takes a bit more time (still faster than your code):

def pandasExtractWithoutNan():
   data_rnr = pd.DataFrame(data, columns=["BORROWER ADDRESS"])
   return pd.concat([data_rnr, data_rnr['BORROWER ADDRESS'].str.extract(r'\s(?P<BORROWER_CITY_NAME>[\w]+)\s(?P<BORROWER_CITY_PINCODE>[\d]{6})').fillna('default value')], axis=1)
%timeit pandasExtractWithoutNan()

1.57 ms ± 21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

                                    BORROWER ADDRESS BORROWER_CITY_NAME BORROWER_CITY_PINCODE
0  87 F/F Place Opp. C-2, Uttam Nagar NA Delhi 11...              Delhi                110059
1  87 F/F Place Opp. C-2, Uttam Nagar NA Paris 93...              Paris                930000
2  87 F/F Place Opp. C-2, Uttam Nagar NA Somewher...          Somewhere                115800
3                                        Wrong stuff      default value         default value
4  87 F/F Place Opp. C-2, Uttam Nagar NA Bombay 1...             Bombay                148444
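The %timeit lines above are IPython magic and only work in IPython or Jupyter. In a plain Python script, the standard-library timeit module can be used for a comparable measurement. The sketch below assumes a small sample list and an arbitrary repeat count of 100; the numbers you get depend on your machine and will not match mine.

```python
import timeit

import pandas as pd

# Small sample, assumed for illustration only
data = [
    "87 F/F Place Opp. C-2, Uttam Nagar NA Delhi 110059 Delhi",
    "Wrong stuff",
]
PATTERN = r"\s(?P<BORROWER_CITY_NAME>\w+)\s(?P<BORROWER_CITY_PINCODE>\d{6})"


def pandas_extract():
    df = pd.DataFrame(data, columns=["BORROWER ADDRESS"])
    return pd.concat([df, df["BORROWER ADDRESS"].str.extract(PATTERN)], axis=1)


runs = 100  # arbitrary repeat count
seconds = timeit.timeit(pandas_extract, number=runs)
print(f"{seconds / runs * 1e3:.3f} ms per call")
```

For a meaningful comparison, time both variants on a realistic number of rows; a five-row sample mostly measures the fixed cost of building the DataFrame.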

Documentation for the functions I used:

str.extract: extract the patterns found in the Series.

fillna: fill the missing values by the value given.

concat: concat a list of DataFrames on the axis given.

  • ([\w]*)[\s]([\d]{6}) takes 398 steps
  • ([\w]+)\s([\d]{6}) takes 290 steps
  • \b([\w]+)\s([\d]{6}) takes 174 steps
  • \s([\w]+)\s([\d]{6}) takes 131 steps

So you can use \s([\w]+)\s([\d]{6}) for better efficiency.

https://regex101.com/r/iLIXDI/1
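Before swapping patterns, it is worth checking that they all capture the same groups on your data. A quick sketch with the standard re module, using the example address from the question (Python's re does not report step counts, so this only compares what is matched):

```python
import re

address = "87 F/F Place Opp. C-2, Uttam Nagar NA Delhi 110059 Delhi"

patterns = [
    r"([\w]*)[\s]([\d]{6})",
    r"([\w]+)\s([\d]{6})",
    r"\b([\w]+)\s([\d]{6})",
    r"\s([\w]+)\s([\d]{6})",
]

# All four patterns capture the same (city, pincode) pair on this address
results = {p: re.search(p, address).groups() for p in patterns}
for p, groups in results.items():
    print(f"{p!r:30} -> {groups}")

# Caveat: the leading \s requires whitespace before the city, so an address
# that starts directly with the city does not match the last pattern
no_leading_space = re.search(patterns[3], "Delhi 110059")
print(no_leading_space)  # None
```

If some addresses can begin with the city name, the \b variant is the safer choice, at the cost of a few more steps.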
