<p>Building on the best pattern from @Olivier Hao's answer, <code>\s([\w]+)\s([\d]{6})</code>, you can get a faster one-liner using only Pandas:</p>
<pre><code>pd.concat([data_rnr, data_rnr['BORROWER ADDRESS'].str.extract(r'\s(?P<BORROWER_CITY_NAME>[\w]+)\s(?P<BORROWER_CITY_PINCODE>[\d]{6})')], axis=1)
</code></pre>
<p>Note that I named the groups directly in the regex pattern so that they become the names of the new columns.</p>
<p>The only difference from your code is that the newly created columns do not contain <code>default value</code>; rows where the pattern is not found contain <code>NaN</code> instead.</p>
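<p>A minimal sketch of that behaviour, assuming a made-up two-row Series and shortened group names (<code>CITY</code> and <code>PINCODE</code> are illustrative, not the column names used in this answer): the named groups become the column labels, and the row without a match comes back as missing values.</p>

```python
import pandas as pd

# Hypothetical two-row sample: one address that matches, one that does not.
addresses = pd.Series([
    "Uttam Nagar NA Delhi 110059 Delhi",
    "Wrong stuff",
])

# Named groups (?P<name>...) become the column labels of the result.
extracted = addresses.str.extract(r'\s(?P<CITY>\w+)\s(?P<PINCODE>\d{6})')

print(list(extracted.columns))                # ['CITY', 'PINCODE']
print(extracted.isna().all(axis=1).tolist())  # [False, True]
```

<p>The extracted values are strings, so the pincode column holds text such as <code>'110059'</code> rather than integers.</p>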
<p>I used the following data sample:</p>
<pre><code>import re
import pandas as pd

data = [
    "87 F/F Place Opp. C-2, Uttam Nagar NA Delhi 110059 Delhi",
    "87 F/F Place Opp. C-2, Uttam Nagar NA Paris 930000 Paris",
    "87 F/F Place Opp. C-2, Uttam Nagar NA Somewhere 115800 Somewhere",
    "Wrong stuff",
    "87 F/F Place Opp. C-2, Uttam Nagar NA Bombay 148444 Bombay",
]
</code></pre>
<p>Using your code, after changing the pattern and removing the prints (which take a lot of computation time), I get the following result:</p>
<pre><code>def regex():
    data_rnr = pd.DataFrame(data, columns=["BORROWER ADDRESS"])
    pincoderegex=re.compile(r'\s([\w]+)\s([\d]{6})')
    data_rnr['BORROWER CITY_NAME']='default value'
    data_rnr['BORROWER CITY_PINCODE']='default value'
    for i in range(0,len(data_rnr['BORROWER ADDRESS'])):
        try:
            data_rnr['BORROWER CITY_NAME'][i]=pincoderegex.search(data_rnr['BORROWER ADDRESS'][i]).groups()[0]
            data_rnr['BORROWER CITY_PINCODE'][i]=pincoderegex.search(data_rnr['BORROWER ADDRESS'][i]).groups()[1]
        except:
            pass
    return data_rnr
%timeit regex()
2.1 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
                                    BORROWER ADDRESS  BORROWER CITY_NAME  BORROWER CITY_PINCODE
0  87 F/F Place Opp. C-2, Uttam Nagar NA Delhi 11...               Delhi                 110059
1  87 F/F Place Opp. C-2, Uttam Nagar NA Paris 93...               Paris                 930000
2  87 F/F Place Opp. C-2, Uttam Nagar NA Somewher...           Somewhere                 115800
3                                        Wrong stuff       default value          default value
4  87 F/F Place Opp. C-2, Uttam Nagar NA Bombay 1...              Bombay                 148444
</code></pre>
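<p>Row 3 keeps <code>default value</code> in the loop version because <code>search</code> returns <code>None</code> when the pattern is not found, so calling <code>.groups()</code> raises an <code>AttributeError</code> that the bare <code>except</code> swallows. A rough standard-library sketch of the same logic with an explicit check (the helper name <code>city_and_pincode</code> and the group names are hypothetical, chosen only for illustration):</p>

```python
import re

# Same pattern as above, with named groups for readability.
pattern = re.compile(r'\s(?P<city>\w+)\s(?P<pincode>\d{6})')

def city_and_pincode(address, default='default value'):
    """Return (city, pincode) for the first match, or the default twice."""
    match = pattern.search(address)
    if match is None:
        # No "word followed by six digits" in this address.
        return default, default
    return match.group('city'), match.group('pincode')

print(city_and_pincode("87 F/F Place Opp. C-2, Uttam Nagar NA Delhi 110059 Delhi"))
# ('Delhi', '110059')
print(city_and_pincode("Wrong stuff"))
# ('default value', 'default value')
```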
<p>With the one-liner I get this result:</p>
<pre><code>def pandasExtract():
    data_rnr = pd.DataFrame(data, columns=["BORROWER ADDRESS"])
    return pd.concat([data_rnr, data_rnr['BORROWER ADDRESS'].str.extract(r'\s(?P<BORROWER_CITY_NAME>[\w]+)\s(?P<BORROWER_CITY_PINCODE>[\d]{6})')], axis=1)
%timeit pandasExtract()
1.1 ms ± 6.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
                                    BORROWER ADDRESS  BORROWER_CITY_NAME  BORROWER_CITY_PINCODE
0  87 F/F Place Opp. C-2, Uttam Nagar NA Delhi 11...               Delhi                 110059
1  87 F/F Place Opp. C-2, Uttam Nagar NA Paris 93...               Paris                 930000
2  87 F/F Place Opp. C-2, Uttam Nagar NA Somewher...           Somewhere                 115800
3                                        Wrong stuff                 NaN                    NaN
4  87 F/F Place Opp. C-2, Uttam Nagar NA Bombay 1...              Bombay                 148444
</code></pre>
<p>However, if you absolutely want to fill the <code>NaN</code> values, it takes a bit more time (still faster than your code):</p>
<pre><code>def pandasExtractWithoutNan():
    data_rnr = pd.DataFrame(data, columns=["BORROWER ADDRESS"])
    return pd.concat([data_rnr, data_rnr['BORROWER ADDRESS'].str.extract(r'\s(?P<BORROWER_CITY_NAME>[\w]+)\s(?P<BORROWER_CITY_PINCODE>[\d]{6})').fillna('default value')], axis=1)
%timeit pandasExtractWithoutNan()
1.57 ms ± 21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
                                    BORROWER ADDRESS  BORROWER_CITY_NAME  BORROWER_CITY_PINCODE
0  87 F/F Place Opp. C-2, Uttam Nagar NA Delhi 11...               Delhi                 110059
1  87 F/F Place Opp. C-2, Uttam Nagar NA Paris 93...               Paris                 930000
2  87 F/F Place Opp. C-2, Uttam Nagar NA Somewher...           Somewhere                 115800
3                                        Wrong stuff       default value          default value
4  87 F/F Place Opp. C-2, Uttam Nagar NA Bombay 1...              Bombay                 148444
</code></pre>
<p>Documentation for the functions I used:</p>
<blockquote>
<p><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html#pandas.Series.str.extract" rel="nofollow noreferrer">str.extract</a>: extracts the capture groups of the pattern found in the Series.</p>
<p><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html" rel="nofollow noreferrer">fillna</a>: fills the missing values with the given value.</p>
<p><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html" rel="nofollow noreferrer">concat</a>: concatenates a list of DataFrames along the given axis.</p>
</blockquote>