Return the character difference between two strings

Published 2024-10-01 07:44:48


I am working with a dataset of about 400,000 rows of preprocessed strings.

[In]:
raw                                preprocessed

helpersstreet 46, second floor     helpersstreet 46

489 john doe route                 john doe route

at main street 49                  main street

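Every string in the "preprocessed" column is the same length as, or shorter than, the corresponding string in the "raw" column. Is there a fast way to compare these strings and return all the differences, collected into a single column: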

[Out]:
raw                                preprocessed        difference

helpersstreet 46, second floor     helpersstreet 46    ,second floor

489 john doe route                 john doe route      489

at main street 49                  main street         at 49

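I honestly don't know how to approach this, and I am also wondering whether it is the right way to go. I have access to the functions that do the preprocessing, so would it be faster to modify them to return these values, or is there a scalable way to compute the differences afterwards? I would prefer the latter.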


1 Answer
Answer 1 · posted 2024-10-01 07:44:48

Option 1
This calls for a row-by-row replacement. You can do it with a list comprehension:

df['difference'] = [i.replace(j, '') for i, j in zip(df.raw, df.preprocessed)]

df    
                              raw      preprocessed      difference
0  helpersstreet 46, second floor  helpersstreet 46  , second floor
1              489 john doe route    john doe route            489 
2               at main street 49       main street          at  49

Given the constraints of the problem (a replacement between two columns is hard to vectorize), I believe this is your fastest option.
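The output above still carries leftover spaces ("489 " and "at  49"), whereas the question asks for "489" and "at 49". A minimal sketch of one way to tidy that up, assuming it is acceptable to collapse runs of whitespace in the result. The helper name `difference` is hypothetical, and plain lists stand in for the two DataFrame columns so the snippet runs on its own:

```python
def difference(raw, preprocessed):
    """Remove `preprocessed` from `raw`, then collapse leftover whitespace."""
    # Same replacement as Option 1: drops every occurrence of `preprocessed`.
    leftover = raw.replace(preprocessed, '')
    # split() + join() trims both ends and squeezes repeated spaces into one.
    return ' '.join(leftover.split())

# Sample rows copied from the question.
raw = ['helpersstreet 46, second floor', '489 john doe route', 'at main street 49']
preprocessed = ['helpersstreet 46', 'john doe route', 'main street']

result = [difference(i, j) for i, j in zip(raw, preprocessed)]
print(result)

# With the DataFrame from the question, the same helper would be used as:
# df['difference'] = [difference(i, j) for i, j in zip(df.raw, df.preprocessed)]
```

The extra split and join add a little work per row, so this should be somewhat slower than the bare replacement.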


Option 2
Alternatively, np.vectorize a lambda:

import numpy as np

f = np.vectorize(lambda i, j: i.replace(j, ''))
df['difference'] = f(df.raw, df.preprocessed)

df    
                              raw      preprocessed      difference
0  helpersstreet 46, second floor  helpersstreet 46  , second floor
1              489 john doe route    john doe route            489 
2               at main street 49       main street          at  49

Note that this only hides the loop; it is about as fast (or slow) as Option 1, if not worse.


Option 3
Using apply, which I do not recommend:

df['difference'] = df.apply(lambda x: x.raw.replace(x.preprocessed, ''), axis=1)

df
                              raw      preprocessed      difference
0  helpersstreet 46, second floor  helpersstreet 46  , second floor
1              489 john doe route    john doe route            489 
2               at main street 49       main street          at  49

This also hides the loop, but with far more overhead than Option 2.
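All three options rely on str.replace, which only helps when the preprocessed string appears inside the raw string as one contiguous piece. If the preprocessing can remove text from the middle, the replacement finds nothing and returns the raw string unchanged. A possible fallback, sketched here with the standard library's difflib rather than anything from the original answer, is to collect the spans of the raw string that have no counterpart in the preprocessed one. The helper name `removed_text` and the last sample row are hypothetical:

```python
from difflib import SequenceMatcher

def removed_text(raw, preprocessed):
    """Return the parts of `raw` that have no counterpart in `preprocessed`."""
    # autojunk=False turns off difflib's heuristic that skips very frequent
    # characters in long sequences.
    matcher = SequenceMatcher(None, raw, preprocessed, autojunk=False)
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'delete' and 'replace' mark spans of `raw` missing from `preprocessed`.
        if tag in ('delete', 'replace'):
            pieces.append(raw[i1:i2])
    # Collapse leftover whitespace so the pieces read as one string.
    return ' '.join(''.join(pieces).split())

print(removed_text('helpersstreet 46, second floor', 'helpersstreet 46'))
print(removed_text('489 john doe route', 'john doe route'))
print(removed_text('at main street 49', 'main street'))
# Hypothetical non-contiguous case, where str.replace would change nothing:
print(removed_text('at main street 49', 'main 49'))
```

SequenceMatcher does far more work per row than str.replace, so on 400,000 rows this is likely to be much slower; it may be worth applying it only to rows where the simple replacement leaves the raw string unchanged.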


Timings
At the request of my friend, Mr. Jezrael:

df = pd.concat([df] * 10000, ignore_index=True)  # setup

# Option 1
%timeit df['difference'] = [i.replace(j, '') for i, j in zip(df.raw, df.preprocessed)]
186 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Option 2
%timeit df['difference'] = f(df.raw, df.preprocessed)  
326 ms ± 14.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Option 3
%timeit df['difference'] = df.apply(lambda x: x.raw.replace(x.preprocessed, ''), 1) 
20.8 s ± 237 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
