Suppose I have the following three DataFrames:
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
import io
import csv
import itertools
import xlsxwriter
df1 = pd.DataFrame(np.array([
    [1010667747, 'Suzhou', 'Suzhou IFS'],
    [1010667356, 'Shenzhen', 'Kingkey 100'],
    [1010667289, 'Wuhan', 'Wuhan Center']]),
    columns=['id', 'city', 'name']
)
df2 = pd.DataFrame(np.array([
    [190010, 'Shenzhen', 'Ping An Finance Centre'],
    [190012, 'Guangzhou', 'Guangzhou CTF Finance Centre'],
    [190015, 'Beijing', 'China Zun']]),
    columns=['id', 'city', 'name']
)
df3 = pd.DataFrame(np.array([
    ['ZY-13', 'Shanghai', 'Shanghai World Financial Center'],
    ['ZY-15', 'Hong Kong', 'International Commerce Centre'],
    ['ZY-16', 'Changsha', 'Changsha IFS Tower T1']]),
    columns=['id', 'city', 'name']
)
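I want to find similar building names by computing their similarity with the fuzzywuzzy package. Below is my current solution, which I would like to improve: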
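First, I concatenate the columns of all three DataFrames into a single column, full_name. Strictly speaking, I should not include id in full_name at this step, but I added it to make building names from different DataFrames easier to tell apart: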
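The code for this concatenation step is not shown in the question. A minimal sketch of how it might look follows, assuming the three frames are stacked with pandas and the fields are joined with underscores (the format seen in the results further down). The `source` column and the `frames` and `combined` names are illustrative additions, not part of the original code.

```python
import pandas as pd

# Same sample data as above, built from dicts for brevity
df1 = pd.DataFrame({
    'id': [1010667747, 1010667356, 1010667289],
    'city': ['Suzhou', 'Shenzhen', 'Wuhan'],
    'name': ['Suzhou IFS', 'Kingkey 100', 'Wuhan Center'],
})
df2 = pd.DataFrame({
    'id': [190010, 190012, 190015],
    'city': ['Shenzhen', 'Guangzhou', 'Beijing'],
    'name': ['Ping An Finance Centre', 'Guangzhou CTF Finance Centre', 'China Zun'],
})
df3 = pd.DataFrame({
    'id': ['ZY-13', 'ZY-15', 'ZY-16'],
    'city': ['Shanghai', 'Hong Kong', 'Changsha'],
    'name': ['Shanghai World Financial Center', 'International Commerce Centre',
             'Changsha IFS Tower T1'],
})

# Tag each row with the frame it came from (hypothetical 'source' column),
# then stack the three frames into one
frames = {'df1': df1, 'df2': df2, 'df3': df3}
combined = pd.concat(
    [frame.assign(source=label) for label, frame in frames.items()],
    ignore_index=True,
)

# Build full_name as id_city_name; ids are mixed int/str, so cast to str first
combined['full_name'] = (
    combined['id'].astype(str) + '_' + combined['city'] + '_' + combined['name']
)
print(combined[['source', 'full_name']])
```

Keeping `source` as its own column, rather than relying on the id prefix inside `full_name`, would let the source DataFrame be carried through to the final result.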
Second, I iterate over all full_name values, compare each one with every other, and get the similarity_ratio for each pair of building names:
df = pd.read_excel('concated_names.xlsx')
projects = df.full_name.tolist()
processedProjects = []
matchers = []
threshold_ratio = 10

# Preprocess each name once and prepare one matcher per name
for project in projects:
    if project:
        processedProject = fuzz._process_and_sort(project, True, True)
        processedProjects.append(processedProject)
        matchers.append(fuzz.SequenceMatcher(None, processedProject))

with open('output10.csv', 'w', encoding = 'utf_8_sig') as f1:
    writer = csv.writer(f1, delimiter=',', lineterminator='\n', )
    writer.writerow(('name', 'matched_name', 'similarity_ratio'))
    # Compare every unordered pair of names
    for project1, project2 in itertools.combinations(enumerate(processedProjects), 2):
        matcher = matchers[project1[0]]
        matcher.set_seq2(project2[1])
        ratio = int(round(100 * matcher.ratio()))
        if ratio >= threshold_ratio:
            #print(projects[project1[0]], projects[project2[0]])
            my_list = projects[project1[0]], projects[project2[0]], ratio
            print(my_list)
            writer.writerow(my_list)
my_list
Result:
('1010667747_Suzhou_Suzhou IFS', '1010667356_Shenzhen_Kingkey 100', 44)
('1010667747_Suzhou_Suzhou IFS', '1010667289_Wuhan_Wuhan Center', 49)
('1010667747_Suzhou_Suzhou IFS', '190010_Shenzhen_Ping An Finance Centre', 33)
('1010667747_Suzhou_Suzhou IFS', '190012_Guangzhou_Guangzhou CTF Finance Centre', 47)
......
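As a last step, I manually split output10.csv in Excel to get the final expected result shown below (it would be even better if each building also showed which DataFrame it came from):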
           id    city        name  matched_id  matched_name                matched_name.1  similarity_ratio
0  1010667747  Suzhou  Suzhou IFS  1010667356      Shenzhen                   Kingkey 100                44
1  1010667747  Suzhou  Suzhou IFS  1010667289         Wuhan                  Wuhan Center                49
2  1010667747  Suzhou  Suzhou IFS      190010      Shenzhen        Ping An Finance Centre                33
3  1010667747  Suzhou  Suzhou IFS      190012     Guangzhou  Guangzhou CTF Finance Centre                47
4  1010667747  Suzhou  Suzhou IFS      190015       Beijing                     China Zun                27
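The manual split in Excel could probably be replaced by a string split in pandas. The sketch below assumes the id_city_name format used for full_name and that neither id nor city contains an underscore. The column name `matched_city` is a hypothetical choice made here to avoid the duplicate matched_name header.

```python
import pandas as pd

# A few rows in the shape of output10.csv (values copied from the result above)
pairs = pd.DataFrame(
    [
        ('1010667747_Suzhou_Suzhou IFS', '1010667356_Shenzhen_Kingkey 100', 44),
        ('1010667747_Suzhou_Suzhou IFS', '1010667289_Wuhan_Wuhan Center', 49),
        ('1010667747_Suzhou_Suzhou IFS', '190010_Shenzhen_Ping An Finance Centre', 33),
    ],
    columns=['name', 'matched_name', 'similarity_ratio'],
)

# Split on the first two underscores only, so a building name that
# itself contains '_' stays in one piece
left = pairs['name'].str.split('_', n=2, expand=True)
left.columns = ['id', 'city', 'name']
right = pairs['matched_name'].str.split('_', n=2, expand=True)
right.columns = ['matched_id', 'matched_city', 'matched_name']

# Put both halves and the score side by side
result = pd.concat([left, right, pairs['similarity_ratio']], axis=1)
print(result)
```

In practice `pairs` would come from `pd.read_csv('output10.csv')` rather than from inline data.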
How can I get this final expected result more efficiently in Python? Thanks.
Try this solution: I use numpy and itertools to speed up and simplify the code, without needing an Excel file...
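The answer's code is not included here. What follows is not the answerer's code but a minimal sketch of an Excel-free approach in the same spirit, under several assumptions: only the name field is compared, the records are kept as plain tuples, and the standard library's `difflib.SequenceMatcher` with a hypothetical `normalize` helper stands in for fuzzywuzzy's token-sort preprocessing, so scores may differ from those shown in the question. It uses itertools and pandas, not numpy.

```python
import itertools
import re
from difflib import SequenceMatcher

import pandas as pd


def normalize(text):
    """Lowercase, replace punctuation with spaces, sort tokens (hypothetical helper)."""
    tokens = re.sub(r'[^0-9a-z]+', ' ', text.lower()).split()
    return ' '.join(sorted(tokens))


def similarity(a, b):
    """Return a 0-100 similarity score for two already-normalized strings."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# (source, id, city, name) tuples; a subset of the sample data above
records = [
    ('df1', '1010667747', 'Suzhou', 'Suzhou IFS'),
    ('df1', '1010667356', 'Shenzhen', 'Kingkey 100'),
    ('df2', '190010', 'Shenzhen', 'Ping An Finance Centre'),
    ('df2', '190012', 'Guangzhou', 'Guangzhou CTF Finance Centre'),
    ('df3', 'ZY-13', 'Shanghai', 'Shanghai World Financial Center'),
    ('df3', 'ZY-16', 'Changsha', 'Changsha IFS Tower T1'),
]
threshold_ratio = 10

# Normalize every name once instead of once per pair
normalized = [normalize(record[3]) for record in records]

rows = []
for i, j in itertools.combinations(range(len(records)), 2):
    ratio = similarity(normalized[i], normalized[j])
    if ratio >= threshold_ratio:
        # Keep all fields of both records, so no later string splitting is needed
        rows.append(records[i] + records[j] + (ratio,))

columns = [
    'source', 'id', 'city', 'name',
    'matched_source', 'matched_id', 'matched_city', 'matched_name',
    'similarity_ratio',
]
result = pd.DataFrame(rows, columns=columns)
print(result)
```

Because each pair carries its original fields, the result already has separate id, city, name, and source columns, and `result.to_csv(...)` can write it out directly if a file is still wanted.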