Best way to eliminate a nested loop with complex logic

Posted 2024-10-05 14:26:45


I have a program that reads a spreadsheet of properties into a dataframe, then queries a SQL database to build a second dataframe, and finally runs a cosine similarity function over the two dataframes to tell me which addresses in the spreadsheet are already in my database.

Below is the code for my cosine similarity function, along with some helper functions. My problem is that it is very slow on a sheet with hundreds or thousands of addresses, because it uses nested for loops to build a list of the best similarity for each address.

import string
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")
    
def clean_address(text):
    text = ''.join([word for word in text if word not in string.punctuation])
    text = text.lower()
    return text

def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)
  
def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())])
    sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator
    
def getCosineSimilarities(internalDataframe, externalDataframe):
    similarities = []
    internalAddressColumn = internalDataframe['Address']
    internalPostcodeColumn = internalDataframe['postcode']
    externalAddressColumn = externalDataframe['full address']
    externalPostcodeColumn = externalDataframe['postcode']

    for i in range(len(internalDataframe)):
        bestSimilarity = 0
        for j in range(len(externalDataframe)):
            if internalPostcodeColumn.iloc[i].rstrip() == externalPostcodeColumn.iloc[j]:
                vector1 = text_to_vector(clean_address(internalAddressColumn.iloc[i]))
                vector2 = text_to_vector(clean_address(externalAddressColumn.iloc[j]))
                cosine = get_cosine(vector1, vector2)
                if cosine > bestSimilarity:
                    bestSimilarity = cosine
        similarities.append(bestSimilarity)
    
    return similarities

I'm sure it must be possible to build the "similarities" list returned by getCosineSimilarities with a list comprehension or something similar, but I can't work out the best way to do it.

Can anyone help?

Edit: internalDataframe.head(5)

     Name              postcode    Created  
0    Mr Joe Bloggs     SW6 6RD     2020-10-21 14:15:58.140            
1    Mrs Joanne Bloggs SE17 1LN    2013-06-27 14:52:29.417
2    Mr John Doe       SW17 0LN    2017-02-23 16:22:03.630
3    Mrs Joanne Doe    SW6 7JX     2019-07-03 14:52:00.773
4    Mr Joe Public     W5 2RX      2012-11-19 10:28:47.863

externalDataframe.head(5)

address_id  category beds postcode 
1005214     FLA      2    NW5 4DA  
1009390     FLA      2    NW5 1PB  
1053948     FLA      2    NW6 3SJ  
1075629     FLA      2    NW6 7UP
1084325     FLA      2    NW6 7YQ 

3 Answers

As you say, the problem here is the nested loop. For every item in internalDataframe you perform a number of expensive operations on externalDataframe:

  • text_to_vector involves a regex findall and the creation of a Counter. You could memoize the results for the values in externalDataframe and adapt your function accordingly (see the sketch after this list).
  • get_cosine squares every item in vec1 and vec2 (plus a cast to float). Again, you could memoize the results for externalDataframe and adapt your function accordingly; in this case you may also want to cache the results for internalDataframe.
  • Less importantly, for x in list(vec1.keys()) is redundant: it forces the dict_keys into a list (one iteration) and then iterates over that list (another iteration). Just write for x in vec1.keys().
  • Even less importantly, instead of checking whether the product of the square roots is zero, you could check whether sum1 or sum2 is zero before computing that product.
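
A minimal sketch of that memoization idea, reusing the clean_address, text_to_vector and get_cosine helpers from the question (the function name and variable names here are illustrative, not part of the original code):

def getCosineSimilaritiesMemo(internalDataframe, externalDataframe):
    # Build each external vector once, up front, instead of inside the inner loop
    externalVectors = [text_to_vector(clean_address(a))
                       for a in externalDataframe['full address']]
    externalPostcodes = externalDataframe['postcode'].tolist()

    similarities = []
    for address, postcode in zip(internalDataframe['Address'],
                                 internalDataframe['postcode']):
        # The internal vector is also hoisted out of the inner loop
        vec1 = text_to_vector(clean_address(address))
        best = 0.0
        for vec2, externalPostcode in zip(externalVectors, externalPostcodes):
            if postcode.rstrip() == externalPostcode:
                best = max(best, get_cosine(vec1, vec2))
        similarities.append(best)
    return similarities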

It sounds like you need something like a distance matrix. Based on this SO answer, here is a sketch of how to compare all pairs of strings from two dataframe columns:

import pandas as pd
import numpy as np
from collections import Counter
import math

def text2vec(text):
    # just a naive transformation
    return Counter(text.split())

def get_cosine(text1, text2):
    """Modified version of your function – you might want to improve 
       it some more following gimix's advice or even better, make 
       full use of numpy arrays"""
    vec1, vec2 = text2vec(text1), text2vec(text2)
    
    intersection = set(vec1) & set(vec2)
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    sum1 = sum(v ** 2 for v in vec1.values())
    sum2 = sum(v ** 2 for v in vec2.values())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator
    
# Makes your function vector-ready
cos_sim = np.vectorize(get_cosine)

# Some pseudo data
data = {"address":["An address in some city", 
                   "Cool location in some town", 
                   "100 places to see before you die"]}
data2 = {"address":["Disney world", 
                    "An address in some city", 
                    "500 places to see before you die", 
                    "Neat location in some town"]}

df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)

# Compare all combinations and combine to a new dataframe
# This is 1:1 adopted from the answer linked above
cos_matrix = cos_sim(df.address.values, df2.address.values[:, None])
result_df = pd.concat((df2.address, 
                       pd.DataFrame(cos_matrix, 
                                    columns=df.address)), 
                       axis=1)

print(result_df)

This gives you all the values, and you can then use max to pick out the best one (a short example follows the output below):

                            address  An address in some...   Cool location in...  100 places to see...
0                      Disney world                    0.0                   0.0              0.000000
1           An address in some city                    1.0                   0.4              0.000000
2  500 places to see before you die                    0.0                   0.0              0.857143
3        Neat location in some town                    0.4                   0.8              0.000000
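
For example, assuming the result_df built above, the best score per internal address (and the external address that produced it) can be read straight off the matrix, since max and idxmax run column-wise:

# External addresses become the row index; the remaining columns are the internal addresses
scores = result_df.set_index('address')
best_scores = scores.max()      # best cosine per internal address
best_matches = scores.idxmax()  # external address that achieved that best score

print(best_scores)
print(best_matches)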

Credit goes to @SNygard here, as his comment set me off in the right direction (someone else will have to say whether the other two answers help — I went down this road and didn't look back).

I created a column in internalDataframe to keep the index available as a value, created a dictionary to hold the best similarity for each index (starting at 0), and then merged the two dataframes on postcode as suggested. That means I only have to loop over the merged dataframe once, updating the similarity dictionary where relevant.

It cut the time to work out the similarities from roughly 15 seconds to under 0.5 seconds for an externalDataFrame of 500 addresses, and I also ran it against an externalDataFrame of 6,000 addresses in 4.5 seconds — something I can't compare against the previous version, because that would normally have taken hours to process.

import pandas as pd

def getCosineSimilarities(internalDataframe, externalDataframe):

    # Keep the original row index as a column so it survives the merge
    internalDataframe['index'] = internalDataframe.index
    combinedDf = pd.merge(internalDataframe, externalDataframe, on='postcode')

    # Start every internal row off with a best similarity of 0
    similarities_dict = dict()
    for i in range(len(internalDataframe)):
        index = internalDataframe['index'].iloc[i]
        similarities_dict[index] = 0

    # Single pass over the merged dataframe: each row is an internal/external
    # pair that already shares a postcode
    for i in range(len(combinedDf)):
        vector1 = text_to_vector(clean_address(combinedDf['Address'].iloc[i]))
        vector2 = text_to_vector(clean_address(combinedDf['full address'].iloc[i]))
        cosine = get_cosine(vector1, vector2)
        index = combinedDf['index'].iloc[i]
        if cosine > similarities_dict[index]:
            similarities_dict[index] = cosine

    # Dict insertion order matches the original row order, so the list lines up
    similarities = []
    for key, value in similarities_dict.items():
        similarities.append(value)

    return similarities
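
For what it's worth, the same merge idea can be written more compactly with groupby. This is only a sketch (the function name is mine), again assuming the clean_address, text_to_vector and get_cosine helpers from the question:

def getCosineSimilaritiesGroupby(internalDataframe, externalDataframe):
    internalDataframe = internalDataframe.copy()
    internalDataframe['index'] = internalDataframe.index
    combined = pd.merge(internalDataframe, externalDataframe, on='postcode')

    # Cosine similarity for every internal/external pair sharing a postcode
    combined['cosine'] = [
        get_cosine(text_to_vector(clean_address(a)),
                   text_to_vector(clean_address(b)))
        for a, b in zip(combined['Address'], combined['full address'])
    ]

    # Best similarity per original row; rows whose postcode never matched get 0
    best = combined.groupby('index')['cosine'].max()
    return best.reindex(internalDataframe.index, fill_value=0).tolist()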
