Filtering duplicates from a list based on Levenshtein distance

Published 2024-09-30 16:24:29


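Suppose I have a list of JSON objects like the one in the example. Among the items whose title attribute is a duplicate (decided by scoring the Levenshtein distance against some threshold), I want to filter out the duplicates that do not have the lowest value of another attribute (sourceRank).

Here is my idea of how to do this; however, the indexing is broken. What is the most efficient way to do it?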

articles = [
    {'_source': {'title':'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank':4.0}},
    {'_source': {'title':'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank':1.0}},
    {'_source': {'title':'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank':2.0}},
    {'_source': {'title':'Apple Pay Apple Pay Launches in Belgium and Kazakhstan', 'sourceRank':1.0}},
    {'_source': {'title':'APPLE : Supreme Court weighs antitrust dispute over Apple App Store', 'sourceRank':3.0}},
]

print(len(articles))
print([a['_source']['title'] for a in articles])

def levenshtein_distance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]

indices = []
for i1, a1 in enumerate(articles):
    for i2, a2 in enumerate(articles):
        if levenshtein_distance(a1['_source']['title'], a2['_source']['title']) > .8:
            if a1['_source']['sourceRank'] > a2['_source']['sourceRank']:
                indices += [i1]
            else:
                indices += [i2]
articles = [i for j, i in enumerate(articles) if j not in indices]

print(len(articles))
print([a['_source']['title'] for a in articles])

1 answer
User
#1 · Posted 2024-09-30 16:24:29

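The gist of the question seems to be removing duplicate titles from the list while making sure that the titles that remain have the lowest sourceRank. I don't know how high sourceRank values can go, so I simply used 100000 as a sentinel value.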

#!/usr/bin/env python3

import itertools


articles = [
    {'_source': {'title':'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank':4.0}},
    {'_source': {'title':'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank':1.0}},
    {'_source': {'title':'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank':2.0}},
    {'_source': {'title':'Apple Pay Apple Pay Launches in Belgium and Kazakhstan', 'sourceRank':1.0}},
    {'_source': {'title':'APPLE : Supreme Court weighs antitrust dispute over Apple App Store', 'sourceRank':3.0}}
]

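def reducer(iter_):
    # Return the article with the lowest sourceRank in the group.
    max_rank = 100000  # sentinel; assumes every sourceRank is below this value
    retval = None
    for value in iter_:
        current_rank = value["_source"]["sourceRank"]
        if current_rank < max_rank:
            max_rank = current_rank
            retval = value
    return retval


def title_of(article):
    return article["_source"].get("title", "")


# groupby only merges consecutive items with equal keys, so sort by title first.
for title, group in itertools.groupby(sorted(articles, key=title_of), key=title_of):
    print(reducer(group))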
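
Note that grouping by title as above only catches titles that are exactly equal. If near-duplicates should also be merged, as the question's Levenshtein threshold suggests, something along the following lines might work. This is only a rough sketch under a few assumptions: the raw edit distance is converted into a similarity between 0 and 1 (the helper names similarity and dedupe and the default threshold of 0.8 are made up for illustration), and the grouping is greedy, so each article is compared only with the articles already kept.

```python
def levenshtein_distance(s1, s2):
    """Dynamic-programming edit distance, same approach as in the question."""
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s1) + 1))
    for i2, c2 in enumerate(s2):
        current = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                current.append(previous[i1])
            else:
                current.append(1 + min(previous[i1], previous[i1 + 1], current[-1]))
        previous = current
    return previous[-1]


def similarity(s1, s2):
    """Hypothetical helper: similarity in [0, 1], where 1.0 means identical."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(s1, s2) / longest


def dedupe(articles, threshold=0.8):
    """Hypothetical helper: keep the lowest-ranked article per group of similar titles."""
    kept = []
    # Visiting articles in ascending sourceRank order means the first
    # article seen in any group of similar titles is the one to keep.
    for article in sorted(articles, key=lambda a: a["_source"]["sourceRank"]):
        title = article["_source"]["title"]
        if all(similarity(title, k["_source"]["title"]) < threshold for k in kept):
            kept.append(article)
    return kept


articles = [
    {'_source': {'title': 'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank': 4.0}},
    {'_source': {'title': 'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank': 1.0}},
    {'_source': {'title': 'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank': 2.0}},
    {'_source': {'title': 'Apple Pay Apple Pay Launches in Belgium and Kazakhstan', 'sourceRank': 1.0}},
    {'_source': {'title': 'APPLE : Supreme Court weighs antitrust dispute over Apple App Store', 'sourceRank': 3.0}},
]

result = dedupe(articles)
for article in result:
    print(article["_source"]["sourceRank"], article["_source"]["title"])
```

The result comes back ordered by sourceRank rather than in the original list order. Every candidate is compared against every kept article, so the cost grows roughly quadratically with the number of distinct titles; for large lists a dedicated edit-distance library or some pre-bucketing would probably be needed.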
