Run a function on each element in a dataframe column of lists, part 2

Posted 2024-10-02 16:21:19


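This question follows on from "Run a function on each element in a dataframe column of lists". That question was answered: I have several functions that run on every element of a list column and produce a score for each one (func_results), like this: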

    col1         col2                         func_results
0   MAX          [MAX, amx, akd]              [('MAX',1.0),('amx',0.89),('akd',0.56)]
1   Sam          ['Sam','sammy','samsam']     [('Sam',1.0),('sammy',0.91), ('samsam',0.88)]
2   Larry        ['lar','lair','larrylamo']   [('lar',0.91),('larrylamo',0.91), ('lair',0.83)]

Executable code for the df above (you need to run all of the functions further down first):

import pandas as pd
from ast import literal_eval

data = {'col1':  ['MAX', 'Sam', 'Larry'],
        'col2': ["['MAX', 'amx', 'akd']", "['Sam','sammy','samsam']", "['lar','lair','larrylamo']"],
#         'func_results': ["[('MAX',1.0),('amx',0.89),('akd',0.56)]", "[('Sam',1.0),('sammy',0.91), ('samsam',0.88)]", "[('lar',0.91),('larrylamo',0.91), ('lair',0.83)]"]
        }

# df1 = pd.DataFrame (data, columns = ['col1','col2','func_results'])
df1 = pd.DataFrame (data, columns = ['col1','col2'])

df1['col2'] = df1.col2.apply(literal_eval)
df1['func_results'] = df1.agg(lambda x: get_top_matches(*x), axis=1)
df1
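
A note on the literal_eval step above: col2 is created as a column of strings that merely look like lists, so each cell has to be parsed before get_top_matches can iterate over it. A minimal sketch of what that conversion does to a single cell, using only the standard library (the variable names raw and parsed are illustrative):

```python
from ast import literal_eval

# One cell of col2 as it is stored in the data dict: a string, not a list.
raw = "['MAX', 'amx', 'akd']"

# literal_eval safely parses a Python literal and returns the real object.
parsed = literal_eval(raw)

print(type(raw).__name__, "->", type(parsed).__name__)
print(parsed)
```

Without this step, iterating over the cell would walk through the string one character at a time.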

Now I need to run the same set of functions when col2 does not contain lists but a single string per row, as in this df:

    col1              col2
0   abc co            AAP akj
1   kdj               fuj ddd
2   bac               ADO asd

Executable code for this df:

data = {'col1':  ['abc co', 'kdj', 'bac'],
        'col2': ['AAP akj', 'fuj ddd', 'ADO asd']
        }
df3 = pd.DataFrame (data, columns = ['col1','col2'])
df3

Functions:

# jaro version
import math
import re

def sort_token_alphabetically(word):
    token = re.split('[,. ]', word)
    sorted_token = sorted(token)
    return ' '.join(sorted_token)

def get_jaro_distance(first, second, winkler=True, winkler_ajustment=True,
                      scaling=0.1, sort_tokens=True):
    """
    :param first: word to calculate distance for
    :param second: word to calculate distance with
    :param winkler: same as winkler_ajustment
    :param winkler_ajustment: add an adjustment factor to the Jaro of the distance
    :param scaling: scaling factor for the Winkler adjustment
    :return: Jaro distance adjusted (or not)
    """
    if sort_tokens:
        first = sort_token_alphabetically(first)
        second = sort_token_alphabetically(second)

    if not first or not second:
        raise JaroDistanceException(
            "Cannot calculate distance from NoneType ({0}, {1})".format(
                first.__class__.__name__,
                second.__class__.__name__))

    jaro = _score(first, second)
    cl = min(len(_get_prefix(first, second)), 4)

    if all([winkler, winkler_ajustment]):  # 0.1 as scaling factor
        return round((jaro + (scaling * cl * (1.0 - jaro))) * 100.0) / 100.0

    return jaro

def _score(first, second):
    shorter, longer = first.lower(), second.lower()

    if len(first) > len(second):
        longer, shorter = shorter, longer

    m1 = _get_matching_characters(shorter, longer)
    m2 = _get_matching_characters(longer, shorter)

    if len(m1) == 0 or len(m2) == 0:
        return 0.0

    return (float(len(m1)) / len(shorter) +
            float(len(m2)) / len(longer) +
            float(len(m1) - _transpositions(m1, m2)) / len(m1)) / 3.0

def _get_diff_index(first, second):
    if first == second:
        return -1  # identical strings: there is no differing index

    if not first or not second:
        return 0

    max_len = min(len(first), len(second))
    for i in range(0, max_len):
        if not first[i] == second[i]:
            return i

    return max_len

def _get_prefix(first, second):
    if not first or not second:
        return ""

    index = _get_diff_index(first, second)
    if index == -1:
        return first

    elif index == 0:
        return ""

    else:
        return first[0:index]

def _get_matching_characters(first, second):
    common = []
    limit = math.floor(min(len(first), len(second)) / 2)

    for i, l in enumerate(first):
        left, right = int(max(0, i - limit)), int(
            min(i + limit + 1, len(second)))
        if l in second[left:right]:
            common.append(l)
            second = second[0:second.index(l)] + '*' + second[
                                                       second.index(l) + 1:]

    return ''.join(common)

def _transpositions(first, second):
    return math.floor(
        len([(f, s) for f, s in zip(first, second) if not f == s]) / 2.0)

def get_top_matches(reference, value_list, max_results=None):
    scores = []
    if not max_results:
        max_results = len(value_list)
    for val in value_list:
        score_sorted = get_jaro_distance(reference, val)
        score_unsorted = get_jaro_distance(reference, val, sort_tokens=False)
        scores.append((val, max(score_sorted, score_unsorted)))
    scores.sort(key=lambda x: x[1], reverse=True)

    return scores[:max_results]

class JaroDistanceException(Exception):
    def __init__(self, message):
        # Zero-argument super() resolves to Exception.__init__ as intended
        super().__init__(message)
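
For readers following the scoring logic, the Winkler adjustment in get_jaro_distance can be isolated as a small worked example. This is only a sketch: winkler_adjust is a hypothetical helper that is not part of the code above, and the Jaro score of 0.8 with a shared prefix of 2 characters is an assumed input, not a value computed from the dataframes.

```python
def winkler_adjust(jaro, prefix_len, scaling=0.1):
    """Apply the prefix bonus used in get_jaro_distance, rounded to 2 decimals.

    Hypothetical helper for illustration; the prefix length is capped at 4,
    mirroring `cl = min(len(_get_prefix(first, second)), 4)`.
    """
    cl = min(prefix_len, 4)
    return round((jaro + scaling * cl * (1.0 - jaro)) * 100.0) / 100.0


# Assumed inputs: raw Jaro score 0.8, first 2 characters in common.
adjusted = winkler_adjust(0.8, 2)
print(adjusted)
```

The bonus grows with the length of the common prefix but stops growing after four characters, so a longer shared prefix gives the same result as a prefix of exactly four.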

I just want this to run when col2 is not a list but a single string per row, and to produce a func_results column in the df.

Any ideas?


1 answer
User
#1 · Posted 2024-10-02 16:21:19

If col2 should be treated as a list containing a single string, wrap each cell of col2 in a list and then call get_top_matches, like this:

df3['col2'] = df3.col2.map(lambda x: [x])
df3['func_results'] = df3.agg(lambda x: get_top_matches(*x), axis=1)

Out[360]:
     col1       col2       func_results
0  abc co  [AAP akj]  [(AAP akj, 0.54)]
1     kdj  [fuj ddd]  [(fuj ddd, 0.49)]
2     bac  [ADO asd]  [(ADO asd, 0.49)]
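
If the same code has to handle both kinds of column (lists in df1, bare strings in df3), one option is to normalize each cell just before scoring instead of rewriting col2. The sketch below is an assumption-laden illustration: ensure_list and stand_in_top_matches are hypothetical names, and the stand-in scores with difflib's SequenceMatcher rather than the Jaro functions from the question, purely so the block runs on its own.

```python
from difflib import SequenceMatcher


def ensure_list(value):
    """Wrap a bare string in a list; return other iterables as a list."""
    if isinstance(value, str):
        return [value]
    return list(value)


def stand_in_top_matches(reference, value_list):
    """Hypothetical stand-in for get_top_matches (difflib ratio, not Jaro)."""
    scores = [
        (val, round(SequenceMatcher(None, reference.lower(), val.lower()).ratio(), 2))
        for val in value_list
    ]
    # Highest score first, as get_top_matches does
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores


# One row shaped like df3 (string) and one shaped like df1 (list).
rows = [("abc co", "AAP akj"), ("Sam", ["Sam", "sammy", "samsam"])]
results = [stand_in_top_matches(ref, ensure_list(cands)) for ref, cands in rows]
print(results)
```

With the real functions loaded, the same idea would presumably look like df3.apply(lambda row: get_top_matches(row['col1'], ensure_list(row['col2'])), axis=1), which leaves col2 itself unchanged.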
