Length of values does not match length of index during a for loop

name2_list, score_list = [], []  # Initialize lists. These two lists need to store outputs from my function
name2 = df['name2']  # Append new name2 to this list as I iterate
name1 = df['name1']  # Append new name1 to this list as I iterate
distinct_name1 = set(name1)  # distinct name1. I need this to calculate the difference
# This calculates the difference. I need to iterate until this list is empty, i.e., when len(diff)=0
diff = set(name2) ^ distinct_name1

# this condition is to start the process. At the beginning I have only values in Name1. No values in Name2
if df.Name2.isnull().all():
    # in the example the difference is 2 at the beginning, i.e., John and Patty;
    # at the second round 3 (Chris, Luke, Martin); at the third round is only for Laura.
    # There is no fourth round
    if len(diff) > 0:
        for x in diff:  # I run it first for John, then for Patty
            collected_data = fun(df, diff)  # I will explain below what this function does and how it looks like
            df = df.apply(pd.Series.explode)  # in this step I explode the dataset
            name2 = df['Name2']  # I am updating the list of values in Name2 to calculate the difference after each iteration.
            name1 = df['Name1']  # I am updating the list of values in Name1 to calculate the difference after each iteration.
            distinct_name1 = set(name1)  # calculate the new difference
            diff = filter(None, (set(name2) ^ distinct_name1))  # calculate the new difference. Iterate until this is empty

1 answer

User

#1 · Posted on 2024-09-27 04:25:18

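I don't think pandas is a good fit for this kind of problem. If you don't mind doing the intermediate steps in plain Python, you can do it like this: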

import pandas as pd


def get_links(source_name):
    """Dummy function with data from OP.
    
    Note that it processes one name at a time instead of batch like in OP.
    """
    dummy_output = {
        'John': (
            ['Chris', 'Luke', 'Martin'],
            [1, 2, 4]
        ),
        'Patty': (
            ['Martin'],
            [9]
        ),
        'Chris': (
            ['Patty'],
            [9]
        ),
        'Luke': (
            ['Martin'],
            [1]
        ),
        'Martin': (
            ['Laura'],
            [3]
        ),
        'Laura': (
            ['John'],
            [3]
        )
    }
    target_names, scores = dummy_output.get(source_name, ([], []))

    return [
        {'name1': source_name, 'name2': target_name, 'score': score}
        for target_name, score in zip(target_names, scores)
    ]


todo = ['John', 'Patty']

seen = set(todo)
data = []

while todo:
    source_name = todo.pop(0)  # FIFO order; if order doesn't matter, .pop() takes the last element instead (more efficient)
    # get new data
    new_data = get_links(source_name)
    data += new_data

    # add new names to queue if we haven't seen them before
    # (walk the rows in order so the queue order, and thus the output, is reproducible)
    for row in new_data:
        if row['name2'] not in seen:
            seen.add(row['name2'])
            todo.append(row['name2'])

pd.DataFrame(data)

Output:

    name1   name2  score
0    John   Chris      1
1    John    Luke      2
2    John  Martin      4
3   Patty  Martin      9
4   Chris   Patty      9
5    Luke  Martin      1
6  Martin   Laura      3
7   Laura    John      3
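
The comment on `todo.pop(0)` hints that popping from the front of a list is not the cheapest operation. As a minimal sketch of the same breadth-first walk, assuming the links can be looked up from a plain dict, the queue could be a `collections.deque` instead. The names `LINKS` and `crawl` are hypothetical and only stand in for the dummy `get_links` above; the data is the same dummy data.

```python
from collections import deque

# Hypothetical stand-in for the lookup: name -> list of (linked name, score).
LINKS = {
    'John': [('Chris', 1), ('Luke', 2), ('Martin', 4)],
    'Patty': [('Martin', 9)],
    'Chris': [('Patty', 9)],
    'Luke': [('Martin', 1)],
    'Martin': [('Laura', 3)],
    'Laura': [('John', 3)],
}


def crawl(start_names, links):
    """Breadth-first walk that collects one row per (name1, name2, score) link."""
    todo = deque(start_names)
    seen = set(start_names)
    rows = []
    while todo:
        source = todo.popleft()  # O(1), whereas list.pop(0) shifts every element
        for target, score in links.get(source, []):
            rows.append({'name1': source, 'name2': target, 'score': score})
            # Queue each name only the first time it shows up.
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return rows


rows = crawl(['John', 'Patty'], LINKS)
```

Passing `rows` to `pd.DataFrame(rows)` should give the same kind of table as above. Because every name is queued at most once, the loop ends even though the links form a cycle (Laura points back to John).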
