Length of values does not match length of index during a for loop

Posted 2024-09-27 04:25:18


I have a dataset like this (df):

Name1 Name2 Score
John    NaN  NaN
Patty    NaN  NaN

where Name2 and Score are initialized to NaN. Some data, such as the following,

name2_list=[['Chris', 'Luke', 'Martin'], ['Martin']]
score_list=[[1,2,4],[3],[]]

is generated at each loop of a function. These two lists need to be added to the Name2 and Score columns of my df, so that:

Name1 Name2         Score
John    [Chris, Luke, Martin]  [1,2,4]
Patty    [Martin]  [3]

Then, because I want values rather than lists in Name2 and Score, I explode the dataset:

Name1 Name2  Score
John    Chris    1
John    Luke     2
John    Martin   4
Patty   Martin   3
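The explode step described above can be sketched as follows. This is only an illustration built from the example data in the question; it assumes pandas 1.3 or later, where DataFrame.explode accepts a list of columns, and it assumes the lists in Name2 and Score have the same length in every row.

```python
import pandas as pd

# Example data from the question: one list per row in Name2 and Score.
df = pd.DataFrame({
    "Name1": ["John", "Patty"],
    "Name2": [["Chris", "Luke", "Martin"], ["Martin"]],
    "Score": [[1, 2, 4], [3]],
})

# Explode both list columns together so that each (Name2, Score) pair
# gets its own row; ignore_index=True renumbers the rows 0..n-1.
exploded = df.explode(["Name2", "Score"], ignore_index=True)
print(exploded)
```

If the two lists in a row had different lengths, the multi-column explode would raise an error instead of silently misaligning names and scores.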

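My goal is for all the values in Name2 to also appear in Name1. However, as I mentioned, I have a function that works as follows: for each element that is in Name2 but not in Name1, it checks whether there are further values. The values it generates are similar to those in name2_list and score_list. For example, suppose that in the second iteration the function generates [Patty] with score 9 for Chris, [Martin] with score 1 for Luke, and [Laura] with score 3 for Martin. I then need to add these values to my original df again, so that (before exploding) I have: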

Name1 Name2  Score
John    Chris    1
John    Luke     2
John    Martin   4
Patty   Martin   3
Chris   Patty    9
Luke    Martin   1
Martin  Laura    3

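Only one value, Laura, is not in Name1, so I need to run the function again: if the output is already contained in Name1, my loop stops and I get the final dataset; otherwise, I need to rerun the function and see whether more loops are needed. To keep this example short, suppose that after running the function Laura has the value John with score 3. John is already in Name1, so I don't need to rerun the function.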
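The stopping condition described here ("every value in Name2 is already in Name1") can be sketched with plain sets. The lists below are copied from the table above; the variable names `pending` and `symmetric` are only illustrative. Note that `-` is a one-sided set difference, while `^` (which the question's code uses) is the symmetric difference and also reports names that appear only in Name1.

```python
# Columns of the table above, written out as plain lists.
name1 = ["John", "John", "John", "Patty", "Chris", "Luke", "Martin"]
name2 = ["Chris", "Luke", "Martin", "Martin", "Patty", "Martin", "Laura"]

# Names that still need a lookup: present in Name2 but not yet in Name1.
pending = set(name2) - set(name1)
print(pending)  # {'Laura'}

# Symmetric difference also includes 'John', who is in Name1 only,
# so it would not become empty just because every Name2 value was resolved.
symmetric = set(name2) ^ set(name1)
print(sorted(symmetric))  # ['John', 'Laura']
```

With the one-sided difference, the loop can stop as soon as `pending` is empty.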

What I have done so far is as follows:

name2_list, score_list = [],[]   # Initialize lists. These two lists need to store outputs from my function

name2 = df['Name2']              # Append new name2 to this list as I iterate
name1 = df['Name1']              # Append new name1 to this list as I iterate
distinct_name1 = set(name1)      # distinct name1. I need this to calculate the difference
diff = set(name2) ^ distinct_name1 # This calculates the difference. I need to iterate until this list is empty, i.e., when len(diff)=0


if df.Name2.isnull().all():  # this condition is to start the process. At the beginning I have only values in Name1. No values in Name2

    if len(diff)>0: # in the example the difference is 2 at the beginning, i.e., John and Patty; at the second round 3 (Chris, Luke, Martin); at the third round is only for Laura. There is no fourth round 
        for x in diff: # I run it first for John, then for Patty
            collected_data = fun(df, diff) # I will explain below what this function does and how it looks like
    
        df = df.apply(pd.Series.explode) # in this step I explode the dataset

        name2 = df['Name2']             # I am updating the list of values in Name2 to calculate the difference after each iteration. 
        name1 = df['Name1']             # I am updating the list of values in Name1 to calculate the difference after each iteration. 
        distinct_name1 = set(name1)    # calculate the new difference
        diff = set(filter(None, set(name2) ^ distinct_name1)) # calculate the new difference. Iterate until this is empty 

The error occurs when this step inside the function is reached:

---> 33 df['Name2'] = name2_list

saying:

ValueError: Length of values (6) does not match length of index (8).

(The values in parentheses may differ from those obtained with this example.)
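A minimal reproduction of this kind of error, independent of the function in the question, might look like the sketch below. The data is made up for illustration; the point is only that assigning a list to a column requires exactly one value per existing row.

```python
import pandas as pd

df = pd.DataFrame({"Name1": ["John", "Patty"]})  # 2 rows

message = None
try:
    # 3 values for a 2-row DataFrame: pandas cannot align them with the index.
    df["Name2"] = [["Chris"], ["Martin"], ["Laura"]]
except ValueError as exc:
    message = str(exc)
    print(message)
```

The same thing happens in the question once the DataFrame has been exploded: the number of rows grows, but the list produced by the function still has one entry per looked-up name.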

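My function currently doesn't care how many rows the DataFrame has; it creates new lists of varying lengths. I need to find a way to reconcile this. While debugging, I can confirm that the error comes from df['Name2'] = name2_list inside the function. I can print the list of new Name2 values correctly, but not the column. Perhaps one possible solution is to build df once, outside the for loop, but I need to explode df['Name2'] and build the lists that store the web results.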
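One possible way to reconcile the lengths, sketched here under the assumption that each round produces flat lists of equal length, is to build a separate DataFrame from the new results and append it with pd.concat rather than assigning the lists to an existing column. The values below are taken from the second iteration in the example; `new_rows` is a hypothetical name.

```python
import pandas as pd

# Rows already collected after the first round (exploded form).
df = pd.DataFrame({
    "Name1": ["John", "John", "John", "Patty"],
    "Name2": ["Chris", "Luke", "Martin", "Martin"],
    "Score": [1, 2, 4, 3],
})

# Results of the second round: one entry per newly looked-up name.
new_rows = pd.DataFrame({
    "Name1": ["Chris", "Luke", "Martin"],
    "Name2": ["Patty", "Martin", "Laura"],
    "Score": [9, 1, 3],
})

# Appending rows never has to match the length of the existing index.
df = pd.concat([df, new_rows], ignore_index=True)
print(df)
```

Calling pd.concat inside a loop copies the data each time, so for many rounds it is usually cheaper to collect plain rows first and build the DataFrame once at the end.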


1 answer
User
Answer #1 · posted 2024-09-27 04:25:18

I don't think pandas is a good fit for this kind of problem. If you are fine with an intermediate step in plain Python, you can do it like this:

import pandas as pd


def get_links(source_name):
    """Dummy function with data from OP.
    
    Note that it processes one name at a time instead of batch like in OP.
    """
    dummy_output = {
        'John': (
            ['Chris', 'Luke', 'Martin'],
            [1, 2, 4]
        ),
        'Patty': (
            ['Martin'],
            [9]
        ),
        'Chris': (
            ['Patty'],
            [9]
        ),
        'Luke': (
            ['Martin'],
            [1]
        ),
        'Martin': (
            ['Laura'],
            [3]
        ),
        'Laura': (
            ['John'],
            [3]
        )
    }
    target_names, scores = dummy_output.get(source_name, ([], []))

    return [
        {'name1': source_name, 'name2': target_name, 'score': score}
        for target_name, score in zip(target_names, scores)
    ]


todo = ['John', 'Patty']

seen = set(todo)
data = []

while todo:
    source_name = todo.pop(0)  # If order doesn't matter, use .pop() to take the last element instead (more efficient)
    # get new data
    new_data = get_links(source_name)
    data += new_data

    # add new names to queue if we haven't seen them before
    new_names = {row['name2'] for row in new_data}.difference(seen)
    seen.update(new_names)
    todo += list(new_names)

pd.DataFrame(data)

Output:

    name1   name2  score
0    John   Chris      1
1    John    Luke      2
2    John  Martin      4
3   Patty  Martin      9
4   Chris   Patty      9
5    Luke  Martin      1
6  Martin   Laura      3
7   Laura    John      3
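As a variation on the loop above, the queue can be a collections.deque, which keeps first-in-first-out order while making removal from the front cheap. This is only a sketch: `LINKS` and `crawl` are hypothetical names, and `LINKS` stands in for the lookup function using the same dummy data as the answer.

```python
from collections import deque

# Hypothetical stand-in for the lookup function, same dummy data as above.
LINKS = {
    "John": [("Chris", 1), ("Luke", 2), ("Martin", 4)],
    "Patty": [("Martin", 9)],
    "Chris": [("Patty", 9)],
    "Luke": [("Martin", 1)],
    "Martin": [("Laura", 3)],
    "Laura": [("John", 3)],
}


def crawl(start_names):
    """Follow links breadth-first until no unseen name remains."""
    queue = deque(start_names)
    seen = set(start_names)
    rows = []
    while queue:
        source = queue.popleft()  # O(1), unlike list.pop(0)
        for target, score in LINKS.get(source, []):
            rows.append({"name1": source, "name2": target, "score": score})
            if target not in seen:  # queue each name only once
                seen.add(target)
                queue.append(target)
    return rows


rows = crawl(["John", "Patty"])
# Every name that appears in name2 has also been processed as name1.
print({r["name2"] for r in rows} <= {r["name1"] for r in rows})  # True
```

Passing `rows` to pd.DataFrame gives the same kind of table as above, and because new names are queued in the order they are found, the row order does not depend on set iteration order.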
