The most efficient way to modify values in a large DataFrame

Published 2024-07-04 06:05:04


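Overview: I am working with DataFrames of census information. Although they have only two columns, they are several hundred thousand rows long. One column is the census block ID number, and the other is a "place" value that is unique to the city in which that census block ID is located.

Sample data: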

      BLOCKID         PLACEFP
0     60014001001000  53000
1     60014001001001  53000
...
5844  60014099004021  53000
5845  60014100001000
5846  60014100001001
5847  60014100001002  53000

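Problem: As shown above, several place values are blank even though the corresponding rows have a census block ID. I have found that in some cases the census blocks missing a place value lie within the same city as the surrounding blocks that do have one, especially when the bookend place values are the same. For indexes 5844 to 5847 above, the two blank blocks are in the same general area as the surrounding blocks but appear to be missing the place value.

Goal: I want to be able to go through this DataFrame, find these instances, and fill in the missing place values based on the place value before the gap and the place value immediately after it.

Current state and obstacle: I wrote a loop that goes through the DataFrame to correct these issues, as shown below.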

import pandas

# Blank place values are stored as empty strings
current_state_blockid_df = pandas.DataFrame({'BLOCKID':[60014099004021,60014100001000,60014100001001,60014100001002,60014301012019,60014301013000,60014301013001,60014301013002,60014301013003,60014301013004,60014301013005,60014301013006], 
'PLACEFP': [53000,'','',53000,11964,'','','','','','',11964]})

for i in current_state_blockid_df.index:
    if current_state_blockid_df.loc[i, 'PLACEFP'] == '':
        #Get value before blank
        prior_place_fp = current_state_blockid_df.loc[i - 1, 'PLACEFP']
        next_place_fp = ''
        _n = 1

        # Find the end of the blank section
        while next_place_fp == '':
            next_place_fp = current_state_blockid_df.loc[i + _n, 'PLACEFP']
            if next_place_fp == '':
                _n += 1

        # if the blanks could likely be in the same city, assign them the city's place value
        if prior_place_fp == next_place_fp:
            # the blank run covers rows i .. i + _n - 1
            for _i in range(_n):
                current_state_blockid_df.loc[i + _i, 'PLACEFP'] = prior_place_fp

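However, as expected, it is very slow when processing hundreds of thousands of rows of data. I have considered using a ThreadPool executor to split up the work, but I haven't quite figured out the logic I would use to do that. One way to speed it up slightly would be to drop the check for where the gap ends and simply fill the gap with the place value that precedes the blanks. While that may end up being what I do, it may still be too slow, and ideally I would like it to fill only when the values before and after match, which eliminates the possibility of assigning a block incorrectly. Any other suggestions for achieving this quickly would be greatly appreciated.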


2 answers

You can use shift to help speed up the process. However, this does not handle the case of multiple consecutive blanks.

# Value from the row above and the row below each position
df['PLACEFP_PRIOR'] = df['PLACEFP'].shift(1)
df['PLACEFP_AFTER'] = df['PLACEFP'].shift(-1)

# isnull() assumes blanks are stored as NaN rather than as empty strings
criteria1 = df['PLACEFP'].isnull()
criteria2 = df['PLACEFP_PRIOR'] == df['PLACEFP_AFTER']
df.loc[criteria1 & criteria2, 'PLACEFP'] = df.loc[criteria1 & criteria2, 'PLACEFP_PRIOR']
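
The shift approach above only looks one row in each direction. One way to extend the same idea to runs of several blanks is to compare a forward fill with a backward fill. The following is a minimal sketch, not the answerer's code: the function name fill_matching_gaps is hypothetical, and it assumes the place codes are numeric so that blanks (empty strings, whitespace, or NaN) can be coerced to NaN with pd.to_numeric.

```python
import pandas as pd

def fill_matching_gaps(df, col="PLACEFP"):
    """Fill runs of blanks only when the values on both sides of the run match.

    Hypothetical helper; assumes the codes in `col` are whole numbers.
    """
    out = df.copy()
    # Anything that is not a number ('' or whitespace) becomes NaN.
    values = pd.to_numeric(out[col], errors="coerce")
    prior = values.ffill()      # nearest non-missing value above each row
    following = values.bfill()  # nearest non-missing value below each row
    # NaN == NaN is False, so gaps touching either end of the frame stay blank.
    fillable = values.isna() & (prior == following)
    # Nullable Int64 keeps whole-number codes and marks remaining gaps as <NA>.
    out[col] = values.mask(fillable, prior).astype("Int64")
    return out
```

Calling fill_matching_gaps(current_state_blockid_df) returns a new frame and leaves the input untouched. Gaps whose bookend values differ stay missing, which matches the requirement in the question that a block should not be assigned when the surrounding values disagree.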

If you ultimately need to iterate over the DataFrame, use df.itertuples. You can access the column values in a row with dot notation (row.column_name).

for row in df.itertuples():
    # row.Index is the index label; columns are attributes, e.g. row.PLACEFP
    ...  # logic goes here
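
As a small illustration of the dot notation, the sketch below collects the rows whose place value is blank. The two-row frame and the blank_rows list are made up for the example, and blanks are assumed to be empty strings as in the question.

```python
import pandas as pd

# Illustrative two-row frame; blanks are stored as empty strings.
df = pd.DataFrame({
    "BLOCKID": [60014099004021, 60014100001000],
    "PLACEFP": [53000, ""],
})

blank_rows = []
for row in df.itertuples():
    # Each row is a named tuple: row.Index is the index label,
    # and every column is available as an attribute.
    if row.PLACEFP == "":
        blank_rows.append((row.Index, row.BLOCKID))

print(blank_rows)
```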

Using the DataFrame as defined in the question:

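def fix_df(current_state_blockid_df):
    # Fills each run of blanks with the value just before the run
    # (the value after the run is not checked).
    # .copy() so the column assignment further down works on an independent frame
    df_with_blanks = current_state_blockid_df[current_state_blockid_df['PLACEFP'] == ''].copy()
    df_no_blanks = current_state_blockid_df[current_state_blockid_df['PLACEFP'] != '']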
    sections = {}
    last_i = 0
    grouping = []

    for i in df_with_blanks.index:
        if i - 1 == last_i:
            grouping.append(i)
            last_i = i
        else:
            last_i = i
            if len(grouping) > 0:
                sections[min(grouping)] = {'indexes': grouping}

            grouping = []
            grouping.append(i)

    if len(grouping) > 0:
        sections[min(grouping)] = {'indexes': grouping}

    for i in sections.keys():
        sections[i]['place'] = current_state_blockid_df.loc[i-1, 'PLACEFP']

    l = []

    for i in sections:
        for x in sections[i]['indexes']:
            l.append(sections[i]['place'])

    df_with_blanks['PLACEFP'] = l
    final_df = pandas.concat([df_with_blanks, df_no_blanks]).sort_index(axis=0)
    return final_df

df = fix_df(current_state_blockid_df)
print(df)

Output:

     BLOCKID PLACEFP
0   60014099004021   53000
1   60014100001000   53000
2   60014100001001   53000
3   60014100001002   53000
4   60014301012019   11964
5   60014301013000   11964
6   60014301013001   11964
7   60014301013002   11964
8   60014301013003   11964
9   60014301013004   11964
10  60014301013005   11964
11  60014301013006   11964
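
Because fix_df fills every blank from the preceding value without looking at the value that follows, a similar result can probably be had with a forward fill. This is a sketch under assumptions rather than a drop-in replacement: fill_from_previous is a hypothetical name, and the place codes are assumed to be whole numbers so that blanks can be coerced to NaN.

```python
import pandas as pd

def fill_from_previous(df, col="PLACEFP"):
    """Fill each blank with the nearest non-blank value above it.

    Hypothetical helper; assumes the codes in `col` are whole numbers.
    """
    out = df.copy()
    # Empty strings (and any other non-numeric entries) become NaN.
    values = pd.to_numeric(out[col], errors="coerce")
    # Forward fill, then store as nullable integers; leading blanks stay <NA>.
    out[col] = values.ffill().astype("Int64")
    return out
```

Like fix_df, this can assign a block to the wrong place when the values on either side of a gap differ.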
