填补数据帧中其他行中父级的空白

2024-10-06 12:19:18 发布

您现在位置:Python中文网/ 问答频道 /正文

假设我有一个显示速度限制的数据集。其理念是,每个地区或城市都可以应用自己的规则,或“继承”其父实体的规则

+-------------+---------------------------+---------------------+-----------+
| country     | region                    | city                | max_speed |
+-------------+---------------------------+---------------------+-----------+
| France      |                           |                     | 50        |
+-------------+---------------------------+---------------------+-----------+
| France      | Bretagne                  |                     | 70        |
+-------------+---------------------------+---------------------+-----------+
| France      | Bretagne                  | Saint-Grégoire      |           |
+-------------+---------------------------+---------------------+-----------+
| France      | Bretagne                  | Saint-Malo          | 30        |
+-------------+---------------------------+---------------------+-----------+
| France      | Île-de-France             |                     |           |
+-------------+---------------------------+---------------------+-----------+
| France      | Île-de-France             | Saint-Cloud         |           |
+-------------+---------------------------+---------------------+-----------+
| France      | Île-de-France             | Vélizy-Villacoublay | 50        |
+-------------+---------------------------+---------------------+-----------+
| Germany     |                           |                     | 70        |
+-------------+---------------------------+---------------------+-----------+
| Germany     | Bayern                    |                     |           |
+-------------+---------------------------+---------------------+-----------+
| Germany     | Bayern                    | Nürnberg            |           |
+-------------+---------------------------+---------------------+-----------+
| Netherlands |                           |                     | 90        |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | Provincie Gelderland      |                     |           |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | Provincie   Gelderland    | Harderwijk          |           |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland   |                     | 70        |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland   | Haarlem             |           |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland   | Hoorn               | 30        |
+-------------+---------------------------+---------------------+-----------+

每当max_speed值丢失时,应将其推断为父级的值。例如,圣格雷戈瓦的限速是布列塔涅的限速,而哈德维克纽伦堡则适用该国的规则(分别为90和70)

因此,考虑到这个DataFrame

data = {'country': ['France', 'France', 'France', 'France', 'France', 'France', 'France', 'Germany', 'Germany', 'Germany', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands'],
'region': [None, 'Bretagne', 'Bretagne', 'Bretagne', 'Île-de-France', 'Île-de-France', 'Île-de-France', None, 'Bayern', 'Bayern', None, 'Provincie Gelderland', 'Provincie Gelderland', 'Provincie Noord-Holland', 'Provincie Noord-Holland', 'Provincie Noord-Holland'],
'city': [None, None, 'Saint-Grégoire', 'Saint-Malo', None, 'Saint-Cloud', 'Vélizy-Villacoublay', None, None, 'Nürnberg', None, None, 'Harderwijk', None, 'Haarlem', 'Hoorn'],
'max_speed': [50, 70, None, 30, None, None, 50, 70, None, None, 90, None, None, 70, None, 30]}

speed_limits = pd.DataFrame(data)

如何填写max_speed中缺少的值以获得:

+-------------+-------------------------+---------------------+-----------+
| country     | region                  | city                | max_speed |
+-------------+-------------------------+---------------------+-----------+
| France      |                         |                     |        50 |
+-------------+-------------------------+---------------------+-----------+
| France      | Bretagne                |                     |        70 |
+-------------+-------------------------+---------------------+-----------+
| France      | Bretagne                | Saint-Grégoire      |        70 |
+-------------+-------------------------+---------------------+-----------+
| France      | Bretagne                | Saint-Malo          |        30 |
+-------------+-------------------------+---------------------+-----------+
| France      | Île-de-France           |                     |        50 |
+-------------+-------------------------+---------------------+-----------+
| France      | Île-de-France           | Saint-Cloud         |        50 |
+-------------+-------------------------+---------------------+-----------+
| France      | Île-de-France           | Vélizy-Villacoublay |        50 |
+-------------+-------------------------+---------------------+-----------+
| Germany     |                         |                     |        70 |
+-------------+-------------------------+---------------------+-----------+
| Germany     | Bayern                  |                     |        70 |
+-------------+-------------------------+---------------------+-----------+
| Germany     | Bayern                  | Nürnberg            |        70 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands |                         |                     |        90 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | Provincie Gelderland    |                     |        90 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | Provincie Gelderland    | Harderwijk          |        90 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland |                     |        70 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland | Haarlem             |        70 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland | Hoorn               |        30 |
+-------------+-------------------------+---------------------+-----------+

我一直在尝试创建一个函数来应用于max_speed==np.NaN的每一行,检索它的父行(在确定缺少的值是否适用于某个地区或城市之后),然后返回它的max_speed值,但是,除了在这方面不是很成功之外,我甚至不确定这是最明智的方法

有什么想法吗


Tags: lenonedemaxspeedfrancegermanyholland
2条回答

利用ffill()完成工作。首先垂直宣传国家和地区限速,并设置仅限城市的限速栏。然后从左向右传播速度限制,以获得继承的最大速度限制

创建工作数据框:

wf = speed_limits.copy()

复制并宣传国家/地区的车速限制:

wf['cntry_spd'] = pd.Series(np.where(wf['region'], np.nan, wf['max_speed'])).ffill()

复制区域速度限制并在区域内传播:

wf['reg_spd'] = np.where(~wf['region'].isna() & wf['city'].isna(), wf['max_speed'], np.nan)
wf['reg_spd'] = wf.groupby(['country','region'])['reg_spd'].ffill() 

创建仅限城市的速度限制列:

wf['city_spd'] = np.where(~wf['city'].isna(), wf['max_speed'], np.nan)

通过在cntry_spdreg_spdcity_spd列上从左到右向前填充NA,在speed_limits DF上设置max_speed列,继承尚未设置的速度限制:

speed_limits['max_speed'] = wf[['cntry_spd','reg_spd','city_spd']].ffill(axis=1)['city_spd']

结果:

        country                   region                 city  max_speed
0        France                     None                 None       50.0
1        France                 Bretagne                 None       70.0
2        France                 Bretagne       Saint-Grégoire       70.0
3        France                 Bretagne           Saint-Malo       30.0
4        France            Île-de-France                 None       50.0
5        France            Île-de-France          Saint-Cloud       50.0
6        France            Île-de-France  Vélizy-Villacoublay       50.0
7       Germany                     None                 None       70.0
8       Germany                   Bayern                 None       70.0
9       Germany                   Bayern             Nürnberg       70.0
10  Netherlands                     None                 None       90.0
11  Netherlands     Provincie Gelderland                 None       90.0
12  Netherlands     Provincie Gelderland           Harderwijk       90.0
13  Netherlands  Provincie Noord-Holland                 None       70.0
14  Netherlands  Provincie Noord-Holland              Haarlem       70.0
15  Netherlands  Provincie Noord-Holland                Hoorn       30.0

这是对我的审判。 有文档记录和一些print()用于调试。 query()语句基于pandas/NumPy使用np.nan!=np.nan,并且对待任何人都不像np.nan。 请参阅本页上的注释/警告之一https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html

import pandas  as pd
data = {'country': ['France', 'France', 'France', 'France', 'France', 'France', 'France', 'Germany', 'Germany', 'Germany', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands'],
'region': [None, 'Bretagne', 'Bretagne', 'Bretagne', 'Île-de-France', 'Île-de-France', 'Île-de-France', None, 'Bayern', 'Bayern', None, 'Provincie Gelderland', 'Provincie Gelderland', 'Provincie Noord-Holland', 'Provincie Noord-Holland', 'Provincie Noord-Holland'],
'city': [None, None, 'Saint-Grégoire', 'Saint-Malo', None, 'Saint-Cloud', 'Vélizy-Villacoublay', None, None, 'Nürnberg', None, None, 'Harderwijk', None, 'Haarlem', 'Hoorn'],
'max_speed': [50, 70, None, 30, None, None, 50, 70, None, None, 90, None, None, 70, None, 30]}

df = pd.DataFrame(data)

#1)split the initial df in multiple dfs, using df.query():
# - countries - we assume that all of them have max_speed
# - regions - two categories
#   - max speed set
#   - max speed unset
# - cities - to categories 
#   - max speed set
#   - max speed unset
#
#2) use merge/join to update the max speed for the categories with max speed unset
#
#3) use append to cncatenate all sets, this is final result 
# replaced None/nan wit empty string for nice printing

# those will have speed set
# city compare is superflue, but for consistency
df_countries_only = df.query("(region != region) and (city != city) ")
print(df_countries_only)

# fix the regions
df_regions_to_fix = df.query("(city != city) and (max_speed != max_speed) and (region == region)")
df_regions_ok = df.query("(city != city) and (max_speed == max_speed) and (region == region)")

df_regions_speed = pd.merge(df_countries_only.drop(['region', 'city'], axis=1), 
        df_regions_to_fix.drop(['max_speed'], axis=1), how="inner", on=["country"])
df_regions_speed = df_regions_speed.append(df_regions_ok)
print(df_regions_speed)

df_cities_to_fix = df.query("(city == city) and (max_speed != max_speed)")
df_cities_ok = df.query("(city == city) and (max_speed == max_speed)")

df_cities_speed = pd.merge(df_regions_speed.drop(['city'], axis=1), 
        df_cities_to_fix.drop(['max_speed'], axis=1), how="inner", on=["country", "region"])

print(df_cities_speed)

# now rebuild final df
df_all_data = df_cities_speed.append(df_cities_ok).append(df_regions_speed).append(df_countries_only)
print("\n\n")
print(df_all_data.sort_values(by=['country', 'region', 'city']).fillna("")[['country', 'region', 'city', 'max_speed']])

相关问题 更多 >