将长格式父子数据帧转换为具有完整层次结构的宽格式

2024-09-30 00:37:11 发布

您现在位置:Python中文网/ 问答频道 /正文

我有以下长格式数据框,其中两列包含嵌套的父子层次结构:

parent,child,child_level
d     ,sf   ,x
d     ,st   ,x
d     ,s0   ,x
sf    ,gr4  ,l
sf    ,gr3  ,l
st    ,grd  ,l
st    ,gr9  ,l
s0    ,n7   ,l
s0    ,b12  ,l
grd   ,nyvc ,b
gr3   ,trub2,b
b12   ,ngb2 ,b
b12   ,ggb8 ,b
nyvc  ,xtr2d,i
trub2 ,xtuD ,i
gr4   ,stab3,i
gr9   ,ubc8 ,i
n7    ,ubc2 ,i
ggb8  ,drik2,i

我的目标是将父列和子列取消嵌套为宽格式。列名应与“子级”列中的相应级别相对应:

 ,x ,l  ,b    ,i
d,sf,gr4,NA   ,stab3
d,sf,gr3,trub2,xtuD
d,st,grd,nyvc ,xtr2d
d,st,gr9,NA   ,ubc8
d,s0,n7 ,NA   ,ubc2
d,s0,b12,ngb2 ,NA
d,s0,b12,ggb8 ,drik2

没有表示顺序或级别等级的其他信息,这些信息仅从子列和父列中出现。此外,并非输出表的所有行都将包含所有层次结构级别-这些级别应使用NA填充

EDIT1(澄清):R或Python解决方案可以工作,因此我提供了一个通用输入表(例如,作为csv阅读)。另外,下表是手工制作的——我不知道如何通过编程实现

EDIT2:行没有顺序,即子级可以是任意顺序,因此必须采用某种递归方法


Tags: 顺序sf级别stnab12s0grd
2条回答

如果没有更多的信息,我通常会如何处理这个问题。我展示了一个tidyverse解决方案,但这当然也可以在baser中实现

data <- structure(list(child = structure(c(10L, 11L, 9L, 4L, 5L, 5L, 
6L, 1L, 8L, 13L, 7L, 3L, 16L, 17L, 12L, 15L, 14L, 2L), .Label = c("b12", 
"drik2", "ggb8", "gr4", "grd", "n7", "ngb2", "nyvc", "s0", "sf", 
"st", "stab3", "trub2", "ubc2", "ubc8", "xtr2d", "xtuD"), class = "factor"), 
    parent = structure(c(2L, 2L, 2L, 11L, 11L, 12L, 10L, 10L, 
    7L, 4L, 1L, 1L, 9L, 13L, 5L, 6L, 8L, 3L), .Label = c("b12", 
    "d", "ggb8", "gr3", "gr4", "gr9", "grd", "n7", "nyvc", "s0", 
    "sf", "st", "trub2"), class = "factor"), child_level = structure(c(4L, 
    4L, 4L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
    2L, 2L), .Label = c("b", "i", "t", "x"), class = "factor")), class = "data.frame", row.names = c(NA, 
-18L))

library(tidyverse)

pivot <- data %>% mutate(unique = rownames(data)) %>% pivot_wider(id_cols = unique, names_from = child_level, values_from = child) %>% select(!unique)

输入如下所示:

# > data
#    child parent child_level
# 1     sf      d           x
# 2     st      d           x
# 3     s0      d           x
# 4    gr4     sf           t
# 5    grd     sf           t
# 6    grd     st           t
# 7     n7     s0           t
# 8    b12     s0           t
# 9   nyvc    grd           b
# 10 trub2    gr3           b
# 11  ngb2    b12           b
# 12  ggb8    b12           b
# 13 xtr2d   nyvc           i
# 14  xtuD  trub2           i
# 15 stab3    gr4           i
# 16  ubc8    gr9           i
# 17  ubc2     n7           i
# 18 drik2   ggb8           i

它将输出以下内容:

# > pivot
# # A tibble: 18 x 4
#    x     t     b     i    
#    <fct> <fct> <fct> <fct>
#  1 sf    NA    NA    NA   
#  2 st    NA    NA    NA   
#  3 s0    NA    NA    NA   
#  4 NA    gr4   NA    NA   
#  5 NA    grd   NA    NA   
#  6 NA    grd   NA    NA   
#  7 NA    n7    NA    NA   
#  8 NA    b12   NA    NA   
#  9 NA    NA    nyvc  NA   
# 10 NA    NA    trub2 NA   
# 11 NA    NA    ngb2  NA   
# 12 NA    NA    ggb8  NA   
# 13 NA    NA    NA    xtr2d
# 14 NA    NA    NA    xtuD 
# 15 NA    NA    NA    stab3
# 16 NA    NA    NA    ubc8 
# 17 NA    NA    NA    ubc2 
# 18 NA    NA    NA    drik2

更新的答案

我现在有一种方法,只要你知道后代的顺序,也就是我们需要对child_level向量排序的顺序,它就可以工作。如果你不知道,我们应该能够计算订单,但现在我假设它是已知的

该方法基于:

  1. 首先还要计算parent_level
  2. 通过parent_levelchild_level嵌套数据帧
  3. 一种自定义函数,可与purrr::accumulate2purrr::reduce2一起使用,该函数使用left_join连接行中的所有data.frames,如果再次连接现有列,则将相应的列合并为一个列

在应用此自定义join_merge函数之前:

  1. 嵌套的data.frames需要按照后代的顺序进行排序(child_level
  2. 列名parentchild被替换为parent_levelchild_level的值
  3. 最后parent_levelchild_level组合成一个名为arg_ls的向量,该向量作为.y参数传递给accumulate2(或者reduce2

我希望这对你的真实数据有效

library(tidyverse)

dat <- tribble(
  ~ parent, ~child, ~child_level,
  "d"     ,"sf"   ,"x",
  "d"     ,"st"   ,"x",
  "d"     ,"s0"   ,"x",
  "sf"    ,"gr4"  ,"l",
  "sf"    ,"gr3"  ,"l",
  "st"    ,"grd"  ,"l",
  "st"    ,"gr9"  ,"l",
  "s0"    ,"n7"   ,"l",
  "s0"    ,"b12"  ,"l",
  "grd"   ,"nyvc" ,"b",
  "gr3"   ,"trub2","b",
  "b12"   ,"ngb2" ,"b",
  "b12"   ,"ggb8" ,"b",
  "nyvc"  ,"xtr2d","i",
  "trub2" ,"xtuD" ,"i",
  "gr4"   ,"stab3","i",
  "gr9"   ,"ubc8" ,"i",
  "n7"    ,"ubc2" ,"i",
  "ggb8"  ,"drik2","i"
)
# in a first step we calculate the `parent_level`
dat <- dat %>% 
  left_join(., select(., -parent), by = c("parent" = "child")) %>% 
  rename("child_level" = "child_level.x",
         "parent_level" = "child_level.y") %>% 
  mutate(parent_level = replace_na(parent_level, "o"))

# we need this function to work with accumulate2 or reduce2
join_merge <- function(df1, df2, .rename) {
  res <- left_join(df1, df2, by = .rename[1]) 

  # in case an existing column is joined again, we need to merge it together
  if(length(colnames(select(res, starts_with(all_of(.rename[2]))))) > 1) {
    res <- mutate(res, across(matches(paste0(.rename[2], ".x")), 
                              ~ if_else(is.na(.x), eval(sym(paste0(.rename[2], ".y"))), .x))) %>% 
      select(-all_of(paste0(.rename[2], ".y"))) %>% 
      rename(!! .rename[2] := paste0(.rename[2], ".x"))
  }
  res
}


# accumulate is used to show how the final result is buildt
dat %>% 
  nest_by(child_level, parent_level) %>% 
  arrange(child_level == "i", desc(child_level)) %>% 
  mutate(arg_ls = list(c(parent_level, child_level))) %>% 
  mutate(data = list(rename_with(data,
                                 ~ paste0(child_level),
                                 "child") %>%
           rename_with(~ paste0(parent_level),
                       "parent"))) %>%
  ungroup %>% 
  mutate(dat_acc = accumulate2(data,
                              arg_ls[-1],
                              join_merge)) %>% 
  pull(dat_acc)
#> [[1]]
#> # A tibble: 3 x 2
#>   o     x    
#>   <chr> <chr>
#> 1 d     sf   
#> 2 d     st   
#> 3 d     s0   
#> 
#> [[2]]
#> # A tibble: 6 x 3
#>   o     x     l    
#>   <chr> <chr> <chr>
#> 1 d     sf    gr4  
#> 2 d     sf    gr3  
#> 3 d     st    grd  
#> 4 d     st    gr9  
#> 5 d     s0    n7   
#> 6 d     s0    b12  
#> 
#> [[3]]
#> # A tibble: 7 x 4
#>   o     x     l     b    
#>   <chr> <chr> <chr> <chr>
#> 1 d     sf    gr4   <NA> 
#> 2 d     sf    gr3   trub2
#> 3 d     st    grd   nyvc 
#> 4 d     st    gr9   <NA> 
#> 5 d     s0    n7    <NA> 
#> 6 d     s0    b12   ngb2 
#> 7 d     s0    b12   ggb8 
#> 
#> [[4]]
#> # A tibble: 7 x 5
#>   o     x     l     b     i    
#>   <chr> <chr> <chr> <chr> <chr>
#> 1 d     sf    gr4   <NA>  <NA> 
#> 2 d     sf    gr3   trub2 xtuD 
#> 3 d     st    grd   nyvc  xtr2d
#> 4 d     st    gr9   <NA>  <NA> 
#> 5 d     s0    n7    <NA>  <NA> 
#> 6 d     s0    b12   ngb2  <NA> 
#> 7 d     s0    b12   ggb8  drik2
#> 
#> [[5]]
#> # A tibble: 7 x 5
#>   o     x     l     b     i    
#>   <chr> <chr> <chr> <chr> <chr>
#> 1 d     sf    gr4   <NA>  stab3
#> 2 d     sf    gr3   trub2 xtuD 
#> 3 d     st    grd   nyvc  xtr2d
#> 4 d     st    gr9   <NA>  ubc8 
#> 5 d     s0    n7    <NA>  ubc2 
#> 6 d     s0    b12   ngb2  <NA> 
#> 7 d     s0    b12   ggb8  drik2

reprex package(v0.3.0)于2020-12-22创建

相关问题 更多 >

    热门问题