How to group by a value, sort in descending order, then filter to the 0.1 quantile

Posted 2024-05-18 06:33:44


I have a DataFrame (p4p5_merge) that currently looks like this:

    SampleID      expr             Gene  Period                     tag  \
1    HSB666  3.663308  ENSG00000147996       5  HSB666|ENSG00000147996   
2    HSB666  3.663308  ENSG00000147996       5  HSB666|ENSG00000147996   
3    HSB666  3.663308  ENSG00000147996       5  HSB666|ENSG00000147996   
4    HSB666  3.663308  ENSG00000147996       5  HSB666|ENSG00000147996   
5    HSB651  3.207474  ENSG00000174749       4  HSB651|ENSG00000174749   
6    HSB651  3.207474  ENSG00000174749       4  HSB651|ENSG00000174749   
7    HSB651  3.207474  ENSG00000174749       4  HSB651|ENSG00000174749   
8    HSB651  3.207474  ENSG00000174749       4  HSB651|ENSG00000174749   
9    HSB651  3.207474  ENSG00000174749       4  HSB651|ENSG00000174749   
10   HSB195  0.214731  ENSG00000188157       4  HSB195|ENSG00000188157   
11   HSB195  0.214731  ENSG00000188157       4  HSB195|ENSG00000188157   
12   HSB195  0.214731  ENSG00000188157       4  HSB195|ENSG00000188157   
14   HSB152  5.062444  ENSG00000188157       4  HSB152|ENSG00000188157   
15   HSB627  2.062444  ENSG00000174749       4  HSB627|ENSG00000174749   
16   HSB627  2.062444  ENSG00000174749       4  HSB627|ENSG00000174749   
17   HSB627  2.062444  ENSG00000174749       4  HSB627|ENSG00000174749   
18   HSB627  2.062444  ENSG00000174749       4  HSB627|ENSG00000174749   
19   HSB627  2.062444  ENSG00000174749       4  HSB627|ENSG00000174749   
20   HSB627  2.062444  ENSG00000174749       4  HSB627|ENSG00000174749   
21   HSB627  2.062444  ENSG00000174749       4  HSB627|ENSG00000174749   
22   HSB627  2.062444  ENSG00000174749       4  HSB627|ENSG00000174749   
23   HSB627  2.062444  ENSG00000174749       4  HSB627|ENSG00000174749   

              Consequence  
1   upstream_gene_variant  
2   upstream_gene_variant  
3   upstream_gene_variant  
4   upstream_gene_variant  
5   upstream_gene_variant  
6   upstream_gene_variant  
7   upstream_gene_variant  
8   upstream_gene_variant  
9   upstream_gene_variant  
10  upstream_gene_variant  
11  upstream_gene_variant  
12  upstream_gene_variant  
14  upstream_gene_variant  
15  upstream_gene_variant  
16  upstream_gene_variant  
17  upstream_gene_variant  
18  upstream_gene_variant  
19  upstream_gene_variant  
20  upstream_gene_variant  
21  upstream_gene_variant  
22  upstream_gene_variant  
23         intron_variant

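I now want to group by Gene, sort by expr in descending order, and then filter the DataFrame down to the rows whose expr value falls in the bottom 10% of each Gene group (below the 10th percentile). So I do the following: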

1) Sort by expr in descending order (works)

p4p5_sort= p4p5_merge.sort_values(['expr', 'Gene'],
           ascending=[False, True]).reset_index(drop=True)

2) Group by Gene and filter to the bottom 10% of expr per gene (fails)

p4p5_bottom10  = (p4p5_sort[p4p5_sort.groupby('Gene')['expr'].
                 apply(lambda x: x < x.quantile(0.1))])

Step 1 works as expected, but when I run step 2, all I get is the following output:

sys:1: DtypeWarning: Columns (15,16,22,36,37,38,39) have mixed types. Specify dtype option on import or set low_memory=False.
Empty DataFrame
Columns: [SampleID, expr, Gene, Period, tag, Consequence]
Index: []

In case it helps, this is what I am trying to do, written in R with dplyr:

p4p5_bottom10 <- p4p5_merge %>% select(Gene, expr, SampleID, Period) %>%
    group_by(Gene) %>% 
    arrange(Gene, desc(expr)) %>%
    filter(expr < quantile(expr, 0.1))


1 Answer

Anonymous user

Answer #1 · Posted 2024-05-18 06:33:44

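You can apply quantile directly to the groupby, like this:
p4p5_bottom10 = pd.DataFrame(p4p5_sort.groupby(['Gene'])['expr'].quantile(0.1))

The pd.DataFrame() call is needed to convert the resulting Series into a DataFrame. Note that this gives one row per Gene holding that group's 10th-percentile expr value; it does not by itself filter the original rows.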
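
To get from per-gene quantiles to the filtered rows that the question asks for, one possible approach is to broadcast each group's 10th percentile back onto its rows with transform() and compare against it. The following is only a minimal sketch, not the answerer's method: the toy data (GENE_A, GENE_B, the S0..S24 sample IDs and all expr values) is made up for illustration and stands in for the real p4p5_merge.

```python
import pandas as pd

# Hypothetical toy data standing in for p4p5_merge (all values are made up).
p4p5_merge = pd.DataFrame({
    "SampleID": [f"S{i}" for i in range(25)],
    "Gene": ["GENE_A"] * 20 + ["GENE_B"] * 5,
    "expr": [float(v) for v in range(1, 21)] + [5.0] * 5,
})

# Per-row threshold: every row receives the 10th percentile of its own Gene
# group. transform() returns a Series aligned with the original index, so it
# can be compared with the expr column directly.
threshold = p4p5_merge.groupby("Gene")["expr"].transform(lambda s: s.quantile(0.1))

# Keep rows strictly below their group's threshold, then order them by Gene
# and descending expr, mirroring the dplyr pipeline in the question.
p4p5_bottom10 = (
    p4p5_merge[p4p5_merge["expr"] < threshold]
    .sort_values(["Gene", "expr"], ascending=[True, False])
    .reset_index(drop=True)
)

print(p4p5_bottom10)
```

In this toy data only the two lowest GENE_A rows survive. GENE_B contributes nothing because all of its expr values are identical, so the threshold equals every value and the strict `<` comparison is never true. Several Gene groups in the sample shown in the question also contain only repeated expr values, so the same effect may contribute to an empty result there; using `<=` instead would keep such rows, at the cost of keeping the whole group when all values tie.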
