SQL(或python)从类似行中选择一次值

2024-09-30 04:27:07 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个Oracle表中的数据,我需要根据由于分组而重复的唯一值的数量从中选择某些行。我的数据是这样的

| LINE | BUCKET | TERM | COURSE     |
|------|--------|------|------------|
| 1001 | 1      | FA18 | COURSE 101 |
| 1001 | 1      | SP19 | COURSE 102 |
| 1001 | 1      | SP19 | COURSE 103 |
| 1001 | 1      | FA19 | COURSE 104 |
| 1001 | 2      | FA18 | COURSE 101 |
| 1001 | 2      | SP19 | COURSE 102 |
| 1001 | 2      | SP19 | COURSE 103 |
| 1001 | 2      | FA19 | COURSE 104 |
| 2001 | 1      | FA18 | COURSE 201 |
| 2001 | 1      | SP19 | COURSE 202 |
| 2001 | 1      | FA20 | COURSE 203 |
| 2001 | 2      | FA18 | COURSE 201 |
| 2001 | 2      | SP19 | COURSE 202 |
| 2001 | 2      | FA20 | COURSE 203 |
| 2001 | 3      | FA18 | COURSE 201 |
| 2001 | 3      | SP19 | COURSE 202 |
| 2001 | 3      | FA20 | COURSE 203 |

数据中有两行。第一行(1001)有2个不同的桶和4个不同的课程。第二条线(2001年)有3个不同的桶和3个不同的球场。我需要 为一行中的每个课程只选择一行,并选择尽可能多的桶。数学很简单:

  • 第1001行:4(课程)/2(桶)=每个桶2个课程
  • 第2001行:3个(课程)/3个(桶)=每个桶1个课程

如何在多个桶中每行选择一个课程,使其看起来像这样

| LINE | BUCKET | TERM | COURSE     |
|------|--------|------|------------|
| 1001 | 1      | FA18 | COURSE 101 |
| 1001 | 1      | SP19 | COURSE 102 |
| 1001 | 2      | SP19 | COURSE 103 |
| 1001 | 2      | FA19 | COURSE 104 |
| 2001 | 1      | FA18 | COURSE 201 |
| 2001 | 2      | SP19 | COURSE 202 |
| 2001 | 3      | FA20 | COURSE 203 |

解决方案可以是SQL或python


Tags: 数据数量bucketline数学解决方案课程oracle
2条回答

如果您的bucket总是从1开始并且increment=1,那么您可以使用非常简单的mod(row_number, max(bucket))

select
   line, term, course
  ,1+mod(-1+row_number()over(partition by line order by course),max(bucket)) bucket_n
from t 
group by line, term, course
order by line,course;

示例数据的完整示例:

with t(LINE ,BUCKET ,TERM ,COURSE) as (
   select 1001, 1, 'FA18', 'COURSE 101' from dual union all
   select 1001, 1, 'SP19', 'COURSE 102' from dual union all
   select 1001, 1, 'SP19', 'COURSE 103' from dual union all
   select 1001, 1, 'FA19', 'COURSE 104' from dual union all
   select 1001, 2, 'FA18', 'COURSE 101' from dual union all
   select 1001, 2, 'SP19', 'COURSE 102' from dual union all
   select 1001, 2, 'SP19', 'COURSE 103' from dual union all
   select 1001, 2, 'FA19', 'COURSE 104' from dual union all
   select 2001, 1, 'FA18', 'COURSE 201' from dual union all
   select 2001, 1, 'SP19', 'COURSE 202' from dual union all
   select 2001, 1, 'FA20', 'COURSE 203' from dual union all
   select 2001, 2, 'FA18', 'COURSE 201' from dual union all
   select 2001, 2, 'SP19', 'COURSE 202' from dual union all
   select 2001, 2, 'FA20', 'COURSE 203' from dual union all
   select 2001, 3, 'FA18', 'COURSE 201' from dual union all
   select 2001, 3, 'SP19', 'COURSE 202' from dual union all
   select 2001, 3, 'FA20', 'COURSE 203' from dual
)
select
   line, term, course
  ,1+mod(-1+row_number()over(partition by line order by course),max(bucket)) bucket_n
from t 
group by line, term, course
order by line,course;

结果:

LINE    TERM    COURSE      BUCKET_N
   -    -      -     
1001    FA18    COURSE 101  1
1001    SP19    COURSE 102  2
1001    SP19    COURSE 103  1
1001    FA19    COURSE 104  2
2001    FA18    COURSE 201  1
2001    SP19    COURSE 202  2
2001    FA20    COURSE 203  3

另一个有趣的变体是聚合存储桶并通过位置mod(rownumber, count(buckets))提取值-与上一个解决方案相反,它适用于任何存储桶:

select
   line, term, course
  ,xmlcast(
     xmlelement(
        "buckets",  
        xmlagg(xmlelement("bucket", bucket))
     ).extract('/buckets/*['||
      (1+mod(-1+row_number()over(partition by line order by course),count(bucket)))
     ||']')
    as int) bucket_n_2
from t 
group by line, term, course
order by line,course;

完整测试用例:

with t(LINE ,BUCKET ,TERM ,COURSE) as (
   select 1001, 1, 'FA18', 'COURSE 101' from dual union all
   select 1001, 1, 'SP19', 'COURSE 102' from dual union all
   select 1001, 1, 'SP19', 'COURSE 103' from dual union all
   select 1001, 1, 'FA19', 'COURSE 104' from dual union all
   select 1001, 2, 'FA18', 'COURSE 101' from dual union all
   select 1001, 2, 'SP19', 'COURSE 102' from dual union all
   select 1001, 2, 'SP19', 'COURSE 103' from dual union all
   select 1001, 2, 'FA19', 'COURSE 104' from dual union all
   select 2001, 1, 'FA18', 'COURSE 201' from dual union all
   select 2001, 1, 'SP19', 'COURSE 202' from dual union all
   select 2001, 1, 'FA20', 'COURSE 203' from dual union all
   select 2001, 2, 'FA18', 'COURSE 201' from dual union all
   select 2001, 2, 'SP19', 'COURSE 202' from dual union all
   select 2001, 2, 'FA20', 'COURSE 203' from dual union all
   select 2001, 3, 'FA18', 'COURSE 201' from dual union all
   select 2001, 3, 'SP19', 'COURSE 202' from dual union all
   select 2001, 3, 'FA20', 'COURSE 203' from dual
)
select
   line, term, course
  ,1+mod(-1+row_number()over(partition by line order by course),max(bucket)) bucket_n
  ,xmlcast(
     xmlelement(
        "buckets",  
        xmlagg(xmlelement("bucket", bucket))
     ).extract('/buckets/*['||
      (1+mod(-1+row_number()over(partition by line order by course),count(bucket)))
     ||']')
    as int) bucket_n_2
from t 
group by line, term, course
order by line,course;
  • xmlagg(xmlelement("bucket", bucket))聚合所有存储桶编号
  • extract('/buckets/*[N])-从聚合值中提取N的存储桶
  • (1+mod(-1+row_number()over(partition by line order by course),count(bucket)))-计算第n个存储桶

结果:BUCKET_N-先前,BUCKET_N_2-新变体:

LINE    TERM    COURSE      BUCKET_N    BUCKET_N_2
1001    FA18    COURSE 101          1           1
1001    SP19    COURSE 102          2           2
1001    SP19    COURSE 103          1           1
1001    FA19    COURSE 104          2           2
2001    FA18    COURSE 201          1           1
2001    SP19    COURSE 202          2           3
2001    FA20    COURSE 203          3           2

基本思想是row_number()。如果您只想对桶进行随机抽样:

select t.*
from (select t.*,
             row_number() over (partition by line, course order by dbms_random.random) as seqnum
      from t
     ) t
where seqnum = 1;

如果您确实希望在存储桶上进行循环(以确保选择了最大数量),那么随机性是不够的:

select t.*
from (select t.*,
             row_number() over (partition by line, course order by seqnum_bucket, dbms_random.random) as seqnum
      from (select t.*
                   row_number() over (partition by line, course, bucket order by dbms_random.random) as seqnum_bucket
            from t
           ) t
     ) t
where seqnum = 1;

相关问题 更多 >

    热门问题