ANSI SQL equivalent of pandas `factorize()`?

Some background

The function assigns a unique ID to each distinct value in the given column. For emails such as email1@example.com, email2@example.com, email1@example.com, email3@example.com, email1@example.com, email2@example.com, it returns [0, 1, 0, 2, 0, 1], and so on.
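To make the target behavior concrete, here is a minimal pure-Python sketch of the numbering described above. The helper name `factorize_sketch` is hypothetical, and this is not how pandas implements `factorize()`; it only mirrors the first-appearance, 0-based numbering from the question.

```python
def factorize_sketch(values):
    """Assign 0-based codes in order of first appearance (illustrative only)."""
    seen = {}   # value -> code assigned when the value first appeared
    codes = []
    for value in values:
        if value not in seen:
            seen[value] = len(seen)
        codes.append(seen[value])
    # dicts preserve insertion order, so the keys are the uniques in code order
    return codes, list(seen)


emails = [
    "email1@example.com", "email2@example.com", "email1@example.com",
    "email3@example.com", "email1@example.com", "email2@example.com",
]
codes, uniques = factorize_sketch(emails)
print(codes)  # [0, 1, 0, 2, 0, 1]
```

Note that the codes follow the order in which values first appear, not alphabetical order; that distinction matters for the SQL answers below.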

My table has other columns as well, so RANK() and ROW_NUMBER() on their own don't seem to help. I'm trying to find a way around that.

2 answers

User

Answer 1 · edited 2024-09-25 02:35:47

You can use the DENSE_RANK() window function for this:

-- DENSE_RANK() is 1-based; subtract 1 if 0-based IDs like factorize() are needed
select dataset.*, DENSE_RANK() OVER (ORDER BY email) AS unique_id
from dataset
order by sent;

This produces the following result (using Mikhail Berlyant's example data as a starting point):

[result table not preserved]
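As a rough way to try the DENSE_RANK() idea without a database server, the sketch below runs it against an in-memory SQLite table from Python. The table contents are copied from the sample data in Answer 2, the `email` column name follows the query above, and window functions are assumed to be available (SQLite 3.25 or later). It subtracts 1 to get 0-based IDs.

```python
import sqlite3

# Hypothetical in-memory table; rows mirror the sample data from Answer 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (sent TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?)",
    [
        ("2021-01-01 00:01:00", "email4@example.com"),
        ("2021-01-01 00:02:00", "email2@example.com"),
        ("2021-01-01 00:03:00", "email4@example.com"),
        ("2021-01-01 00:04:00", "email3@example.com"),
        ("2021-01-01 00:05:00", "email4@example.com"),
        ("2021-01-01 00:06:00", "email2@example.com"),
    ],
)

# DENSE_RANK() starts at 1, so subtract 1 for 0-based IDs.
rows = conn.execute(
    """
    SELECT sent, email, DENSE_RANK() OVER (ORDER BY email) - 1 AS unique_id
    FROM dataset
    ORDER BY sent
    """
).fetchall()

dense_rank_ids = [unique_id for _, _, unique_id in rows]
print(dense_rank_ids)  # [2, 0, 2, 1, 2, 0]
```

The IDs are unique per email but follow alphabetical order, so on this data they differ from the first-appearance numbering that `factorize()` would give ([0, 1, 0, 2, 0, 1]).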

User

Answer 2 · edited 2024-09-25 02:35:47

Consider the two options below.

Note that I am using a slightly modified data sample; you will see why (I hope).

with `project.dataset.table` as (
  select '2021-01-01 00:01:00' sent , 'email4@example.com' recipient  union all 
  select '2021-01-01 00:02:00', 'email2@example.com' union all 
  select '2021-01-01 00:03:00', 'email4@example.com' union all 
  select '2021-01-01 00:04:00', 'email3@example.com' union all 
  select '2021-01-01 00:05:00', 'email4@example.com' union all 
  select '2021-01-01 00:06:00', 'email2@example.com'
)

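Option 1:

Use this when the order of the emails must be fixed before the unique IDs are assigned, for example by the sent column. In that case, consider the following: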

#standardSQL
create temp function factorize(item string, list any type) as ((
  select unique_id from (
    select as struct recipient, row_number() over(order by min(sent)) - 1 unique_id
    from unnest(list)
    group by recipient
  ) 
  where recipient = item
));
select t.*, 
  factorize(recipient, array_agg(struct(recipient, sent)) over()) unique_id 
from `project.dataset.table` t

with output: [output table not preserved]
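For engines without BigQuery's temp functions, a similar first-appearance numbering can be sketched with plain window functions. This is an alternative technique, not the UDF above: it computes each recipient's earliest `sent` value and then dense-ranks by it. The table name `emails` is hypothetical, and the sketch assumes SQLite 3.25 or later, run from Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (sent TEXT, recipient TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO emails VALUES (?, ?)",
    [
        ("2021-01-01 00:01:00", "email4@example.com"),
        ("2021-01-01 00:02:00", "email2@example.com"),
        ("2021-01-01 00:03:00", "email4@example.com"),
        ("2021-01-01 00:04:00", "email3@example.com"),
        ("2021-01-01 00:05:00", "email4@example.com"),
        ("2021-01-01 00:06:00", "email2@example.com"),
    ],
)

# Inner query: earliest sent time per recipient.
# Outer query: dense-rank by that time; recipient breaks ties so that two
# recipients sharing the same first timestamp still get distinct IDs.
rows = conn.execute(
    """
    SELECT sent, recipient,
           DENSE_RANK() OVER (ORDER BY first_sent, recipient) - 1 AS unique_id
    FROM (
        SELECT sent, recipient,
               MIN(sent) OVER (PARTITION BY recipient) AS first_sent
        FROM emails
    ) AS firsts
    ORDER BY sent
    """
).fetchall()

first_seen_ids = [unique_id for _, _, unique_id in rows]
print(first_seen_ids)  # [0, 1, 0, 2, 0, 1]
```

Here email4@example.com gets ID 0 because it appears first, which matches the numbering `factorize()` would produce for the same sequence.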

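Option 2:

If the ordering is not important and alphabetical order is acceptable, consider the simpler query below, which uses the built-in range_bucket function.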

#standardSQL
create temp function factorize(item string, list any type) as (
  range_bucket(item, list) - 1 
);
with all_recipients as (
  select array_agg(recipient order by recipient) recipients from (
    select recipient
    from `project.dataset.table`
    group by recipient
  )
)
select t.*,
  factorize(recipient, recipients) unique_id
from `project.dataset.table` t, all_recipients

with output: [output table not preserved]

Obviously, in this case you can skip the UDF and call range_bucket directly in the final select:

select t.*,
  range_bucket(recipient, recipients) - 1 unique_id
from `project.dataset.table` t, all_recipients
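To see why `range_bucket(...) - 1` in Option 2 yields a 0-based ID, the idea can be sketched in Python with `bisect`. This assumes that RANGE_BUCKET returns the number of sorted array boundaries that are less than or equal to the point, which is my reading of its behavior; `bisect_right` does the same on a sorted list.

```python
from bisect import bisect_right

recipients_seen = [
    "email4@example.com", "email2@example.com", "email4@example.com",
    "email3@example.com", "email4@example.com", "email2@example.com",
]

# Sorted distinct values play the role of the `recipients` array in the query.
boundaries = sorted(set(recipients_seen))

# For a value that is itself a boundary at index i, bisect_right gives i + 1,
# so subtracting 1 recovers its 0-based position in the sorted list.
range_bucket_ids = [bisect_right(boundaries, r) - 1 for r in recipients_seen]
print(range_bucket_ids)  # [2, 0, 2, 1, 2, 0]
```

As with DENSE_RANK(), the IDs follow alphabetical order rather than first appearance.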
