<p>Consider the two options below.</p>
<p>Note that I am using a slightly modified data example; you will see why (I hope).</p>
<pre><code>with `project.dataset.table` as (
select '2021-01-01 00:01:00' sent , 'email4@example.com' recipient union all
select '2021-01-01 00:02:00', 'email2@example.com' union all
select '2021-01-01 00:03:00', 'email4@example.com' union all
select '2021-01-01 00:04:00', 'email3@example.com' union all
select '2021-01-01 00:05:00', 'email4@example.com' union all
select '2021-01-01 00:06:00', 'email2@example.com'
)
</code></pre>
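<p>Option 1:</p>
<p>Use this when the order of the emails should be established before the unique_id is assigned, for example by the <code>sent</code> column. In that case, consider the following:</p>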
<pre><code>#standardSQL
create temp function factorize(item string, list any type) as ((
select unique_id from (
select as struct recipient, row_number() over(order by min(sent)) - 1 unique_id
from unnest(list)
group by recipient
)
where recipient = item
));
select t.*,
factorize(recipient, array_agg(struct(recipient, sent)) over()) unique_id
from `project.dataset.table` t
</code></pre>
<p>with output</p>
<p><a href="https://i.stack.imgur.com/Nf9Q2.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/Nf9Q2.png" alt="enter image description here"/></a></p>
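<p>If a UDF is not required, the same first-appearance numbering can probably be obtained with window functions alone. This is only a sketch, not part of the approach above: <code>first_sent</code> is a name introduced here for illustration, and <code>recipient</code> is added to the ordering as an assumed tie-breaker so that two recipients sharing the same earliest <code>sent</code> value do not receive the same id.</p>
<pre><code>#standardSQL
-- first_sent (hypothetical helper column): earliest sent value per recipient
select * except(first_sent),
  -- dense_rank numbers the distinct (first_sent, recipient) pairs 1, 2, 3, ...
  -- subtracting 1 makes the id zero-based
  dense_rank() over(order by first_sent, recipient) - 1 unique_id
from (
  select *, min(sent) over(partition by recipient) first_sent
  from `project.dataset.table`
)
</code></pre>
<p>With the sample data above, this should assign 0 to email4@example.com, 1 to email2@example.com and 2 to email3@example.com.</p>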
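<p>Option 2:</p>
<p>If the ordering does not matter and alphabetical order is good enough, consider the simpler query below, which uses the built-in <code>range_bucket</code> function.</p>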
<pre><code>#standardSQL
create temp function factorize(item string, list any type) as (
range_bucket(item, list) - 1
);
with all_recipients as (
select array_agg(recipient order by recipient) recipients from (
select recipient
from `project.dataset.table`
group by recipient
)
)
select t.*,
factorize(recipient, recipients) unique_id
from `project.dataset.table` t, all_recipients
</code></pre>
<p>with output</p>
<p><a href="https://i.stack.imgur.com/ymW8i.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/ymW8i.png" alt="enter image description here"/></a></p>
<p>Obviously, in this case you can skip the UDF and simply use <code>range_bucket</code> directly in the final select (instead of inside the UDF):</p>
<pre><code>select t.*,
range_bucket(recipient, recipients) - 1 unique_id
</code></pre>
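<p>To see why subtracting 1 gives a zero-based id: for a point that is present in the array, <code>range_bucket</code> returns the number of array elements less than or equal to it, so the first element maps to 1. The array has to be sorted in ascending order, which is why <code>array_agg</code> uses <code>order by recipient</code>. The standalone sketch below uses a hard-coded array literal (an assumption for illustration, standing in for <code>recipients</code>):</p>
<pre><code>#standardSQL
-- the literal array stands in for the sorted list of distinct recipients
select item,
  range_bucket(item, ['email2@example.com', 'email3@example.com', 'email4@example.com']) - 1 unique_id
from unnest(['email2@example.com', 'email3@example.com', 'email4@example.com']) item
</code></pre>
<p>This should return 0, 1 and 2 for email2, email3 and email4 respectively.</p>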