Pypark单调地增加\u id（），不连续增加

unique_retailers = RETAILERS.select('ID').distinct().rdd.map(lambda r: r[0]).collect() CUSTOMERS = CUSTOMERS.orderBy(sf.col('REVENUE'), ascending=False) for i in unique_retailers: RANK = CUSTOMERS.select('ID').where(sf.substring(sf.col('ID'), 0, 1) == sf.lit(i)).withColumn('RANK_'+i, sf.monotonically_increasing_id()) RANK.show() CUSTOMERS = CUSTOMERS.join(RANK, ['ID'], 'left') CUSTOMERS.show()

+--------+--------------------+ |ID |RANK_1 | +--------+--------------------+ |1_502765| 0| |1_522762| 1| |1_532768| 17179869184| |1_452763| 68719476736| |1_522766| 94489280512| |1_512769| 214748364800| |1_542766| 223338299392| |1_452766| 549755813888| |1_542769| 549755813889| |1_512766| 721554505728| |1_132760| 962072674304| |1_522761| 996432412672| |1_542764| 1065151889408| |1_172765| 1151051235328| |1_542762| 1194000908288| |1_542765| 1245540515840| |1_532766| 1254130450432| |1_542760| 1400159338496| |1_172767| 1408749273088| |1_412764| 1511828488192| +--------+--------------------+

1条回答

网友

1楼 · 发布于 2024-09-29 07:22:26

单调并不意味着连续。你所经历的实际上是预期的行为。看看documentation。。。请记住，spark是分布式的，因此生成连续的索引虽然不是不可能的，但也不是微不足道的。你知道吗

scala def monotonically_increasing_id(): Column
A column expression that generates monotonically increasing 64-bit integers.
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
As an example, consider a DataFrame with two partitions, each with 3 records. This expression would return the following IDs:
0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.

相关问题更多 >

编程相关推荐

热门问题

热门文章