from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rdd = spark.sparkContext.parallelize([(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)])

def f(t):
    for c in range(0, 10):
        yield tuple((i + c) * 1664525 for i in t)

# Increase the range of this loop to create more data.
# After n iterations the RDD holds 10 ** n rows.
for _ in range(0, 2):
    rdd = rdd.flatMap(f)

rdd = rdd.repartition(int(spark.conf.get('spark.sql.shuffle.partitions')))
print(rdd.count())

# Write the result to a Parquet file.
df = spark.createDataFrame(rdd)
df.write.parquet("mytestdata")
The trick is to apply flatMap repeatedly in a loop: each pass replaces every tuple with 10 new tuples, so n passes turn the single seed tuple into 10 ** n rows.
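If you want to sanity-check the row-count growth without a Spark cluster, the same expansion can be sketched in plain Python. This is an illustrative local equivalent, not the Spark code itself: `itertools.chain.from_iterable` stands in for `flatMap`, and the generator `f` is the same one used above.

```python
from itertools import chain

def f(t):
    # Yield 10 derived tuples for every input tuple.
    for c in range(0, 10):
        yield tuple((i + c) * 1664525 for i in t)

rows = [(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)]  # single seed tuple
for _ in range(0, 2):
    # Local stand-in for rdd.flatMap(f): expand every tuple, then flatten.
    rows = list(chain.from_iterable(f(t) for t in rows))

print(len(rows))  # 10 ** 2 = 100 rows after two passes
```

Each iteration multiplies the row count by 10, which is why the loop bound controls the final data size.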