PySpark: generating multiple files from a DataFrame groupBy

grouped = df.groupby("country_code")

# run this to generate separate Excel files
for country_code, group in grouped:
    group.to_excel(excel_writer=f"{country_code}.xlsx", sheet_name=country_code, index=False)

2 answers

User

#1 · edited 2024-10-01 09:35:20

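If your requirement is to save each country's data in a separate file, you can do that by partitioning the data on write. You will get one folder per country rather than one file, because Spark's DataFrame writer does not write to a single named file.

Spark creates a folder whenever the DataFrame writer is called: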

df.write.partitionBy('country_code').csv(path)

The output will be multiple folders, each containing the data for the corresponding country:

path/country_code=india/part-0000.csv
path/country_code=australia/part-0000.csv
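To make that layout concrete, here is a minimal sketch in plain Python (no Spark needed) that walks an output directory and maps each partition value to the part files under it. The helper name `list_partitions` is hypothetical, and the sketch assumes the output sits on a local filesystem and uses the `column=value` folder naming shown above.

```python
from pathlib import Path


def list_partitions(root, column="country_code"):
    """Map each partition value to the names of the part files in its folder."""
    prefix = f"{column}="
    result = {}
    for folder in sorted(Path(root).iterdir()):
        # Skip marker files such as _SUCCESS and anything that is not a partition folder.
        if not folder.is_dir() or not folder.name.startswith(prefix):
            continue
        value = folder.name[len(prefix):]
        # "part-*" does not match the hidden ".part-*.crc" checksum files.
        result[value] = sorted(p.name for p in folder.glob("part-*"))
    return result
```

Calling `list_partitions(path)` on the example above would return one key per country, each with a list of the part files written for it.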

If you want exactly one file in each folder, repartition the data on the same column before writing:

df.repartition('country_code').write.partitionBy('country_code').csv(path)
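If what you ultimately need is one flat file per country, as in the pandas loop from the question, one option is a small post-processing step once the Spark job has finished. The sketch below is plain Python rather than a Spark API; `flatten_partitions` is a hypothetical helper, and it assumes a local filesystem and exactly one part file per folder. Note that with `partitionBy`, the `country_code` column lives in the folder name and is not repeated inside the CSV files.

```python
import shutil
from pathlib import Path


def flatten_partitions(root, out_dir, column="country_code", ext=".csv"):
    """Copy the single part file of each partition folder to <out_dir>/<value><ext>."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    prefix = f"{column}="
    written = []
    for folder in sorted(Path(root).glob(f"{prefix}*")):
        if not folder.is_dir():
            continue
        parts = sorted(folder.glob("part-*"))
        # Fail loudly instead of silently picking one file out of several.
        if len(parts) != 1:
            raise ValueError(f"expected one part file in {folder}, found {len(parts)}")
        target = out / f"{folder.name[len(prefix):]}{ext}"
        shutil.copyfile(parts[0], target)
        written.append(target.name)
    return written
```

For the example output, `flatten_partitions(path, "by_country")` would produce `by_country/australia.csv` and `by_country/india.csv`. Copying rather than moving leaves the original Spark output untouched.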

User

#2 · edited 2024-10-01 09:35:20

Use partitionBy when writing, so that the output is split into one partition per value of the column you specify (country_code).

Here is more on this.
