Reading multiple CSV files where only the first file contains the header

header = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("path/file-1")
schema = header.schema

df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .schema(schema) \
    .load("path")

1 Answer

User

#1 · Posted on 2024-04-26 18:04:37

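Unfortunately, I don't think there is a simple way to do what you want. There is, however, a workaround that looks similar to what you are already doing: read the first file to get the schema, read all the files except the first with option("header", "false") and that schema, then union the first file with the rest.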

In Python, it would look like this:

first_file = "path/file-1"

# Read the only file that has a header row and infer the schema from it.
header = spark.read.option("header", "true") \
  .option("inferSchema", "true").csv(first_file)
schema = header.schema

# binaryFiles is used here simply to get the list of files in the folder.
# Note that the file contents are not read.
# Any other means of listing the files in a directory would work as well.
all_files = spark.sparkContext.binaryFiles("path") \
  .map(lambda x: x[0]).collect()
all_files_but_first = [f for f in all_files if not f.endswith(first_file)]

# Read the remaining header-less files with the inferred schema,
# then append the rows of the first file.
df = spark.read.option("header", "false") \
  .schema(schema).csv(all_files_but_first) \
  .union(header)
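
As the comments above say, any other way of listing the files would work in place of binaryFiles. Below is a minimal sketch of one alternative, assuming the files are on the local filesystem (it would not work as-is for HDFS or object storage). The helper name split_first_and_rest is hypothetical and not part of Spark.

```python
import glob
import os


def split_first_and_rest(directory, first_name):
    """Return (first_path, other_paths) for the regular files in `directory`.

    Hypothetical helper: assumes a local directory and that `first_name`
    is the one file that carries the header row.
    """
    # List regular files only, in a stable (sorted) order.
    all_files = sorted(
        p for p in glob.glob(os.path.join(directory, "*")) if os.path.isfile(p)
    )
    first_path = os.path.join(directory, first_name)
    # Compare absolute paths so the header file is excluded exactly once.
    rest = [
        p for p in all_files
        if os.path.abspath(p) != os.path.abspath(first_path)
    ]
    return first_path, rest
```

The returned list could then be passed to .csv(...) in place of all_files_but_first, and the first path to the header read.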
