Writing a large Spark DataFrame to a CSV file

import os
import pandas as pd
# wfu and HDFSUtil are the asker's own helper modules (not shown here)

def toCSV(spark_df, n=None, save_csv=None, csv_sep=',', csv_quote='"'):
    """get spark_df from hadoop and save to a csv file

    Parameters
    ----------
    spark_df: incoming dataframe
    n: number of rows to get
    save_csv=None: filename for exported csv

    Returns
    -------
    """
    # use the more robust method
    # set temp names
    tmpfilename = save_csv or (wfu.random_filename() + '.csv')
    tmpfoldername = wfu.random_filename()
    print(n)
    # write sparkdf to hadoop, get n rows if specified
    if n:
        spark_df.limit(n).write.csv(tmpfoldername, sep=csv_sep, quote=csv_quote)
    else:
        spark_df.write.csv(tmpfoldername, sep=csv_sep, quote=csv_quote)
    # get merge file from hadoop
    HDFSUtil.getmerge(tmpfoldername, tmpfilename)
    HDFSUtil.rmdir(tmpfoldername)
    # read into pandas df, remove tmp csv file
    pd_df = pd.read_csv(tmpfilename, names=spark_df.columns, sep=csv_sep, quotechar=csv_quote)
    os.remove(tmpfilename)
    # re-write the csv file with header!
    if save_csv is not None:
        pd_df.to_csv(save_csv, sep=csv_sep, quotechar=csv_quote)

2 answers

User

Answer 1 · edited 2024-09-30 14:27:20

If the DataFrame is too big, how can I avoid using Pandas?

Just save the file to HDFS, S3, or whatever distributed storage you have.
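A minimal sketch of that approach, assuming PySpark. The helper name `write_csv_to_storage` and the output path in the usage note are hypothetical. The function takes the DataFrame as an argument and relies only on `DataFrame.limit` and `DataFrameWriter.csv` (with its `sep`, `quote`, `header`, and `mode` parameters), so no rows pass through the driver or Pandas.

```python
def write_csv_to_storage(spark_df, path, n=None, sep=",", quote='"'):
    """Write spark_df as CSV part files under `path` (a directory on HDFS, S3, ...)."""
    # Optionally keep only the first n rows; "is not None" so that n=0 is honoured.
    if n is not None:
        spark_df = spark_df.limit(n)
    # Each executor writes its own partitions as part files under `path`;
    # nothing is collected on the driver.
    spark_df.write.csv(path, sep=sep, quote=quote, header=True, mode="overwrite")
    return path
```

It could be called as `write_csv_to_storage(df, "hdfs:///tmp/export_csv", n=1000)`. Note that with `header=True` Spark writes a header line into every part file, which matters if the parts are concatenated later.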

Is directly writing to a csv using file I/O a better way? Can it preserve the separators?

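If you mean saving the file to local storage, that will still cause an OOM exception, because all the data has to be moved into memory on the local machine.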

Using df.coalesce(1).write.option("header", "true").csv('mycsv.csv') will cause the header to be written in each file and when the files are merged, it will have headers in the middle. Am I wrong?

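In this case you will have only one file (because of coalesce(1)), so you don't need to worry about headers. What you should worry about instead is memory on the executor: all the data will be moved to that one executor, so you may get an OOM there.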

Is using Spark write followed by hadoop getmerge better than using coalesce in terms of performance?

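Definitely better (but don't use coalesce()). Spark will write the data to storage efficiently, HDFS will then replicate it, and after that getmerge can efficiently read the data from the nodes and merge it.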
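The merge step can also add the header without going through Pandas. The sketch below is an illustration, not what `getmerge` itself does: it assumes the part files were written without headers and have already been copied to a local directory, and `merge_part_files` is a hypothetical helper name. It writes the header once and then streams each part file into the output in chunks, so memory use stays bounded regardless of file size.

```python
import csv
import glob
import os
import shutil


def merge_part_files(part_dir, out_path, columns, sep=",", quote='"'):
    """Concatenate part-* files from part_dir into out_path under one header row."""
    # "part-*" skips Spark's _SUCCESS marker and hidden .crc files.
    parts = sorted(glob.glob(os.path.join(part_dir, "part-*")))
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        # Write the header once, with the same separator and quote character.
        writer = csv.writer(out, delimiter=sep, quotechar=quote, lineterminator="\n")
        writer.writerow(columns)
        for part in parts:
            with open(part, "r", newline="", encoding="utf-8") as src:
                # Copy in fixed-size chunks instead of reading whole files.
                shutil.copyfileobj(src, out)
    return len(parts)
```

With the question's variables this would be called as `merge_part_files(local_dir, save_csv, spark_df.columns, csv_sep, csv_quote)`, where `local_dir` is a hypothetical local copy of the Spark output directory.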

User

Answer 2 · edited 2024-09-30 14:27:20

We used the Databricks library. It works well.

df.save("com.databricks.spark.csv", SaveMode.Overwrite, Map("delimiter" -> delim, "nullValue" -> "-", "path" -> tempFPath))

Library: spark-csv from Databricks (the com.databricks.spark.csv data source named in the call above).
