<blockquote>
<p>If the DataFrame is too big, how can I avoid using Pandas?</p>
</blockquote>
<p>You just save the file to HDFS, S3, or whatever distributed storage you have.</p>
<blockquote>
<p>Is directly writing to a csv using file I/O a better way? Can it
preserve the separators?</p>
</blockquote>
<p>If you mean saving the file to local storage, it will still cause an OOM exception, because all of the data has to be moved into your local machine's memory to do that.</p>
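<p>On the separator question: writing with Python's <code>csv</code> module (rather than raw string concatenation) does preserve separators, because fields containing the delimiter are quoted automatically. It does nothing about the OOM problem, though, since the rows still have to pass through local memory. A minimal sketch with made-up rows:</p>

```python
import csv
import io

# Hypothetical rows; in practice these would be streamed from the
# DataFrame (e.g. row by row) rather than collected all at once.
rows = [
    ["id", "comment"],
    [1, "plain value"],
    [2, "value, with a comma"],  # embedded separator
]

buf = io.StringIO()
writer = csv.writer(buf)  # quotes any field that contains the delimiter
writer.writerows(rows)

print(buf.getvalue())
```

<p>The field with the embedded comma comes out as <code>"value, with a comma"</code>, so a downstream CSV reader still sees three rows of two columns.</p>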
<blockquote>
<p>Using df.coalesce(1).write.option("header", "true").csv('mycsv.csv')
will cause the header to be written in each file and when the files
are merged, it will have headers in the middle. Am I wrong?</p>
</blockquote>
<p>In that case you have only one file (since you used <code>coalesce(1)</code>), so you don't need to worry about headers. What you should worry about instead is memory on the executor: all of the data is moved to that single executor, so you may get an OOM there.</p>
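<p>For the general case of more than one output partition, the concern in the question is real: each part file gets its own header row, so a naive byte-level merge puts headers in the middle. A small pure-Python simulation of this, with made-up part-file contents (no Spark involved):</p>

```python
# Each Spark partition writes its own CSV part file, each with a header.
part_files = [
    "id,name\n1,alice\n2,bob\n",  # hypothetical part-00000 contents
    "id,name\n3,carol\n",         # hypothetical part-00001 contents
]

# Naive merge: the second header lands in the middle of the data.
naive = "".join(part_files)

# Header-aware merge: keep the header from the first part only.
merged_lines = []
for i, part in enumerate(part_files):
    lines = part.splitlines()
    merged_lines.extend(lines if i == 0 else lines[1:])
merged = "\n".join(merged_lines) + "\n"

print(naive.count("id,name"))   # duplicated header
print(merged.count("id,name"))  # single header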
<blockquote>
<p>Using spark write and then hadoop getmerge is better than using
coalesce from the point of performance?</p>
</blockquote>
<p>Definitely better (just don't use <code>coalesce()</code>). Spark writes the partitions to storage efficiently in parallel, HDFS then replicates the data, and <code>getmerge</code> can read the parts from the nodes and merge them efficiently.</p>
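<p>The write-then-merge workflow above can be sketched as follows (the paths and output directory name are hypothetical, and this assumes a working Spark + HDFS environment; note that if each part file was written with a header, the duplicates still need stripping as discussed above):</p>

```shell
# 1. Let Spark write the partitions in parallel (no coalesce):
#      df.write.option("header", "true").csv("hdfs:///data/out/mycsv")
# 2. Merge the part files down to a single local CSV:
hadoop fs -getmerge /data/out/mycsv mycsv.csv
```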