<p><strong>1. Lazy evaluation</strong></p>
<p>Dask evaluates lazily. Calling <code>dataset</code> alone doesn't trigger any computation. You'll need to call <code>dataset.compute()</code> or <code>dataset.persist()</code> to trigger computation and inspect the dataframe. The existing answer suggesting <code>dataframe.head()</code> is essentially calling <code>.compute()</code> on a subset of the data. You can read more about what this means <a href="https://docs.dask.org/en/stable/user-interfaces.html?highlight=lazy%20evaluation#lazy-vs-immediate" rel="nofollow noreferrer">here in the Dask docs</a></p>
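<p>A minimal sketch of this behavior (using a small in-memory frame built with <code>dd.from_pandas</code>, chosen here just for illustration): defining the operation does no work, and only <code>.compute()</code> produces a concrete result.</p>
<pre class="lang-py prettyprint-override"><code>import pandas as pd
import dask.dataframe as dd

# build a small Dask DataFrame from an in-memory pandas frame
pdf = pd.DataFrame({"a": range(10), "b": range(10)})
ddf = dd.from_pandas(pdf, npartitions=2)

# this only builds the task graph -- nothing is computed yet
lazy_sum = ddf["a"].sum()

# .compute() triggers the actual work and returns a concrete number
result = lazy_sum.compute()
print(result)  # 45
</code></pre>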
<p><strong>2. Column pruning</strong></p>
<p>You may want to consider converting your dataset to Parquet. From <a href="https://coiled.io/blog/parquet-column-pruning-predicate-pushdown/" rel="nofollow noreferrer">this resource</a>: &quot;Parquet lets you read specific columns from a dataset without reading the entire file. This is called column pruning and can be a massive performance improvement.&quot;</p>
<p><strong>Toy code example</strong></p>
<pre class="lang-py prettyprint-override"><code>import dask.dataframe as dd

# read in your csv
dataset = dd.read_csv('your.csv')
# store as parquet
dataset.to_parquet('your.parquet', engine='pyarrow')
# read in the parquet file, loading only the selected columns
dataset = dd.read_parquet('your.parquet', columns=list_of_columns)
dataset.head()
</code></pre>