<p>I am trying to build a Keras <a href="https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer" rel="nofollow noreferrer">Tokenizer</a> from a single column of hundreds of large CSV files. Dask seems like a good tool for this. My current approach eventually causes memory problems:</p>
<pre><code>import dask.dataframe as dd
from tensorflow.keras.preprocessing.text import Tokenizer

df = dd.read_csv('data/*.csv', usecols=['MyCol'])

# Process the column and get the underlying NumPy array.
# This greatly reduces memory consumption, but .compute() eventually
# materializes the entire dataset into memory.
my_ids = df.MyCol.apply(process_my_col).compute().values

tokenizer = Tokenizer()
tokenizer.fit_on_texts(my_ids)
</code></pre>
<p>How can I do this one piece at a time? Something along the lines of:</p>
<pre><code>import pandas as pd

df = pd.read_csv('a-single-file.csv', chunksize=1000)
for chunk in df:
    # Process one chunk at a time.
    ...
</code></pre>
<p>I would also recommend <code>map_partitions</code> when it suits your problem. However, if you really need sequential access and an API similar to <code>read_csv(chunksize=...)</code>, look at the <code>partitions</code> attribute:</p>
<pre><code>for part in df.partitions:
    process(model, part.compute())
</code></pre>
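<p>Putting this together with your case: <code>fit_on_texts</code> accumulates word counts across repeated calls, so you can fit the Tokenizer one partition at a time without ever holding the full dataset in memory. A minimal sketch, assuming <code>process_my_col</code> from your question returns the texts/ID sequences the Tokenizer expects:</p>
<pre><code>import dask.dataframe as dd
from tensorflow.keras.preprocessing.text import Tokenizer

df = dd.read_csv('data/*.csv', usecols=['MyCol'])

tokenizer = Tokenizer()
for part in df.partitions:
    # Materialize only this one partition as a pandas DataFrame.
    pdf = part.compute()
    # process_my_col is the question's own preprocessing step.
    texts = pdf.MyCol.apply(process_my_col)
    # fit_on_texts updates the tokenizer's internal counts incrementally,
    # so fitting partition by partition gives the same vocabulary as
    # fitting on the whole dataset at once.
    tokenizer.fit_on_texts(texts)
</code></pre>
<p>Peak memory is then bounded by the largest single partition (plus the tokenizer's vocabulary), which you can control via the <code>blocksize</code> argument to <code>dd.read_csv</code>.</p>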