<p>I am using distributed Dask, and I am trying to create a dataframe from a CSV stored in HDFS.
I believe the connection to HDFS is successful, since I am able to print the dataframe's column names.
However, when I call the <strong>len</strong> function, or any other function, on the dataframe, I get the following error:</p>
<pre class="lang-py prettyprint-override"><code>pyarrow.lib.ArrowIOError: HDFS file does not exist: /user/F43479/trip_data_v2.csv
</code></pre>
<p>I don't understand why I am getting this error, and I would appreciate your input.</p>
<p>Here is my code:</p>
^{pr2}$
<p>Here are the contents of my HDFS home directory:</p>
<pre><code>[F43479@xxxxx dask_tests]$ hdfs dfs -ls /user/F43479/
Found 9 items
-rw-r----- 3 F43479 hdfs 0 2019-03-07 16:42 /user/F43479/-
drwx------ - F43479 hdfs 0 2019-04-03 02:00 /user/F43479/.Trash
drwxr-x--- - F43479 hdfs 0 2019-03-13 16:53 /user/F43479/.hiveJars
drwxr-x--- - F43479 hdfs 0 2019-03-13 16:52 /user/F43479/hive
drwxr-x--- - F43479 hdfs 0 2019-03-15 13:23 /user/F43479/nyctaxi_trip_data
-rw-r----- 3 F43479 hdfs 36 2019-04-15 11:13 /user/F43479/test.csv
-rw-r----- 3 F43479 hdfs 50486731416 2019-03-26 17:37 /user/F43479/trip_data.csv
-rw-r----- 3 F43479 hdfs 5097056230 2019-04-15 13:57 /user/F43479/trip_data_v2.csv
-rw-r----- 3 F43479 hdfs 504867312828 2019-04-02 11:15 /user/F43479/trip_data_x10.csv
</code></pre>
<p>Finally, here is the full output of running the code:</p>
<pre class="lang-py prettyprint-override"><code>Index(['vendor_id', 'passenger_count', 'trip_time_in_secs', 'trip_distance'], dtype='object')
Traceback (most recent call last):
  File "dask_pa_hdfs.py", line 32, in <module>
    print(len(df))
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/dask/dataframe/core.py", line 438, in __len__
    split_every=False).compute()
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/dask/base.py", line 156, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/dask/base.py", line 397, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/distributed/client.py", line 2321, in get
    direct=direct)
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/distributed/client.py", line 1655, in gather
    asynchronous=asynchronous)
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/distributed/client.py", line 673, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/distributed/utils.py", line 277, in sync
    six.reraise(*error[0])
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/distributed/utils.py", line 262, in f
    result[0] = yield future
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/distributed/client.py", line 1500, in _gather
    traceback)
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/dask/bytes/core.py", line 133, in read_block_from_file
    with copy.copy(lazy_file) as f:
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/dask/bytes/core.py", line 177, in __enter__
    f = SeekableFile(self.fs.open(self.path, mode=mode))
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/dask/bytes/pyarrow.py", line 37, in open
    return self.fs.open(path, mode=mode, **kwargs)
  File "pyarrow/io-hdfs.pxi", line 431, in pyarrow.lib.HadoopFileSystem.open
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS file does not exist: /user/F43479/trip_data_v2.csv
</code></pre>