df.write fails with PySpark 3.0+ installed with Hadoop 3.2

Posted 2024-06-28 05:55:37


I tried spark-3.0.2 with bin-hadoop2.7 on Windows 10 and it worked fine. ✅

However... 😟

When I try to install pyspark-3.1.1 with bin-hadoop3.2, I get an error on df.write...
I need a pyspark version 3.0+ built against Hadoop 3.2+, because I want to connect Spark to Azure ADLS and that is a requirement.

I followed the same steps as before (installed the matching pyspark version, added winutils, set PYSPARK_HOME and HADOOP_HOME, and added them to PATH). I can successfully start a Spark session and create a DataFrame with df = spark.range(0, 5), but I get the error below ❌ with df.write.csv('mycsv.csv'), df.write.parquet('myp.parquet'), etc.

The error is as follows:

Traceback (most recent call last):
  File "<input>", line 23, in <module>
  File "C:\Users\user\Work\RAP2.0\.rap2_env\lib\site-packages\pyspark\sql\readwriter.py", line 1158, in saveAsTable
    self._jwrite.saveAsTable(name)
  File "C:\Users\user\Work\RAP2.0\.rap2_env\lib\site-packages\py4j\java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\Users\user\Work\RAP2.0\.rap2_env\lib\site-packages\pyspark\sql\utils.py", line 111, in deco
    return f(*a, **kw)
  File "C:\Users\user\Work\RAP2.0\.rap2_env\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o49.saveAsTable.
: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.createDirectoryWithMode0(Ljava/lang/String;I)V
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.createDirectoryWithMode0(Native Method)
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.createDirectoryWithMode(NativeIO.java:560)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:534)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:587)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:559)
    at org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:705)
    at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.liftedTree1$1(InMemoryCatalog.scala:121)
    at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.createDatabase(InMemoryCatalog.scala:118)
    at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:137)
    at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:124)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.$anonfun$catalog$1(BaseSessionStateBuilder.scala:140)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog$lzycompute(SessionCatalog.scala:98)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog(SessionCatalog.scala:98)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.databaseExists(SessionCatalog.scala:266)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireDbExists(SessionCatalog.scala:191)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableRawMetadata(SessionCatalog.scala:487)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:474)
    at org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.loadTable(V2SessionCatalog.scala:65)
    at org.apache.spark.sql.connector.catalog.DelegatingCatalogExtension.loadTable(DelegatingCatalogExtension.java:68)
    at org.apache.spark.sql.delta.catalog.DeltaCatalog.loadTable(DeltaCatalog.scala:147)
    at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:637)
    at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:623)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Unknown Source)

I suspect this is related to the winutils files for 3.2.0 that I got from here: https://github.com/cdarlint/winutils. Maybe they are invalid?
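One quick thing to rule out (a diagnostic sketch, assuming HADOOP_HOME points at the winutils install): on Hadoop 3.x, the NativeIO$Windows methods named in the traceback are implemented in hadoop.dll, so winutils.exe alone is not enough; the matching hadoop.dll from the same repository version must also be present in the bin folder that is on PATH.

```python
import os

# Assumes HADOOP_HOME is set; falls back to a hypothetical path otherwise.
hadoop_bin = os.path.join(
    os.environ.get("HADOOP_HOME", r"C:\hadoop\hadoop-3.2.0"), "bin"
)

# Check for both native files: winutils.exe covers the CLI shims, while
# hadoop.dll carries the JNI entry points (NativeIO$Windows.*) that the
# UnsatisfiedLinkError complains about.
status = {
    name: os.path.exists(os.path.join(hadoop_bin, name))
    for name in ("winutils.exe", "hadoop.dll")
}
print(status)
```

If hadoop.dll is missing, or comes from a different Hadoop version than the one Spark was built against, this exact UnsatisfiedLinkError is the typical symptom; some setups also need the DLL copied to C:\Windows\System32 or the bin folder added to java.library.path.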

Any ideas?


Tags: org, hadoop, sql, apache, java, spark, pyspark