Running the spark-nlp DocumentAssembler on EMR

Posted 2024-06-28 07:09:26


I am trying to run spark-nlp on EMR. I log in to my Zeppelin notebook and run the following:

import sparknlp
from pyspark.sql import SparkSession  # needed for SparkSession.builder below

spark = SparkSession.builder \
    .appName("BBC Text Categorization") \
    .config("spark.driver.memory", "8G") \
    .config("spark.memory.offHeap.enabled", True) \
    .config("spark.memory.offHeap.size", "8G") \
    .config("spark.driver.maxResultSize", "2G") \
    .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.4.5") \
    .config("spark.kryoserializer.buffer.max", "1000M") \
    .config("spark.network.timeout", "3600s") \
    .getOrCreate()

from sparknlp.base import DocumentAssembler

# Wrap the raw text in the "description" column in spark-nlp's document annotation type
documentAssembler = DocumentAssembler() \
    .setInputCol("description") \
    .setOutputCol("document")

This results in the following error:

Fail to execute line 1: documentAssembler = DocumentAssembler()\
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-4581426413302524147.py", line 380, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 110, in wrapper
    return func(self, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/sparknlp/base.py", line 148, in __init__
    super(DocumentAssembler, self).__init__(classname="com.johnsnowlabs.nlp.DocumentAssembler")
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 110, in wrapper
    return func(self, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/sparknlp/internal.py", line 72, in __init__
    self._java_obj = self._new_java_obj(classname, self.uid)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 67, in _new_java_obj
    return java_obj(*java_args)
TypeError: 'JavaPackage' object is not callable
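
This TypeError usually means the spark-nlp jar never made it onto the JVM classpath, so com.johnsnowlabs.nlp.DocumentAssembler resolves to an empty JavaPackage instead of a class. A quick check from the notebook (only a debugging sketch; _jsc is an internal PySpark handle):

import sparknlp
print(sparknlp.version())                                # version of the Python wheel
print(spark.conf.get("spark.jars.packages", "not set"))  # what the running session was asked to load
print(spark.sparkContext._jsc.sc().listJars())           # jars actually visible to the JVM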

To understand the problem, I tried logging in to the master node and running the same code in a pyspark console. If I start the console with pyspark --packages JohnSnowLabs:spark-nlp:2.4.5, everything works and the error above does not appear.

However, when I start it with plain pyspark, I get the same error as before.
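
Presumably the Zeppelin equivalent is to hand the same property to the Spark interpreter before it launches, either in the interpreter settings or in spark-defaults.conf. A sketch of the lines I think would be involved (not verified; the exact placement is an assumption on my part):

# /etc/spark/conf/spark-defaults.conf on the master node, or the same keys
# added as properties of Zeppelin's spark interpreter
spark.jars.packages              JohnSnowLabs:spark-nlp:2.4.5
spark.kryoserializer.buffer.max  1000M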

How can I get this to work in my Zeppelin notebook?

Setup details:

EMR 5.27.0
spark 2.4.4
openjdk version "1.8.0_272"
OpenJDK Runtime Environment (build 1.8.0_272-b10)
OpenJDK 64-Bit Server VM (build 25.272-b10, mixed mode)

Here is my bootstrap script:

#!/bin/bash
sudo yum install -y python36-devel python36-pip python36-setuptools python36-virtualenv

sudo python36 -m pip install --upgrade pip

sudo python36 -m pip install pandas

sudo python36 -m pip install boto3

# re is part of the Python standard library and does not need to be pip-installed

sudo python36 -m pip install spark-nlp==2.7.2
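
One thing I notice while writing this up: the wheel installed here (spark-nlp==2.7.2) does not match the jar version requested in the notebook (2.4.5), and spark-nlp generally expects the Python package and the Scala jar to be the same version. If 2.4.5 is the target, the bootstrap line would presumably be:

sudo python36 -m pip install spark-nlp==2.4.5  # keep in sync with spark.jars.packages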
