Unable to download NLTK corpora on AWS EMR: I/O operation on closed file


After opening my EMR cluster with JupyterLab, I am unable to download additional corpora with nltk.download().

Code:

nltk.download('wordnet')

Error:

I/O operation on closed file
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/nltk/downloader.py", line 817, in download
    show('Downloading collection %r' % msg.collection.id)
  File "/usr/local/lib/python3.6/site-packages/nltk/downloader.py", line 783, in show
    subsequent_indent=prefix + prefix2 + ' ' * 4,
  File "/tmp/4461650941863117011", line 534, in write
    super(UnicodeDecodingStringIO, self).write(s)
ValueError: I/O operation on closed file
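
The traceback shows the downloader's progress messages being written to an output stream that the notebook session has already closed (the write goes through the Livy/Sparkmagic output capture under /tmp). A minimal sketch of a possible workaround, assuming the download itself would otherwise succeed, is to silence that progress output and point the download at an explicit directory (/tmp/nltk_data here is just an example location):

import nltk

# quiet=True skips the progress messages whose write is failing above;
# download_dir makes the target directory explicit so it can be added to
# the search path afterwards.
nltk.download('wordnet', download_dir='/tmp/nltk_data', quiet=True)
nltk.data.path.append('/tmp/nltk_data')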

This is after confirming that nltk is installed with sc.list_packages():

Package                    Version
-------------------------- -------
...
nltk                       3.4.5  
...

and after importing it with import nltk.
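
As a side check (a diagnostic sketch, not from the original post), the notebook kernel's interpreter and search paths can be inspected directly:

import sys
import nltk

print(sys.executable)    # interpreter the notebook kernel is running
print(nltk.__version__)  # should match the 3.4.5 reported by sc.list_packages()
print(nltk.data.path)    # directories nltk will search for corpora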

The problem seems to come from my lack of understanding of how EMR is set up.

Is there anything I should try in order to debug this?

Update:

I tried installing it in a bootstrap script, and it installs correctly:

pip install nltk
python -m nltk.downloader wordnet

But when I try to use it, I still get this error:

An error occurred while calling o166.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 7, ip-172-31-1-163.ca-central-1.compute.internal, executor 3): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/nltk/corpus/util.py", line 86, in __load
    root = nltk.data.find('{}/{}'.format(self.subdir, zip_name))
  File "/usr/local/lib/python3.6/site-packages/nltk/data.py", line 701, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource wordnet not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('wordnet')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/wordnet.zip/wordnet/

  Searched in:
    - '/home/nltk_data'
    - '/mnt1/yarn/usercache/livy/appcache/application_1576604798325_0001/container_1576604798325_0001_01_000005/virtualenv_application_1576604798325_0001_0/nltk_data'
    - '/mnt1/yarn/usercache/livy/appcache/application_1576604798325_0001/container_1576604798325_0001_01_000005/virtualenv_application_1576604798325_0001_0/share/nltk_data'
    - '/mnt1/yarn/usercache/livy/appcache/application_1576604798325_0001/container_1576604798325_0001_01_000005/virtualenv_application_1576604798325_0001_0/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt1/yarn/usercache/livy/appcache/application_1576604798325_0001/container_1576604798325_0001_01_000005/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/mnt1/yarn/usercache/livy/appcache/application_1576604798325_0001/container_1576604798325_0001_01_000005/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/mnt1/yarn/usercache/livy/appcache/application_1576604798325_0001/container_1576604798325_0001_01_000005/pyspark.zip/pyspark/serializers.py", line 345, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/mnt1/yarn/usercache/livy/appcache/application_1576604798325_0001/container_1576604798325_0001_01_000005/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/mnt1/yarn/usercache/livy/appcache/application_1576604798325_0001/container_1576604798325_0001_01_000005/pyspark.zip/pyspark/serializers.py", line 334, in _batched
    for item in iterator:
  File "<string>", line 1, in <lambda>
  File "/mnt1/yarn/usercache/livy/appcache/application_1576604798325_0001/container_1576604798325_0001_01_000005/pyspark.zip/pyspark/worker.py", line 85, in <lambda>
    return lambda *a: f(*a)
  File "/mnt1/yarn/usercache/livy/appcache/application_1576604798325_0001/container_1576604798325_0001_01_000005/pyspark.zip/pyspark/util.py", line 113, in wrapper
    return f(*args, **kwargs)
  File "<stdin>", line 19, in <lambda>
  File "<stdin>", line 19, in <listcomp>
  File "/usr/local/lib/python3.6/site-packages/nltk/stem/wordnet.py", line 41, in lemmatize
    lemmas = wordnet._morphy(word, pos)
  File "/usr/local/lib/python3.6/site-packages/nltk/corpus/util.py", line 123, in __getattr__
    self.__load()
  File "/usr/local/lib/python3.6/site-packages/nltk/corpus/util.py", line 88, in __load
    raise e
  File "/usr/local/lib/python3.6/site-packages/nltk/corpus/util.py", line 83, in __load
    root = nltk.data.find('{}/{}'.format(self.subdir, self.__name))
  File "/usr/local/lib/python3.6/site-packages/nltk/data.py", line 701, in find
    raise LookupError(resource_not_found)
LookupError: 
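
The lookup in this traceback happens inside a PySpark worker process (the lambda from the notebook cell calls WordNetLemmatizer().lemmatize there), so the corpus has to be findable by every executor, not just by the driver. One hedged sketch of working around that, assuming the data has already been downloaded to the same directory on every node (for example by the bootstrap script) and that the directory is readable by the executor user, is to extend nltk.data.path inside the function that runs on the executors:

def lemmatize_tokens(tokens):
    # This runs inside each executor's Python process, so the search path
    # has to be extended here rather than once in the driver.
    import nltk
    from nltk.stem import WordNetLemmatizer
    nltk.data.path.append('/root/nltk_data')  # assumed to exist and be readable on every node
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in tokens]

# tokens_rdd is a hypothetical RDD of token lists.
lemmatized = tokens_rdd.map(lemmatize_tokens)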

Update:

Via the shell script I found the directory wordnet is being downloaded to, and confirmed its actual location by SSHing into the server:

[nltk_data] Downloading package wordnet to /root/nltk_data...

So in Jupyter I checked nltk.data.path:

['/var/lib/livy/nltk_data', '/tmp/1576616653412-0/nltk_data', '/tmp/1576616653412-0/share/nltk_data', '/tmp/1576616653412-0/lib/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']

and appended my new path:

nltk.data.path.append('/root/nltk_data')
nltk.data.path

We can see that it has been added:

['/var/lib/livy/nltk_data', '/tmp/1576616653412-0/nltk_data', '/tmp/1576616653412-0/share/nltk_data', '/tmp/1576616653412-0/lib/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data', '/root/nltk_data']

But when I try to call a function that uses this corpus, that path is still not searched:

  Resource wordnet not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('wordnet')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/wordnet.zip/wordnet/

  Searched in:
    - '/home/nltk_data'
    - '/mnt1/yarn/usercache/livy/appcache/application_1576615748346_0001/container_1576615748346_0001_01_000006/virtualenv_application_1576615748346_0001_0/nltk_data'
    - '/mnt1/yarn/usercache/livy/appcache/application_1576615748346_0001/container_1576615748346_0001_01_000006/virtualenv_application_1576615748346_0001_0/share/nltk_data'
    - '/mnt1/yarn/usercache/livy/appcache/application_1576615748346_0001/container_1576615748346_0001_01_000006/virtualenv_application_1576615748346_0001_0/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'

/root/nltk_data is not referenced here.
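
This is consistent with how Spark runs the job: the append above only modified nltk.data.path in the driver's Python process, while the failing lookup happens in the executors' Python processes, which build their own search list (the one printed in the error). A small diagnostic sketch, assuming an active SparkContext named sc, to see the path list the executors actually use:

def executor_nltk_paths(_):
    # Imported on the executor, so this reflects the worker-side search path.
    import nltk
    return nltk.data.path

print(sc.parallelize([0], 1).map(executor_nltk_paths).collect())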


1 Answer

I was unable to change the paths used to load wordnet (changing nltk.data.path did not change where nltk looked for the files).

So I had to change the directory the bootstrap script downloads to, so that it matches a location nltk searches by default.

Bootstrap script:

sudo pip install nltk
sudo python -m nltk.downloader -d /home/nltk_data wordnet
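
/home/nltk_data is the first entry in the "Searched in" list from the executor error, so downloading there makes wordnet visible to both the driver and the workers without touching nltk.data.path. A quick verification sketch (not part of the original answer, and again assuming an active SparkContext named sc):

def lemmatize_word(word):
    # Runs on an executor; wordnet is loaded lazily from /home/nltk_data.
    from nltk.stem import WordNetLemmatizer
    return WordNetLemmatizer().lemmatize(word)

print(sc.parallelize(['corpora', 'rocks']).map(lemmatize_word).collect())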
