Bootstrapping libraries on EMR with Python mrjob

Published 2024-10-01 09:35:03


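Problem statement:

I am trying to run a map-reduce job on Amazon EMR using the Python mrjob library, but I am having trouble bootstrapping the nodes with the required libraries and packages.

Details:

My sample Python mrjob code: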

    import re
    from mrjob.job import MRJob
    from sentClassifier import sentClassify
    import nltk

    .. do something ..

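Some libraries need to be imported, such as NLTK, along with some local modules of my own, for example from sentClassifier import sentClassify.

I would like to know the best way to bootstrap the EMR nodes so that these modules and packages are available. The code runs fine on my local machine.

My sample mrjob.conf file: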

^{pr2}$

But the job fails.

I have read through the following references and tried the approaches they describe, still without success:

Error log:

    Scanning SSH logs for probable cause of failure
    Probable cause of failure (from ssh://ec2-54-86-50-115.compute-1.amazonaws.com!172.31.19.60/mnt/var/log/hadoop/userlogs/job_201405030101_0006/attempt_201405030101_0006_m_000002_3/stderr):
    Traceback (most recent call last):
      File "obidroidMR.py", line 5, in <module>
        import nltk
    ImportError: No module named nltk
    (while reading from s3://mrjob-51b9493c1a467671/tmp/obidroidMR.shreyas.20140503.012933.336228/files/STDIN)
    Attempting to terminate job...
    Job appears to have already been terminated
    Killing our SSH tunnel (pid 12909)
    Traceback (most recent call last):
      File "obidroidMR.py", line 107, in <module>
        ObidroidReview.run()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/job.py", line 494, in run
        mr_job.execute()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
        super(MRJob, self).execute()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
        self.run_job()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
        runner.run()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run
        self._run()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/emr.py", line 809, in _run
        self._wait_for_job_to_complete()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/emr.py", line 1599, in _wait_for_job_to_complete
        raise Exception(msg)
    Exception: Job on job flow j-2R8G1Q3RIE9ED failed with status WAITING: Waiting after step failed
    Probable cause of failure (from ssh://ec2-54-86-50-115.compute-1.amazonaws.com!172.31.19.60/mnt/var/log/hadoop/userlogs/job_201405030101_0006/attempt_201405030101_0006_m_000002_3/stderr):
    Traceback (most recent call last):
      File "obidroidMR.py", line 5, in <module>
        import nltk
    ImportError: No module named nltk

Any help would be greatly appreciated.


2 answers

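In your mrjob.conf, the lines that install the packages may not be where they should be. Settings that should apply to jobs run on EMR belong under emr:, not under the runner section used for jobs run on a local Hadoop installation.

If the install is a simple Linux command, such as pip or apt-get, you should be able to install packages like this: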

    runners:
      emr:
        aws_access_key_id: ***
        ... all the other stuff ...
        bootstrap_cmds:
        - sudo apt-get install -y python-boto
        - sudo pip install simplejson
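I have never tried installing NLTK specifically, so I can't help you with the exact commands, but you should be able to install it along these lines.

For potentially more complicated installs, I suggest using the EMR CLI to ssh into the master node: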

^{pr2}$

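and try installing the package there. If you find a sequence of shell commands that installs the package successfully, you can simply copy and paste them into mrjob.conf.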
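Once a candidate set of install commands has run on the node, a quick import check can confirm that the modules the job needs are actually available before the commands are copied into mrjob.conf. The following is a minimal sketch, not part of mrjob: the helper name missing_modules is hypothetical, and the module list is taken from the imports in the question.

```python
import importlib


def missing_modules(names):
    """Return the subset of module names that cannot be imported here."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Same failure the job reported: "No module named ..."
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Modules the job in the question imports besides mrjob itself.
    required = ["nltk", "sentClassifier"]
    missing = missing_modules(required)
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All required modules import cleanly")
```

Running this with the same Python interpreter the job uses on the node gives a faster signal than launching a full job and waiting for the ImportError in the step logs.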

Assuming Amazon Elastic MapReduce uses an AMI based on Amazon Linux, I verified that I could install nltk on Amazon Linux AMI 2014.03.1 - ami-fb8e9292 (64-bit) with the following:

    sudo easy_install -U pip
    sudo easy_install -U distribute
    sudo pip install -U pyyaml nltk

You could try incorporating these three lines into your mrjob.conf.
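As a rough illustration of where those three commands would sit, the sketch below builds the runners / emr / bootstrap_cmds nesting used in the first answer and prints it as JSON. The function names are hypothetical. It assumes the bootstrap_cmds option shown earlier is the one your mrjob version accepts, and that your mrjob.conf loader accepts JSON text (mrjob documents YAML with JSON as a fallback); both are worth checking against the documentation for your version.

```python
import json

# The three commands reported to work on Amazon Linux in the answer above.
NLTK_BOOTSTRAP_CMDS = [
    "sudo easy_install -U pip",
    "sudo easy_install -U distribute",
    "sudo pip install -U pyyaml nltk",
]


def build_emr_conf(bootstrap_cmds):
    """Return an mrjob.conf-shaped dict with the commands under runners -> emr."""
    return {"runners": {"emr": {"bootstrap_cmds": list(bootstrap_cmds)}}}


def render_conf(conf):
    """Serialize the config dict as indented JSON text."""
    return json.dumps(conf, indent=2)


if __name__ == "__main__":
    # Credentials and any other emr settings are deliberately left out;
    # merge this fragment into an existing mrjob.conf by hand.
    print(render_conf(build_emr_conf(NLTK_BOOTSTRAP_CMDS)))
```

The point of the structure is only that the install commands live under the emr runner, in the order they were verified, so pip is upgraded before it is used to install nltk.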
