I am using the boto library to create a job flow in Amazon's Elastic MapReduce web service (EMR). The following code should create a step:
from boto.emr.step import JarStep

# run_id is defined elsewhere in the script; the 'similiar' spelling matches the S3 paths in the logs
step2 = JarStep(name='Find similiar items',
                jar='s3n://recommendertest/mahout-core/mahout-core-0.5-SNAPSHOT.jar',
                main_class='org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob',
                step_args=['s3n://bucket/output/' + run_id + '/aggregate_watched/',
                           's3n://bucket/output/' + run_id + '/similiar_items/',
                           'SIMILARITY_PEARSON_CORRELATION'
                           ])
Running the job flow always fails with an error.
This is the line from the EMR logs that invokes the Java code:
2011-01-24T22:18:54.491Z INFO Executing /usr/lib/jvm/java-6-sun/bin/java \
-cp /home/hadoop/conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/home/hadoop:/home/hadoop \
/hadoop-0.18-core.jar:/home/hadoop/hadoop-0.18-tools.jar:/home/hadoop/lib/*:/home/hadoop/lib/jetty-ext/* \
-Xmx1000m \
-Dhadoop.log.dir=/mnt/var/log/hadoop/steps/3 \
-Dhadoop.log.file=syslog \
-Dhadoop.home.dir=/home/hadoop \
-Dhadoop.id.str=hadoop \
-Dhadoop.root.logger=INFO,DRFA \
-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/3/tmp \
-Djava.library.path=/home/hadoop/lib/native/Linux-i386-32 \
org.apache.hadoop.mapred.JobShell \
/mnt/var/lib/hadoop/steps/3/mahout-core-0.5-SNAPSHOT.jar \
org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob \
s3n://..../output/job_2011-01-24_23:09:29/aggregate_watched/ \
s3n://..../output/job_2011-01-24_23:09:29/similiar_items/ \
SIMILARITY_PEARSON_CORRELATION
What is wrong with the arguments? The Java class definition can be found here:
I found a solution to the problem:
Here is how to call the job_flow function so that it runs with Mahout:
jobid = emr_conn.run_jobflow(name = name, log_uri = 's3n://'+ main_bucket_name +'/emr-logging/', enable_debugging=1, hadoop_version='0.20', steps=[step1,step2])
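Putting the pieces together, a minimal sketch of the whole call might look like the following. It assumes boto 2.x is installed and that AWS credentials are available through the usual boto configuration. The helper names (similarity_step_args, run_similarity_jobflow) and their parameters are hypothetical and only for illustration; the step name, main class, and hadoop_version='0.20' are taken from the post. The boto imports are deferred into the function that talks to AWS, so the argument-building helper can be checked without a connection.

```python
def similarity_step_args(bucket, run_id, measure='SIMILARITY_PEARSON_CORRELATION'):
    """Build the three positional arguments used by the step in the question."""
    base = 's3n://%s/output/%s' % (bucket, run_id)
    return [base + '/aggregate_watched/',
            base + '/similiar_items/',  # spelling kept from the question's paths
            measure]


def run_similarity_jobflow(name, bucket, run_id, jar_uri, extra_steps=()):
    """Hypothetical wrapper: start a job flow that ends with the Mahout step."""
    # Deferred imports: boto 2.x is assumed to be installed.
    from boto.emr.connection import EmrConnection
    from boto.emr.step import JarStep

    step = JarStep(name='Find similiar items',
                   jar=jar_uri,
                   main_class='org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob',
                   step_args=similarity_step_args(bucket, run_id))

    emr_conn = EmrConnection()  # credentials come from the boto config or environment
    return emr_conn.run_jobflow(name=name,
                                log_uri='s3n://' + bucket + '/emr-logging/',
                                enable_debugging=True,
                                hadoop_version='0.20',  # version used in the answer
                                steps=list(extra_steps) + [step])
```

Calling run_similarity_jobflow starts a real (billable) job flow, so it is worth checking the generated arguments with similarity_step_args first.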
The boto fix described in step 2 above (i.e., using the unversioned hadoop-streaming.jar file) has been merged into the GitHub master branch:
https://github.com/boto/boto/commit/a4e8e065473b5ff9af554ceb91391f286ac5cac7
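Some references for doing this from boto:
Obviously you need to upload mahout-core-0.6-job.jar to an accessible S3 location. The input and output locations must be accessible as well.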
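As a rough sketch of the upload, assuming boto 2.x and configured credentials, the jar could be pushed to S3 like this. The function names, the bucket name, and the key name are hypothetical placeholders, not values from the post.

```python
def jar_s3_uri(bucket_name, key_name):
    """Return the s3n:// URI that a JarStep's jar argument would point at."""
    return 's3n://%s/%s' % (bucket_name, key_name)


def upload_job_jar(local_path, bucket_name, key_name):
    """Hypothetical helper: upload a local jar to S3 and return its s3n:// URI."""
    import boto  # deferred import; boto 2.x is assumed to be installed

    conn = boto.connect_s3()                 # credentials from boto config or environment
    bucket = conn.get_bucket(bucket_name)    # the bucket must already exist
    key = bucket.new_key(key_name)
    key.set_contents_from_filename(local_path)
    return jar_s3_uri(bucket_name, key_name)
```

The returned URI can then be passed as the jar argument of the step, for example upload_job_jar('mahout-core-0.6-job.jar', 'my-bucket', 'mahout/mahout-core-0.6-job.jar').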