Unable to read Hadoop sequence files via stdin using streaming Python MapReduce on AWS

Mapper: s3://com.gpanterov.scripts/mapper.py
Reducer: s3://com.gpanterov.scripts/reducer.py
Input S3 location: s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112
Output S3 location: s3://com.gpanterov.output/job3/

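#!/usr/bin/env python
import sys


def output(previous_key, total):
    # Print the total for a finished key; skipped before the first key is seen.
    if previous_key is not None:
        print(previous_key + " was found " + str(total) + " times")


previous_key = None
total = 0

# Streaming delivers the reducer input sorted by key, one "key<TAB>value" per line.
for line in sys.stdin:
    key, value = line.split("\t", 1)
    if key != previous_key:
        output(previous_key, total)
        previous_key = key
        total = 0
    total += int(value)

# Flush the last key.
output(previous_key, total)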

2 Answers

User

#1 · edited 2024-10-01 09:25:30

You need to pass SequenceFileAsTextInputFormat as the inputformat to the Hadoop streaming jar.

I have never used Amazon AWS MapReduce, but on a normal Hadoop installation it would be done like this:

HADOOP=$HADOOP_HOME/bin/hadoop
$HADOOP jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -input <input_directory> \
  -output <output_directory> \
  -mapper "mapper.py" \
  -reducer "reducer.py" \
  -inputformat SequenceFileAsTextInputFormat
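
With this input format, each sequence-file record reaches the mapper's stdin as text, with the key and value separated by a tab. The following is a minimal sketch of a mapper written against that assumption. It is not the asker's actual mapper.py: the word-count logic and the names `map_line` and `main` are illustrative. It also assumes that a value containing newlines may arrive as extra lines without a tab, and treats such lines as plain text.

```python
#!/usr/bin/env python
"""Hypothetical streaming mapper: emits "word<TAB>1" for each word in a record."""
import sys


def map_line(line):
    """Turn one stdin line into a list of (word, 1) pairs."""
    # Assumption: SequenceFileAsTextInputFormat delivers "key<TAB>value".
    parts = line.rstrip("\n").split("\t", 1)
    if len(parts) == 2:
        text = parts[1]  # drop the key, keep the value
    else:
        text = parts[0]  # no tab: treat the whole line as value text
    return [(word.lower(), 1) for word in text.split()]


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Write tab-separated pairs, which a summing reducer can consume.
    for line in stdin:
        for word, count in map_line(line):
            stdout.write("%s\t%d\n" % (word, count))


if __name__ == "__main__":
    main()
```

The output lines have the `key<TAB>integer` shape that the reducer in the question splits and sums, so the two scripts could be chained locally with a sort step in between to check the logic before submitting a job.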

User

#2 · edited 2024-10-01 09:25:30

Sunny Nanda's suggestion solved the problem. Adding -inputformat SequenceFileAsTextInputFormat to the extra arguments box in the AWS Elastic MapReduce API worked, and the job's output was as expected.
