Error when using Elephant Bird input formats with Hadoop Streaming


Posted 2024-10-03 23:24:07



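I'm trying to use an input format from Elephant Bird in my Hadoop Streaming script. Specifically, I want to use LzoInputFormat and eventually LzoJsonInputFormat (I'm working with Twitter data here). But whenever I try this, I get an error suggesting that the Elephant Bird formats are not valid instances of the InputFormat class.

This is how I run the streaming command: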

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u5.jar \
    -libjars /project/hanna/src/elephant-bird/build/elephant-bird-2.2.0.jar \
    -D stream.map.output.field.separator=\t \
    -D stream.num.map.output.key.fields=2 \
    -D map.output.key.field.separator=\t \
    -D mapred.text.key.partitioner.options=-k1,2 \
    -file /home/a/ahanna/sandbox/hadoop-textual-analysis/streaming/filter/filterMap.py \
    -file /home/a/ahanna/sandbox/hadoop-textual-analysis/streaming/filter/filterReduce.py \
    -file /home/a/ahanna/sandbox/hadoop-textual-analysis/streaming/data/latinKeywords.txt \
    -inputformat com.twitter.elephantbird.mapreduce.input.LzoTextInputFormat \
    -input /user/ahanna/lzotest \
    -output /user/ahanna/output \
    -mapper filterMap.py \
    -reducer filterReduce.py \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

The error I get is:

[error output missing from the source page]


2 answers

User

#1 · Edited 2024-10-03 23:24:07

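For compatibility, Hadoop supports two ways of writing map/reduce jobs in Java: the "old" API, through the interfaces in the org.apache.hadoop.mapred package, and the "new" API, through the abstract classes in the org.apache.hadoop.mapreduce package.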

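You need to know this even when you use the streaming API, because streaming itself is written against the old API. So when you want to change some internals of the streaming mechanism with an external library, you should make sure that the library is written against the old API as well.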
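As a rough illustration of the distinction, the sketch below guesses which API a class targets from its package name alone. The helper name api_generation is hypothetical, and the rule is only a naming heuristic (it assumes a library mirrors Hadoop's mapred/mapreduce package names, as the Elephant Bird classes mentioned in this thread do). It does not inspect the class itself.

```shell
#!/bin/sh
# Hypothetical helper: guess the Hadoop API generation of a class from its
# fully qualified name. This is a naming heuristic only, not a real check.
api_generation() {
    case "$1" in
        *.mapreduce.*) echo "new" ;;      # org.apache.hadoop.mapreduce style
        *.mapred.*)    echo "old" ;;      # org.apache.hadoop.mapred style
        *)             echo "unknown" ;;  # package name gives no hint
    esac
}

# Class names taken from this thread.
api_generation com.twitter.elephantbird.mapreduce.input.LzoTextInputFormat
api_generation com.twitter.elephantbird.mapred.input.DeprecatedLzoTextInputFormat
api_generation org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
```

Run as written, the three calls print new, old and old, which matches the point above: the partitioner in the question is old-API, while the input format is not.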

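That is exactly your situation. In the general case you would need to write a wrapper, but fortunately Elephant Bird provides an old-style InputFormat, so you only need to replace com.twitter.elephantbird.mapreduce.input.LzoTextInputFormat with com.twitter.elephantbird.mapred.input.DeprecatedLzoTextInputFormat.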
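Applied to the command from the question, the change might look like the sketch below. Only the -inputformat class differs; every path and option is copied unchanged from the question. The submit_job wrapper is a hypothetical helper added here so the command can be printed for review before it is submitted.

```shell
#!/bin/sh
# Old-API input format suggested in the answer above.
INPUT_FORMAT=com.twitter.elephantbird.mapred.input.DeprecatedLzoTextInputFormat

# Hypothetical helper for this sketch: any arguments are placed in front of
# the hadoop command, so `submit_job echo` only prints it, while a plain
# `submit_job` would actually run it.
submit_job() {
    "$@" hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u5.jar \
        -libjars /project/hanna/src/elephant-bird/build/elephant-bird-2.2.0.jar \
        -D stream.map.output.field.separator=\t \
        -D stream.num.map.output.key.fields=2 \
        -D map.output.key.field.separator=\t \
        -D mapred.text.key.partitioner.options=-k1,2 \
        -file /home/a/ahanna/sandbox/hadoop-textual-analysis/streaming/filter/filterMap.py \
        -file /home/a/ahanna/sandbox/hadoop-textual-analysis/streaming/filter/filterReduce.py \
        -file /home/a/ahanna/sandbox/hadoop-textual-analysis/streaming/data/latinKeywords.txt \
        -inputformat "$INPUT_FORMAT" \
        -input /user/ahanna/lzotest \
        -output /user/ahanna/output \
        -mapper filterMap.py \
        -reducer filterReduce.py \
        -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
}

# Dry run: print the command instead of submitting the job.
submit_job echo
```

Once the printed command looks right, calling submit_job without arguments submits the job on a machine where hadoop is on the PATH.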

User

#2 · Edited 2024-10-03 23:24:07

On Hadoop 2.4, I managed to get it running with the following option:

-D org.apache.hadoop.mapreduce.lib.input.FileInputFormat=your.package.path.FileInputFormat

instead of the standard -inputformat option.
