我有两个文件,一个包含关键字/字符串列表:
blue fox
the
lazy dog
orange
of
file
另一个,有文字:
^{pr2}$我想获取第一个文件中的字符串列表,并从第二个文件中找到与第一个文件中的任何字符串匹配的行。所以我用Python UDF编写了一个Pig脚本:
register match.py using jython as match;
A = LOAD 'words.txt' AS (word:chararray);
B = LOAD 'text.txt' AS (line:chararray);
C = GROUP A ALL;
D = FOREACH B generate match.match(C.$1,line);
dump D;
#match.py
@outputSchema("str:chararray")
def match(wordlist,line):
linestr = str(line)
for word in wordlist:
wordstr = str(word)
if re.search(wordstr,linestr):
return line
错误结束:
"2014-04-01 06:22:34,775 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias D. Backend error : Error executing function"
Detailed Error log:
Backend error message
---------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error executing function
at org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:120)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:434)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:340)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
at o
Pig Stack Trace
---------------
ERROR 1066: Unable to open iterator for alias D. Backend error : Error executing function
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias D. Backend error : Error executing function
at org.apache.pig.PigServer.openIterator(PigServer.java:828)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:696)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:320)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:538)
at org.apache.pig.Main.main(Main.java:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error executing function
at org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:120)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:434)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:340)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
================================================================================
我怀疑我的CDH4.x集群中的jython无法使用“re”模块。我在python UDF上没有花太多时间。我通过编写一个javaudf解决了这个问题。请原谅我的Java,因为我是一个n00b,可能不是最高效或最漂亮的Java代码(我确信其中有一些错误):
使用它的方法是用正则表达式填充一个文件,每行一个,你想搜索它,当然还有你的目标文本文件。所以一个regex文件“文字文本.txt“作为:
^{pr2}$还有,你的文本文件,文本.txt,是:
猪的剧本是:
相关问题 更多 >
编程相关推荐