猪:用于搜索文本中关键词/字符串列表的Python UDF

2024-10-08 19:32:57 发布

您现在位置:Python中文网/ 问答频道 /正文

我有两个文件,一个包含关键字/字符串列表:

blue fox
the
lazy dog
orange
of
file

另一个,有文字:

^{pr2}$

我想获取第一个文件中的字符串列表,并从第二个文件中找到与第一个文件中的任何字符串匹配的行。所以我用Python UDF编写了一个Pig脚本:

register match.py using jython as match;
A = LOAD 'words.txt' AS (word:chararray);
B = LOAD 'text.txt' AS (line:chararray);
C = GROUP A ALL;
D = FOREACH B generate match.match(C.$1,line);
dump D;

#match.py
@outputSchema("str:chararray")
def match(wordlist,line):
    linestr = str(line)
    for word in wordlist:
            wordstr = str(word)
            if re.search(wordstr,linestr):
                    return line

错误结束:

"2014-04-01 06:22:34,775 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias D. Backend error : Error executing function"

Detailed Error log:

Backend error message
---------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error executing function
        at org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:120)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:434)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:340)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
        at o

Pig Stack Trace
---------------
ERROR 1066: Unable to open iterator for alias D. Backend error : Error executing function

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias D. Backend error : Error executing function
        at org.apache.pig.PigServer.openIterator(PigServer.java:828)
        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:696)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:320)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
        at org.apache.pig.Main.run(Main.java:538)
        at org.apache.pig.Main.main(Main.java:157)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error executing function
        at org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:120)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:434)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:340)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
================================================================================

Tags: orghadoopbackendapachematcherrorjavaat
1条回答
网友
1楼 · 发布于 2024-10-08 19:32:57

我怀疑我的CDH4.x集群中的jython无法使用“re”模块。我在python UDF上没有花太多时间。我通过编写一个javaudf解决了这个问题。请原谅我的Java,因为我是一个n00b,可能不是最高效或最漂亮的Java代码(我确信其中有一些错误):

package pigext;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.io.IOException;
import java.util.*;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;

public class matchList extends EvalFunc<String> {

  public String exec(Tuple input) throws IOException {
try {
        String line = (String)input.get(0);
        DataBag bag = (DataBag)input.get(1);
        Iterator it = bag.iterator();
        String output = "";
        while (it.hasNext()){
                Tuple t = (Tuple)it.next();
                if (t != null && t.size() > 0 && t.get(0) != null && line != null ) 
                        {
                          String cmd = t.get(0).toString();
                          if ( line.toLowerCase().matches(cmd.toLowerCase()) ) {
                                return (line + "," + cmd);
                                }                         
                        }
         }
        return output;
        } catch (Exception e) {
           throw new IOException("Failed to process row", e);
        }

} }

使用它的方法是用正则表达式填充一个文件,每行一个,你想搜索它,当然还有你的目标文本文件。所以一个regex文件“文字文本.txt“作为:

^{pr2}$

还有,你的文本文件,文本.txt,是:

this blah starts with blah
this    blah has way too many spaces
that won't match
thisblahshouldnotmatch
thisblah should not match either
the line here is this blah
line here has this blah in the middle
line here has this    blah with extra spaces
only has blah
only has this

猪的剧本是:

REGISTER pigext.jar;
A = LOAD 'wordstest.txt' AS (cmd:chararray);
B = LOAD 'text.txt' AS (line:chararray);
C = GROUP A ALL;
D = FOREACH B generate pigext.matchList(line,C.$1);
dump D;

相关问题 更多 >

    热门问题