Importing a text file into Pig via a Python UDF

from __future__ import with_statement

def get_words(dir):
    stopwords = set()
    with open(dir) as f1:
        for line1 in f1:
            stopwords.update([line1.decode('ascii', 'ignore').split("\n")[0]])
    return stopwords

stopwords = get_words("/home/zhge/uwc/mappings/english_stop.txt")

@outputSchema("findit: int")
def findit(stp):
    stp = str(stp)
    if stp in stopwords:
        return 1
    else:
        return 0

def get_wordlists(wordbag):
    stopwords = set()
    for t in wordbag:
        stopwords.update(t.decode('ascii', 'ignore'))
    return stopwords

@outputSchema("findit: int")
def findit(stopwordbag, stp):
    stopwords = get_wordlists(stopwordbag)
    stp = str(stp)
    if stp in stopwords:
        return 1
    else:
        return 0

REGISTER '/home/zhge/uwc/scripts/myudf2.py' USING jython as pyudf;
stops = load '/user/zhge/uwc/mappings/stopwords.txt' AS (stop_w:chararray);
-- this step works fine and I can see the "stops" object is loaded to pig
item_title = load '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',') AS (title:chararray);
T = limit item_title 1;
S = FOREACH T GENERATE pyudf.findit(stops.stop_w, title);
DUMP S;

1 Answer

User

#1 · Posted 2024-10-04 05:33:57

Your second example should work. However, you applied LIMIT to the wrong relation; it should be applied to the stops relation. So it should be:

stops = LOAD '/user/zhge/uwc/mappings/stopwords.txt' AS (stop_w:chararray);

item_title = LOAD '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',') AS (title:chararray);
T = LIMIT stops 1;
S = FOREACH item_title GENERATE pyudf.findit(T.stop_w, title);

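However, since it looks like you need to process all of the stop words first, this is not enough. You need to do a GROUP ALL on stops and then pass the resulting bag to your get_wordlists function.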


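You will have to update your UDF to accept the whole bag (in Jython, a list of tuples rather than a single value) for this method to work.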
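A minimal sketch of what the updated UDF might look like, assuming Pig hands the grouped bag to Jython as a list of one-field tuples (Pig's documented bag-to-list mapping). The fallback `outputSchema` definition is an assumption added only so the file can be imported outside Pig. The `str()` and `decode()` calls from the question are dropped here on the assumption that both fields are chararray.

```python
# Sketch of an updated myudf2.py. Only findit and get_wordlists come from
# the question; everything else is illustrative.

try:
    outputSchema  # assumed to be provided by Pig when registered USING jython
except NameError:
    # Fallback (assumption): a no-op decorator so the module loads outside Pig.
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator


def get_wordlists(wordbag):
    """Collect stop words from a bag, assumed to be a list of 1-field tuples."""
    stopwords = set()
    for t in wordbag or []:
        # Unwrap the single field of each tuple; tolerate bare strings too.
        word = t[0] if isinstance(t, (tuple, list)) else t
        if word is not None:
            # add() stores the whole word; update() on a string would
            # insert its individual characters instead.
            stopwords.add(word.strip())
    return stopwords


@outputSchema("findit: int")
def findit(stopwordbag, stp):
    """Return 1 if stp is one of the stop words in stopwordbag, else 0."""
    stopwords = get_wordlists(stopwordbag)
    if stp in stopwords:
        return 1
    return 0
```

On the Pig side, one plausible call (not confirmed by the answer) is `G = GROUP stops ALL;` followed by `S = FOREACH item_title GENERATE pyudf.findit(G.stops, title);`. Note that this sketch rebuilds the set for every title, which is fine for a small stop-word list but wasteful for a large one.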
