I have a Hive table that tracks the status of objects as they move through the stages of a process. The table looks like this:
hive> desc journeys;
object_id string
journey_statuses array<string>
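Here is an example of a typical record:
^{pr2}$
The records in the table were generated with Hive 0.13's collect_list, so the statuses are ordered (if order didn't matter, I would have used collect_set). For each object_id, I want to abbreviate the journey so that runs of repeated statuses collapse and each status is returned in the order it appears.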
I wrote a quick Python script that reads from stdin:
#!/usr/bin/env python
import ast
import sys

for line in sys.stdin:
    # Parse the list literal safely; eval() would execute arbitrary input.
    inputList = ast.literal_eval(line.strip())
    # Keep the first status, then every status that differs from its predecessor.
    result = inputList[:1]
    for a, b in zip(inputList, inputList[1:]):
        if a != b:
            result.append(b)
    print(result)
I plan to use this in a Hive transform call. It seems to work when run locally:
$ echo '["A","A","A","B","B","B","C","C","C","C","D"]' | python abbreviateList.py
['A', 'B', 'C', 'D']
However, when I add the file and try to run it in Hive, I get an error:
hive> add file abbreviateList.py;
Added resource: abbreviateList.py
hive> select
> object_id,
> transform(journey_statuses) using 'python abbreviateList.py' as journey_statuses_abbreviated
> from journeys;
NoViableAltException( ... wall of Java error messages ... )
FAILED: ParseException line 3:2 cannot recognize input near 'transform' '(' 'journey_statuses' in select expression
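Can you see what I'm doing wrong?
Apparently, you cannot select other fields that are not part of the transform (in your example, object_id). Another question seems to address this indirectly: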
How can select a column and do a TRANSFORM in Hive?
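In theory, you could modify the Python script to accept object_id as an additional input field and pass it through as another output field (if you need it in the output).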
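As a rough illustration of that pass-through idea, here is a minimal sketch. It rests on several assumptions that are not confirmed by the question: that Hive hands the script tab-separated fields, that the array arrives as a JSON-style literal like the one in the local echo test, and that the query in the docstring (an untested rewrite of the original with object_id moved inside TRANSFORM) fits your setup. The function names abbreviate and transform_line are made up for this example.

```python
#!/usr/bin/env python
"""Sketch of a pass-through variant of abbreviateList.py.

Assumed (untested) Hive call, with object_id moved inside TRANSFORM:

    SELECT TRANSFORM(object_id, journey_statuses)
    USING 'python abbreviateList.py'
    AS (object_id, journey_statuses_abbreviated)
    FROM journeys;
"""
import json
import sys
from itertools import groupby


def abbreviate(statuses):
    """Collapse runs of consecutive duplicates, keeping order of appearance."""
    return [status for status, _ in groupby(statuses)]


def transform_line(line):
    """Turn 'object_id<TAB>array' into 'object_id<TAB>abbreviated array'.

    Assumption: the array arrives as a JSON-style literal such as
    ["A","A","B"], matching the local echo test. Check how your Hive
    version actually serializes array<string> before relying on this.
    """
    object_id, raw_statuses = line.rstrip("\n").split("\t", 1)
    statuses = json.loads(raw_statuses)
    return object_id + "\t" + json.dumps(abbreviate(statuses))


if __name__ == "__main__":
    for line in sys.stdin:
        if line.strip():  # skip blank lines
            print(transform_line(line))
```

Because object_id is simply echoed back as the first output field, it shows up next to the abbreviated journey without needing a second column in the outer SELECT. The abbreviated list comes back as a plain string column unless you parse it again on the Hive side.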