I have a dataset with 100,000 rows and 17 columns. I would like to know how to group and sort it using Python in a Hadoop MapReduce job.
This is my mapper.py:
#!/usr/bin/python
import sys

# Read CSV rows from stdin and emit three tab-separated columns.
for line in sys.stdin:
    data = line.strip().split(",")
    if len(data) == 17:
        VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount = data
        # Format one string so that no stray spaces surround the tabs.
        print("%s\t%s\t%s" % (passenger_count, PULocationID, DOLocationID))
My reducer.py
To run the MapReduce job in Hadoop, I used the following command:
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input /user/maria_dev/trim_table.csv -output /user/maria_dev/joboutput1 -file /root/mapper.py /root/reducer.py
The output of my job:
1 161 142
1 170 162
1 233 248
1 68 230
1 237 237
and so on, with some duplicates and in no particular order.
I hope someone can help me.
The output I want is:
count|PULocationID|DOLocationID
I want the result sorted by count.
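One possible direction, sketched under assumptions rather than offered as a definitive solution: if "count" means the number of trips for each (PULocationID, DOLocationID) pair, a reducer could tally the pairs and print them in the count|PULocationID|DOLocationID layout, highest count first. The function names count_pairs and format_sorted are hypothetical helpers introduced only for this illustration. The sketch also assumes that all mapper output reaches a single reducer and that the distinct pairs fit in memory.

```python
#!/usr/bin/env python
"""reducer.py sketch: count trips per (PULocationID, DOLocationID), sorted by count."""
import sys
from collections import Counter


def count_pairs(lines):
    """Count occurrences of each (PULocationID, DOLocationID) pair.

    Each line is expected to look like the mapper output:
    passenger_count<TAB>PULocationID<TAB>DOLocationID
    (assumption: "count" means number of trips, not passenger_count).
    """
    counts = Counter()
    for line in lines:
        # Strip each field so stray spaces around the tabs are tolerated.
        fields = [f.strip() for f in line.strip().split("\t")]
        if len(fields) != 3:
            continue  # skip malformed lines
        _, pickup, dropoff = fields
        counts[(pickup, dropoff)] += 1
    return counts


def format_sorted(counts):
    """Return 'count|PULocationID|DOLocationID' rows, highest count first."""
    # Ties are broken by the (pickup, dropoff) pair to keep output stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ["%d|%s|%s" % (n, pu, do) for (pu, do), n in ordered]


if __name__ == "__main__":
    for row in format_sorted(count_pairs(sys.stdin)):
        print(row)
```

Because the sort happens in memory inside one process, the job would need to run with a single reducer (for example by setting the mapreduce.job.reduces property to 1) for the ordering to be global. With several reducers, each one would only sort its own share of the data.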