How to implement group-by and sort in Python for a MapReduce job in Hadoop

Posted 2024-09-23 22:21:50

I have a dataset with 100,000 rows and 17 columns. I would like to know how to group and sort it using Python in a Hadoop MapReduce (streaming) job.

Here is my mapper.py:

#!/usr/bin/python

import sys


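for line in sys.stdin:
    data = line.strip().split(",")
    # Only rows with exactly 17 fields are unpacked and emitted.
    if len(data) == 17:
        (VendorID, tpep_pickup_datetime, tpep_dropoff_datetime,
         passenger_count, trip_distance, RatecodeID, store_and_fwd_flag,
         PULocationID, DOLocationID, payment_type, fare_amount, extra,
         mta_tax, tip_amount, tolls_amount, improvement_surcharge,
         total_amount) = data
        # Emit inside the if-block so malformed rows are never printed,
        # and join with bare tabs so no spaces surround the separators.
        print("\t".join([passenger_count, PULocationID, DOLocationID]))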

My reducer.py:

^{pr2}$

To run the MapReduce job in Hadoop, I used the following command:

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input /user/maria_dev/trim_table.csv -output /user/maria_dev/joboutput1 -file /root/mapper.py /root/reducer.py

The job output looks like this:

1 161 142
1 170 162
1 233 248
1 68 230
1 237 237

and so on, with some duplicate rows and no sorting.

I hope someone can help. The output I want is:

count|PULocationID|DOLocationID

and I want it sorted by count.
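For reference, here is a minimal sketch of what a grouping and sorting reducer might look like. It rests on several assumptions that the question does not confirm: that "count" means the number of records per (PULocationID, DOLocationID) pair, that the job runs with a single reducer so one process sees every record, and that the distinct pairs fit in memory. The function names `count_pairs` and `sorted_rows` and the `sample` rows are illustrative only; in a real reducer.py the sample list would be replaced by `sys.stdin`.

```python
#!/usr/bin/python
from collections import Counter


def count_pairs(lines):
    """Group tab-separated mapper rows by (PULocationID, DOLocationID).

    Each row is expected to look like: passenger_count<TAB>PU<TAB>DO.
    Assumption: "count" is the number of rows seen for each pair.
    """
    counts = Counter()
    for line in lines:
        # Strip each field so padding around the tabs is tolerated.
        fields = [f.strip() for f in line.split("\t")]
        if len(fields) != 3:
            continue  # ignore malformed rows
        _passenger_count, pu, do = fields
        counts[(pu, do)] += 1
    return counts


def sorted_rows(counts):
    """Format rows as count|PU|DO, highest count first.

    Ties are broken by the (PU, DO) strings so the order is deterministic.
    """
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ["%d|%s|%s" % (n, pu, do) for (pu, do), n in ordered]


# Illustrative rows only; a real reducer would iterate over sys.stdin.
sample = ["1\t161\t142", "1\t170\t162", "1\t161\t142", "1\t237\t237"]
for row in sorted_rows(count_pairs(sample)):
    print(row)
```

If the intended count is instead the total number of passengers per pair, the increment could add `int(_passenger_count)` rather than 1. With more than one reducer, each reducer would only sort its own share of the data, so a global ordering would need a single reducer or a second sorting pass.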

