How to compute the total sales price for each specific row in a PySpark RDD

Published 2024-05-21 15:39:17


I have a dataset like this:

1|goldenrod lavender spring chocolate lace|Manufacturer#1|Brand#13|PROMO BURNISHED COPPER|7|JUMBO PKG|901.00|ly. slyly ironi|

2|blush thistle blue yellow saddle|Manufacturer#1|Brand#13|LARGE BRUSHED BRASS|1|LG CASE|902.00|lar accounts amo|

3|spring green yellow purple cornsilk|Manufacturer#4|Brand#42|STANDARD POLISHED BRASS|21|WRAP CASE|903.00|egular deposits hag|

4|cornflower chocolate smoke green pink|Manufacturer#3|Brand#34|SMALL PLATED BRASS|14|MED DRUM|904.00|p furiously r|

I want to compute the total sales price per brand. For example, Brand#13: 901.00 + 902.00 = 1803.00.

Here is my code:

from operator import add
import operator
from pyspark.sql import SQLContext
from pyspark.sql import Window
import pyspark.sql.functions
from pyspark import SparkContext, SparkConf
import pyspark

conf = SparkConf().setAppName("part").setMaster("local[*]")
sc = SparkContext(conf = conf)

def Func(lines):
    lines = lines.split("|")
    return lines[2], lines[3]

def Funcc(lines):
    lines = lines.split("|")
    return lines[3], lines[7]



text = sc.textFile("part.tbl")
text1 = text.map(Func)
text2 = text.map(Funcc)

sort1 = text1.distinct().sortBy(lambda x:x[0], ascending=True).sortBy(lambda y:y[1], ascending = True)
sort2 = text2.sortBy(lambda x:x[0], ascending=True)

original_text = sort1.collect()
count_by_key = sort2.countByKey()
summe = sort2.reduceByKey(add).collect()


print("Manufacturer and Brands:")
for line in original_text:
    print(line)

print("Number of Items of each Brand")
print(count_by_key)
print(summe)

I am not allowed to use DataFrames. I tried:

summe = sort2.collect()
summe1 = sum(summe[1])

But the code doesn't work. The error is:

    summe1 = sum(summe[1])
    TypeError: unsupported operand type(s) for +: 'int' and 'str'
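The TypeError comes from the parsing step: `Funcc` returns the price field as a string, so `reduceByKey(add)` concatenates strings rather than adding numbers, and `sum()` over mixed types fails. A minimal sketch of the fix (hypothetical function name `parse_brand_price`; in Spark it would plug into `sc.textFile("part.tbl").map(parse_brand_price).reduceByKey(add)`) is to cast the price to `float` during parsing. The per-brand sum is demonstrated here on the two sample Brand#13 rows with plain Python so it runs without a SparkContext:

```python
from operator import add  # add(a, b) == a + b; usable with reduceByKey

def parse_brand_price(line):
    # fields[3] is the brand, fields[7] the sales price.
    # Casting the price to float is the fix: without it,
    # reduceByKey(add) concatenates "901.00" + "902.00".
    fields = line.split("|")
    return fields[3], float(fields[7])

# Sample rows from the dataset above:
rows = [
    "1|goldenrod lavender spring chocolate lace|Manufacturer#1|Brand#13|PROMO BURNISHED COPPER|7|JUMBO PKG|901.00|ly. slyly ironi|",
    "2|blush thistle blue yellow saddle|Manufacturer#1|Brand#13|LARGE BRUSHED BRASS|1|LG CASE|902.00|lar accounts amo|",
]

# Equivalent of .map(parse_brand_price).reduceByKey(add), done locally:
totals = {}
for brand, price in map(parse_brand_price, rows):
    totals[brand] = totals.get(brand, 0.0) + price

print(totals)  # {'Brand#13': 1803.0}
```

With the float cast in place, the original `sort2.reduceByKey(add).collect()` line returns numeric totals per brand directly, and the failing `sum(summe[1])` workaround is no longer needed.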


Tags: lambda, text, from, import, sql, conf, pyspark, lines