I have a dataset like this:
1|goldenrod lavender spring chocolate lace|Manufacturer#1|Brand#13|PROMO BURNISHED COPPER|7|JUMBO PKG|901.00|ly. slyly ironi|
2|blush thistle blue yellow saddle|Manufacturer#1|Brand#13|LARGE BRUSHED BRASS|1|LG CASE|902.00|lar accounts amo|
3|spring green yellow purple cornsilk|Manufacturer#4|Brand#42|STANDARD POLISHED BRASS|21|WRAP CASE|903.00|egular deposits hag|
4|cornflower chocolate smoke green pink|Manufacturer#3|Brand#34|SMALL PLATED BRASS|14|MED DRUM|904.00|p furiously r|
I want to calculate the total sales price for each brand, for example Brand#13 (901.00 + 913.00 = 1814.00).
Here is my code:
from operator import add
import operator
from pyspark.sql import SQLContext
from pyspark.sql import Window
import pyspark.sql.functions
from pyspark import SparkContext, SparkConf
import pyspark
conf = SparkConf().setAppName("part").setMaster("local[*]")
sc = SparkContext(conf = conf)
def Func(lines):
    lines = lines.split("|")
    return lines[2],lines[3]
def Funcc(lines):
    lines = lines.split("|")
    return lines[3],lines[7]
text = sc.textFile("part.tbl")
text1 = text.map(Func)
text2 = text.map(Funcc)
sort1 = text1.distinct().sortBy(lambda x:x[0], ascending=True).sortBy(lambda y:y[1], ascending = True)
sort2 = text2.sortBy(lambda x:x[0], ascending=True)
original_text = sort1.collect()
count_by_key = sort2.countByKey()
summe = sort2.reduceByKey(add).collect()
print("Manufacturer and Brands:")
for line in original_text:
    print(line)
print("Number of Items of each Brand")
print(count_by_key)
print(summe)
I am not allowed to use DataFrames. I tried this:
summe = sort2.collect()
summe1 = sum(summe[1])
But the code does not work. The error is:
summe1 = sum(summe[1])
TypeError: unsupported operand type(s) for +: 'int' and 'str'
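The error can be reproduced without Spark, which may help show where it comes from. The sketch below assumes that sort2.collect() returns a list of (brand, price) tuples whose values are still strings, since they come straight from split("|"); the three sample tuples are a stand-in for that result, not real output.

```python
# Stand-in for what sort2.collect() returns: a list of (brand, price) tuples,
# with every value still a string because it came from str.split.
summe = [("Brand#13", "901.00"), ("Brand#13", "902.00"), ("Brand#34", "904.00")]

# summe[1] is one whole tuple (the second row), not the price column.
second_row = summe[1]

try:
    # sum() starts from the int 0 and then tries 0 + "Brand#13".
    sum(second_row)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Picking the price out of every row and converting it to float first works.
total = sum(float(price) for _, price in summe)
print(total)
```

So there are two separate issues: the index selects a row instead of the prices, and the prices are strings rather than numbers.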
I have the answer now: you can use the simple function reduceByKey.
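A minimal sketch of that approach, assuming the same "|"-delimited layout as the sample rows (brand in field 3, price in field 7). The helper names parse_brand_price and total_price_per_brand are made up for illustration and are not part of the original code. The detail that matters is converting the price to float before reducing, because add applied to two strings would concatenate them instead of summing.

```python
from operator import add


def parse_brand_price(line):
    """Split one '|'-delimited row and return (brand, price as float)."""
    fields = line.split("|")
    # Field 3 is the brand and field 7 the price. float() is the key step:
    # without it, reduceByKey(add) would join the price strings together.
    return fields[3], float(fields[7])


def total_price_per_brand(lines_rdd):
    """Sum the prices per brand; lines_rdd is an RDD of raw text lines."""
    return lines_rdd.map(parse_brand_price).reduceByKey(add)


# Usage with the SparkContext `sc` created in the question:
# summe = total_price_per_brand(sc.textFile("part.tbl")).collect()
# for brand, total in sorted(summe):
#     print(brand, total)
```

This stays on plain RDDs, so it does not need DataFrames. Sorting before reduceByKey is not required for the sum; sorting the collected result, as in the commented usage, is enough for display.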