I have a PySpark dataframe with 3 columns: Violation_Location, Violation_Code, and Ticket_Frequency. However, both the Violation_Code and Violation_Location columns contain many categories (more than 100 each).
I want to get the top 10 violation locations and the top 10 violation codes based on ticket frequency.
Precint = spark.sql("""
    SELECT Violation_Location, Violation_Code, COUNT(*) AS Ticket_Frequency
    FROM table_view2
    GROUP BY Violation_Location, Violation_Code
    ORDER BY Ticket_Frequency DESC
""")
Precint.show()
+------------------+--------------+----------------+
|Violation_Location|Violation_Code|Ticket_Frequency|
+------------------+--------------+----------------+
|              null|            36|         1098296|
|              null|             7|          471754|
|              null|             5|          248774|
|                18|            14|          132123|
|               114|            21|           84051|
|                14|            14|           83664|
|                19|            46|           82640|
|                14|            69|           69006|
+------------------+--------------+----------------+
So far I have only been able to get the top ten violation locations based on ticket frequency. Any help is appreciated, thanks.
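One way to get a top-10 list for each column separately is to sum the ticket counts per location and per code, then take the ten largest totals for each. Below is a minimal pandas sketch of that idea, using a few made-up sample rows shaped like the Precint output above (the values and row selection are illustrative, not the full dataset). In PySpark the same idea would be a groupBy on each column with a sum aggregate, an orderBy descending, and limit(10).

```python
import pandas as pd

# Hypothetical sample mirroring the Precint result: one row per
# (location, code) pair with its ticket count.
df = pd.DataFrame({
    'Violation_Location': ['18', '114', '14', '19', '14', None],
    'Violation_Code':     [14, 21, 14, 46, 69, 36],
    'Ticket_Frequency':   [132123, 84051, 83664, 82640, 69006, 1098296],
})

# Drop rows with a missing location, as in the plotting code below.
clean = df.dropna(subset=['Violation_Location'])

# Total tickets per location and per code, each ranked independently.
top_locations = (clean.groupby('Violation_Location')['Ticket_Frequency']
                      .sum().nlargest(10))
top_codes = (clean.groupby('Violation_Code')['Ticket_Frequency']
                  .sum().nlargest(10))
```

Ranking each column separately matters here: the combined (location, code) query surfaces the most frequent pairs, so a location that racks up tickets across many codes (like 14 above) can outrank one with a single large pair.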
import matplotlib.pyplot as plt

# plot the top 10 violation locations (precincts) by ticket frequency
precintplot = Precint.toPandas()
# remove rows with a missing Violation_Location first
precintplotnomiss = precintplot.dropna(subset=['Violation_Location'])
# DataFrame.plot creates its own figure, so pass figsize here
# instead of calling plt.figure() separately
precintplotnomiss.head(10).plot(x='Violation_Location', y='Ticket_Frequency',
                                kind='bar', figsize=(10, 6))
plt.title("Violations by Precinct (top 10)")
plt.xlabel('Precinct')
plt.ylabel('Ticket Frequency')
plt.show()