<p><strong><em>Creating the dataset</em></strong></p>
<pre><code>myValues = [('jcb',False,4),('american express', False, 22084),('AMEX',False,4),('mastercard',True,1122),('visa',True,1975),('visa',False,126372),('CB',False,6),('discover',False,2219),('maestro',False,2),('VISA',False,13),('mastercard',False,40856),('MASTERCARD',False,9)]
df = sqlContext.createDataFrame(myValues,['card_Scheme','failed','count'])
df.show()
+----------------+------+------+
|     card_Scheme|failed| count|
+----------------+------+------+
|             jcb| false|     4|
|american express| false| 22084|
|            AMEX| false|     4|
|      mastercard|  true|  1122|
|            visa|  true|  1975|
|            visa| false|126372|
|              CB| false|     6|
|        discover| false|  2219|
|         maestro| false|     2|
|            VISA| false|    13|
|      mastercard| false| 40856|
|      MASTERCARD| false|     9|
+----------------+------+------+
</code></pre>
<p><strong>Method 1:</strong> This approach will be slower, because it involves a transpose via <code>pivot</code>.</p>
^{pr2}$
<p><strong>Method 2:</strong> Alternatively, you can use a <code>Window</code> function. This will be much faster.</p>
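<pre><code>from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Sum counts per (card_scheme, failed), then divide each sum by its scheme's total.
scheme_total = F.sum("sum(count)").over(Window.partitionBy("card_scheme"))
df = df.groupBy("card_scheme", "failed").agg(F.sum("count"))\
  .withColumn("X", F.col("sum(count)") / scheme_total)\
  .where(F.col("failed") == False).drop("failed", "sum(count)")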
df.show()
+----------------+------------------+
|     card_scheme|                 X|
+----------------+------------------+
|            VISA|               1.0|
|             jcb|               1.0|
|      MASTERCARD|               1.0|
|         maestro|               1.0|
|            AMEX|               1.0|
|      mastercard|0.9732717137548239|
|american express|               1.0|
|              CB|               1.0|
|        discover|               1.0|
|            visa|0.9846120283294506|
+----------------+------------------+
</code></pre>
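<p>To make clear what <code>X</code> represents, the same ratio can be cross-checked in plain Python without Spark. This is only an illustrative sketch; the helper name <code>success_share</code> is hypothetical. For each card scheme it divides the non-failed count by the scheme's total count, matching scheme names case-sensitively as the output above does.</p>

```python
from collections import defaultdict

# Same rows as the dataset above: (card_scheme, failed, count).
rows = [('jcb', False, 4), ('american express', False, 22084), ('AMEX', False, 4),
        ('mastercard', True, 1122), ('visa', True, 1975), ('visa', False, 126372),
        ('CB', False, 6), ('discover', False, 2219), ('maestro', False, 2),
        ('VISA', False, 13), ('mastercard', False, 40856), ('MASTERCARD', False, 9)]

def success_share(rows):
    """Return {card_scheme: non-failed count / total count} (hypothetical helper)."""
    totals = defaultdict(int)     # all transactions per scheme
    succeeded = defaultdict(int)  # transactions with failed == False
    for scheme, failed, count in rows:
        totals[scheme] += count
        if not failed:
            succeeded[scheme] += count
    # Only schemes with at least one non-failed row are kept,
    # mirroring the where(failed == False) filter.
    return {scheme: succeeded[scheme] / totals[scheme] for scheme in succeeded}

shares = success_share(rows)
print(shares["mastercard"])  # 40856 / (40856 + 1122)
print(shares["visa"])        # 126372 / (126372 + 1975)
```

<p>Schemes with no failed rows come out as exactly 1.0, which is why most rows in the table show that value.</p>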