<p>This should work no matter how many times <code>package_scan_code=03</code> occurs for each <code>package_id</code> in the DataFrame. I have added an extra <code>(123,'LosAngeles','03')</code> row to demonstrate this -</p>
<p><strong>Step 1:</strong> Create the DataFrame</p>
<pre><code>values = [(123,'Denver','05'),(123,'LosAngeles','03'),(123,'Dallas','09'),(123,'Vail','02'),(123,'LosAngeles','03'),
(456,'Jacksonville','05'),(456,'Nashville','09'),(456,'Memphis','03')]
df = sqlContext.createDataFrame(values,['package_id','location','package_scan_code'])
</code></pre>
<p><strong>Step 2:</strong> Create a dictionary mapping <code>package_id</code> to <code>location</code>.</p>
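<p>The original code for this step did not survive extraction. A minimal sketch of what it likely did, assuming the dictionary keeps the <code>location</code> of each row where <code>package_scan_code</code> equals <code>'03'</code> (the name <code>dict_location_scan_code</code> is taken from step 3; the standalone Spark session setup here is only so the snippet runs on its own):</p>
<pre><code>from pyspark.sql import SparkSession

# Re-create the frame from step 1 so this snippet is self-contained
spark = SparkSession.builder.master('local[1]').appName('demo').getOrCreate()
values = [(123,'Denver','05'),(123,'LosAngeles','03'),(123,'Dallas','09'),
          (123,'Vail','02'),(123,'LosAngeles','03'),
          (456,'Jacksonville','05'),(456,'Nashville','09'),(456,'Memphis','03')]
df = spark.createDataFrame(values, ['package_id','location','package_scan_code'])

# Keep only the rows with scan code '03' and collect them into a
# {package_id: location} dictionary on the driver
dict_location_scan_code = {
    row['package_id']: row['location']
    for row in df.filter(df['package_scan_code'] == '03').collect()
}
print(dict_location_scan_code)  # {123: 'LosAngeles', 456: 'Memphis'}
</code></pre>
<p>Collecting to the driver is fine here because the dictionary holds one entry per <code>package_id</code>, not one per row.</p>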
<p><strong>Step 3:</strong> Create a column by looking up <code>package_id</code> in the dictionary.</p>
<pre><code>from pyspark.sql.functions import col, create_map, lit
from itertools import chain

# create_map expects a flat sequence of alternating key/value literals,
# so flatten the dict's (key, value) pairs with chain(*...)
mapping_expr = create_map([lit(x) for x in chain(*dict_location_scan_code.items())])
df = df.withColumn('origin', mapping_expr.getItem(col('package_id')))
df.show()
+----------+------------+-----------------+----------+
|package_id|    location|package_scan_code|    origin|
+----------+------------+-----------------+----------+
|       123|      Denver|               05|LosAngeles|
|       123|  LosAngeles|               03|LosAngeles|
|       123|      Dallas|               09|LosAngeles|
|       123|        Vail|               02|LosAngeles|
|       123|  LosAngeles|               03|LosAngeles|
|       456|Jacksonville|               05|   Memphis|
|       456|   Nashville|               09|   Memphis|
|       456|     Memphis|               03|   Memphis|
+----------+------------+-----------------+----------+
</code></pre>