<p>A PySpark way to do this:</p>
<p>For <strong><code>array_contains</code></strong>, you only need to wrap the call in <strong><code>F.expr</code></strong> so that the <strong><code>value</code></strong> argument can be passed as a column.</p>
<pre><code>from pyspark.sql import functions as F

# Replace empty clicked_url values with 0, flag rows whose hrefs array contains
# clicked_url, then keep rows that are flagged or have no click.
df.withColumn("clicked_url", F.when(F.col("clicked_url")=="", F.lit(0)).otherwise(F.col("clicked_url")))\
.withColumn("boolean", F.expr("""array_contains(hrefs,clicked_url)"""))\
.filter("boolean=true or clicked_url=0").drop("boolean").show()
+------+--------------------+-----------+
| query|               hrefs|clicked_url|
+------+--------------------+-----------+
|   car|       [url1, url10]|       url1|
|monkey|      [url11, url20]|      url11|
|donkey|      [url31, url40]|          0|
|  ball|[url41, url45, ur...|      url45|
+------+--------------------+-----------+
</code></pre>
<p>Since <strong><code>.filter</code></strong> also accepts an <strong><code>expression</code></strong>, you can put <strong><code>array_contains</code></strong> directly in the filter and skip the helper column:</p>
<pre><code>from pyspark.sql import functions as F

# Same result without the temporary "boolean" column: the SQL filter string
# calls array_contains directly.
df.withColumn("clicked_url", F.when(F.col("clicked_url")=="", F.lit(0))\
.otherwise(F.col("clicked_url")))\
.filter("array_contains(hrefs,clicked_url)=true or clicked_url=0").show()
+------+--------------------+-----------+
| query|               hrefs|clicked_url|
+------+--------------------+-----------+
|   car|       [url1, url10]|       url1|
|monkey|      [url11, url20]|      url11|
|donkey|      [url31, url40]|          0|
|  ball|[url41, url45, ur...|      url45|
+------+--------------------+-----------+
</code></pre>
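<p>To make the filter's row-level logic concrete, here is a minimal plain-Python model of the same predicate. It is only a sketch for checking expectations, not Spark code: the sample rows and the <code>keep</code> helper are hypothetical, and it assumes <code>hrefs</code> is never null and that an unclicked row has an empty-string <code>clicked_url</code>.</p>

```python
# Plain-Python model of the filter used above; the sample rows are hypothetical.
rows = [
    {"query": "car", "hrefs": ["url1", "url10"], "clicked_url": "url1"},
    {"query": "donkey", "hrefs": ["url31", "url40"], "clicked_url": ""},
    {"query": "cat", "hrefs": ["url50", "url60"], "clicked_url": "url99"},
]


def keep(row):
    """Keep a row if nothing was clicked, or the clicked URL is one of its hrefs."""
    clicked = row["clicked_url"]
    return clicked == "" or clicked in row["hrefs"]


# Names of the queries that survive the filter.
kept = [row["query"] for row in rows if keep(row)]
print(kept)  # ['car', 'donkey']
```

<p>The third row is dropped because its clicked URL does not appear in its own <code>hrefs</code>, which is the case the <code>array_contains</code> filter is meant to remove.</p>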