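<p>Your <code>.filter</code> raises an error because it is the SQL filter function on DataFrames, which expects a <code>BooleanType()</code> column, not the filter function on RDDs, which takes a Python function. If you want the RDD version, just add <code>.rdd</code>:
</p>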
<pre class="lang-py prettyprint-override"><code>small_DF.rdd.filter(lambda x: any(word in x.text for word in test_list))
</code></pre>
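<p>The lambda passed to the RDD filter is an ordinary Python predicate evaluated once per row, and <code>word in x.text</code> is a case-sensitive substring test. Below is a minimal sketch of that predicate without Spark; the <code>Row</code> namedtuple, the contents of <code>test_list</code>, and the helper <code>contains_any</code> are assumptions made for illustration, and the lowercasing is an addition that the original lambda does not perform.</p>

```python
from collections import namedtuple

# Hypothetical stand-in for a pyspark Row with a single `text` field.
Row = namedtuple("Row", ["text"])

# Assumed contents; the question does not show the real test_list.
test_list = ["starbucks", "dell"]
rows = [
    Row("i love Starbucks"),
    Row("dell laptops rocks"),
    Row("help me I am stuck!"),
]


def contains_any(row, words):
    """Return True if any word occurs in row.text, ignoring case."""
    text = row.text.lower()
    return any(word in text for word in words)


# Plain-Python equivalent of rdd.filter(lambda x: contains_any(x, test_list)).
matches = [r.text for r in rows if contains_any(r, test_list)]
```

<p>With a real RDD, the same function could be passed as <code>small_DF.rdd.filter(lambda x: contains_any(x, test_list))</code>.</p>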
<p>You don't need a UDF, though: pyspark columns have <code>.rlike</code>, which lets you filter with a regular expression:</p>
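<pre class="lang-py prettyprint-override"><code>from pyspark.sql import HiveContext
import pyspark.sql.functions as psf

# `sc` is the existing SparkContext (available by default in the pyspark shell)
hc = HiveContext(sc)

# Lowercase the keywords so they can be compared against lowercased text
words = [x.lower() for x in ['starbucks', 'Nvidia', 'IBM', 'Dell']]
data = [['i love Starbucks'],['dell laptops rocks'],['help me I am stuck!']]
df = hc.createDataFrame(data).toDF('text')

# Keep rows whose lowercased text matches any keyword (regex alternation)
df.filter(psf.lower(df.text).rlike('|'.join(words)))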
</code></pre>
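<p>Note that <code>'|'.join(words)</code> matches a keyword anywhere in the text, including inside longer words. The sketch below uses Python's <code>re</code> module as a stand-in to show the difference between that pattern and a stricter one with word boundaries. It assumes the keywords are plain alphanumeric words, for which Python and Java regex syntax agree; the names <code>loose</code>, <code>strict</code>, and the sample <code>texts</code> are hypothetical.</p>

```python
import re

words = [w.lower() for w in ['starbucks', 'Nvidia', 'IBM', 'Dell']]

# Pattern from the answer: plain alternation, matches substrings.
loose = '|'.join(words)

# Stricter variant: escape each keyword and require word boundaries.
strict = r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b'

texts = ['i love Starbucks', 'dell laptops rocks', 'the modeller is stuck']

# re.search looks for a match anywhere in the string, like rlike does.
loose_hits = [t for t in texts if re.search(loose, t.lower())]
strict_hits = [t for t in texts if re.search(strict, t.lower())]
```

<p>Here <code>loose</code> also matches "the modeller is stuck" because "modeller" contains "dell", while <code>strict</code> does not. If that matters for your data, the stricter pattern string could be passed to <code>.rlike</code> in the same way.</p>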