PySpark: filter a DataFrame based on a broadcast variable

Posted 2024-10-03 17:22:04


I have a PySpark 2.0 DataFrame that I am trying to filter based on a (relatively) short list, perhaps 50 to 100 items long.

filterList = ['A','B','C']

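I would like to broadcast that list to each of my nodes and use it to filter out records where neither of two columns has a value in the list.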

This works:

filter_df= df.where((df['Foo'].isin(filterList)) | (df['Bar'].isin(filterList)))

But when I broadcast the list, I get an error:

filterListB= sc.broadcast(filterList)

filter_df= df.where((df['Foo'].isin(filterListB)) | (df['Bar'].isin(filterListB)))

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-99-1b972cf29148> in <module>()
----> 1 filter_df= df.where((df['Foo'].isin(filterListB)) | (df['Bar'].isin(filterListB)))

/usr/local/spark/python/pyspark/sql/column.pyc in isin(self, *cols)
    284         if len(cols) == 1 and isinstance(cols[0], (list, set)):
    285             cols = cols[0]
--> 286         cols = [c._jc if isinstance(c, Column) else _create_column_from_literal(c) for c in cols]
    287         sc = SparkContext._active_spark_context
    288         jc = getattr(self._jc, "isin")(_to_seq(sc, cols))

/usr/local/spark/python/pyspark/sql/column.pyc in _create_column_from_literal(literal)
     33 def _create_column_from_literal(literal):
     34     sc = SparkContext._active_spark_context
---> 35     return sc._jvm.functions.lit(literal)
     36 
     37 

/usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1122 
   1123     def __call__(self, *args):
-> 1124         args_command, temp_args = self._build_args(*args)
   1125 
   1126         command = proto.CALL_COMMAND_NAME +\

/usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in _build_args(self, *args)
   1092 
   1093         args_command = "".join(
-> 1094             [get_command_part(arg, self.pool) for arg in new_args])
   1095 
   1096         return args_command, temp_args

/usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in get_command_part(parameter, python_proxy_pool)
    287             command_part += ";" + interface
    288     else:
--> 289         command_part = REFERENCE_TYPE + parameter._get_object_id()
    290 
    291     command_part += "\n"

AttributeError: 'Broadcast' object has no attribute '_get_object_id'

Any ideas on how I should filter a PySpark 2.0 DataFrame based on a broadcast list?


1 answer

Answer 1 · Posted 2024-10-03 17:22:04

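You cannot pass a broadcast variable directly to a DataFrame function. isin() expects plain Python values or Column objects, and as the traceback shows, the Broadcast wrapper cannot be converted to a literal. Use "value" to access the value of the broadcast variable instead.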

So modify your code as follows:

filterListB= sc.broadcast(filterList)
filter_df= df.where((df['Foo'].isin(filterListB.value)) | (df['Bar'].isin(filterListB.value)))

Reference: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-broadcast.html
