Searching the remaining columns of a PySpark DataFrame for the values in column1

2 Answers

User

Answer 1 · edited 2024-09-28 05:29:06

Yes, you can use the Spark SQL isin operator.

Let's first create a DataFrame for the example.

Part 1 - Creating the DataFrame

from pyspark.sql.types import StructType, StructField, IntegerType

# Schema: an integer id plus four integer value columns
cSchema = StructType([StructField("id", IntegerType()),
                      StructField("col1", IntegerType()),
                      StructField("col2", IntegerType()),
                      StructField("col3", IntegerType()),
                      StructField("col4", IntegerType())])

test_data = [[1,4,10,4,6],[2,6,3,6,1],[3,6,0,2,1],[4,8,8,6,1],[5,9,6,6,9]]

# assumes an active SparkSession named spark
df = spark.createDataFrame(test_data, schema=cSchema)

df.show()

+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
|  1|   4|  10|   4|   6|
|  2|   6|   3|   6|   1|
|  3|   6|   0|   2|   1|
|  4|   8|   8|   6|   1|
|  5|   9|   6|   6|   9|
+---+----+----+----+----+

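Part 2 - A function to search for matching values

isin: a boolean expression that evaluates to true if the value of this expression is contained in the evaluated values of the arguments. http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html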


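This should point you in the right direction. You can select just the id column, or whatever else you want returned. The function can easily be changed to take more columns to search. Hope this helps!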

User

Answer 2 · edited 2024-09-28 05:29:06

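from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# define the schema as a list of StructFields (string id, integer value columns)
cSchema = StructType([StructField("id", StringType()),
                      StructField("col1", IntegerType()),
                      StructField("col2", IntegerType()),
                      StructField("col3", IntegerType()),
                      StructField("col4", IntegerType())])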

test_data = [['as1', 4, 10, 4, 6],
             ['as2', 6, 3, 6, 1],
             ['as3', 6, 0, 2, 1],
             ['as4', 8, 8, 6, 1],
             ['as5', 9, 6, 6, 9]]

# create pyspark dataframe
df = spark.createDataFrame(test_data, schema=cSchema)

df.show()

# obtain the distinct values of col1
distinct_list = [i.col1 for i in df.select("col1").distinct().collect()]
# the remaining columns to search
col_list = ['id', 'col2', 'col3', 'col4']

# for each distinct col1 value, show the rows in which a remaining column equals it
def search(distinct_list):
    for i in distinct_list:
        print(str(i) + ' found in: ')

        # for col_name in df.columns:
        for col_name in col_list:
            # str(i) lets the same comparison also run against the string-typed id column
            df_search = df.select(*col_list) \
                .filter(df[col_name] == str(i))

            # head(1) checks for at least one match without counting every row
            if len(df_search.head(1)) > 0:
                df_search.show()


search(distinct_list)

The complete example code is available on GitHub.
