How do I convert a byte array in one row of a PySpark DataFrame into a column of bytes?

df = pd.DataFrame({'content': [bytearray(b'\x01%\xeb\x8cH\x89')]})
spark.createDataFrame(df).show()
+-------------------+
|            content|
+-------------------+
|[01 25 EB 8C 48 89]|
+-------------------+

3 Answers

User

#1 · Edited 2024-09-29 23:21:12

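You need to use flatMap, passing in a function that parses each data element. Note that flatMap is an RDD method rather than a column function, so it is called on df.rdd. The function you supply should return a sequence; each element of that sequence becomes a new row.

A more detailed explanation with more examples is available here: https://koalatea.io/python-pyspark-flatmap/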
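As a minimal sketch of the flatMap idea (not necessarily what the answerer had in mind), the per-row function below turns a byte array into one single-field tuple per byte. The helper names `explode_bytes` and `bytes_to_rows` are hypothetical, and the column name `content` is taken from the question.

```python
def explode_bytes(row):
    """Return one single-field tuple per byte in the row's 'content' value."""
    # Iterating over bytes/bytearray in Python 3 yields ints in the range 0..255.
    # A null value is treated as an empty byte string, so it produces no rows.
    return [(b,) for b in (row["content"] or b"")]


def bytes_to_rows(df):
    """Build a DataFrame with one integer row per byte.

    Assumes `df` is a Spark DataFrame with a binary column named 'content'.
    """
    # flatMap lives on the RDD, so go through df.rdd and convert back with toDF.
    return df.rdd.flatMap(explode_bytes).toDF(["content"])
```

With an active SparkSession named spark, this would be called as bytes_to_rows(spark.createDataFrame(df)).show(). Going through the RDD moves every row into Python, so it is likely slower than a pure column-expression approach on large data.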

User

#2 · Edited 2024-09-29 23:21:12

Using a UDF to convert the bytearray into an array of integers, then exploding that array, may help:

import pandas as pd
import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, IntegerType

# Iterating over a bytearray yields one int per byte; nulls are passed through unchanged.
byte_to_int = lambda x: [int(y) for y in x] if x is not None else None
byte_to_int_udf = f.udf(byte_to_int, ArrayType(IntegerType()))

df = pd.DataFrame({'content': [bytearray(b'\x01%\xeb\x8cH\x89')]})
df1 = spark.createDataFrame(df)  # assumes an active SparkSession named `spark`
df1.withColumn("content_array", byte_to_int_udf(f.col('content'))).select(f.explode(f.col('content_array'))).show()

User

#3 · Edited 2024-09-29 23:21:12

I applied a series of transformations here, each explained in a comment. It is a bit hacky, though.

from pyspark.sql import functions as F

# assumes `df` is the Spark DataFrame from the question, with a binary `content` column
(df
    .withColumn('content', F.hex('content')) # convert bytes to hex: 0125EB8C4889
    .withColumn('content', F.regexp_replace('content', r'(\w{2})', '$1,')) # append a comma after every two hex digits: 01,25,EB,8C,48,89,
    .withColumn('content', F.expr('substring(content, 1, length(content) - 1)')) # remove the trailing comma: 01,25,EB,8C,48,89
    .withColumn('content', F.split('content', ',')) # split hex values by comma: [01, 25, EB, 8C, 48, 89]
    .withColumn('content', F.explode('content')) # explode hex values to multiple rows
    .withColumn('content', F.conv('content', 16, 10)) # convert hex to decimal (conv returns strings)
    .show(10, False)
)

# Output
# +-------+
# |content|
# +-------+
# |1      |
# |37     |
# |235    |
# |140    |
# |72     |
# |137    |
# +-------+
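To check the intermediate values without a Spark session, the same chain of steps can be traced in plain Python. This is only an illustrative mirror of the pipeline above, not Spark code; the variable names are made up for the sketch.

```python
import re

raw = bytearray(b'\x01%\xeb\x8cH\x89')

# Step 1: bytes to an uppercase hex string (mirrors F.hex)
hex_str = raw.hex().upper()

# Step 2: append a comma after every two hex digits (mirrors F.regexp_replace)
with_commas = re.sub(r'(\w{2})', r'\1,', hex_str)

# Step 3: drop the trailing comma (mirrors the substring expression)
trimmed = with_commas[:-1]

# Step 4: split into two-character hex chunks (mirrors F.split)
chunks = trimmed.split(',')

# Step 5: convert each chunk from base 16 to an integer.
# Unlike F.conv, which returns strings, int() returns Python ints.
values = [int(chunk, 16) for chunk in chunks]

print(values)
```

Each chunk ends up as one element here, whereas in Spark the explode step turns each chunk into its own row.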
