How to cast a MongoDB ObjectId to a PySpark data type

Posted 2024-10-02 18:27:22


According to this ticket, it's well known that PySpark doesn't have any direct implementation of a data type for the BSON ObjectId (please correct me if I'm wrong, since I am mostly self-taught).

But after observing the schema in PySpark, I thought the following might work for such an object type:

from pyspark.sql.types import ArrayType, StringType, StructField, StructType

StructField('userId', ArrayType(StructType([StructField("oid", StringType())])))

Adding this schema when writing to MongoDB did the trick.
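For reference, here is a minimal sketch of the write side, assuming the 2.x connector (com.mongodb.spark.sql.DefaultSource, which matches the stack trace below) and placeholder URI, database, and collection names, not the exact production code:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import ArrayType, StringType, StructField, StructType

    spark = SparkSession.builder.appName("mongo-objectid-sketch").getOrCreate()

    # Same shape as the field above: the ObjectId is modeled as an array of
    # structs with a single string field "oid".
    schema = StructType([
        StructField("userId", ArrayType(StructType([StructField("oid", StringType())])))
    ])

    # One row holding one ObjectId-like hex string (placeholder value)
    df = spark.createDataFrame([([("6e8868d4b1d369a754863961",)],)], schema)

    (df.write
        .format("com.mongodb.spark.sql.DefaultSource")  # Mongo Spark Connector 2.x
        .mode("append")
        .option("uri", "mongodb://localhost:27017/mydb.mycoll")  # placeholder URI
        .save())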

But, as we all know:

When using filters with DataFrames or the Python API, the underlying Mongo Connector code constructs an aggregation pipeline to filter the data in MongoDB before sending it to Spark.

So, if a key's value is of the BSON ObjectId type, is there any way to filter on that key?

Because when I read the data back using the schema approach above, it gives me the following error:
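A minimal sketch of the read and filter that reproduces this, reusing the schema and placeholder URI from the write sketch above:

    # Read back with the explicit schema, then filter on the nested "oid"
    # field; per the docs quoted above, the connector pushes such filters
    # down to MongoDB as an aggregation pipeline.
    df = (spark.read
        .format("com.mongodb.spark.sql.DefaultSource")
        .option("uri", "mongodb://localhost:27017/mydb.mycoll")  # placeholder URI
        .schema(schema)
        .load())

    df.filter(df["userId"][0]["oid"] == "6e8868d4b1d369a754863961").show()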

 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 107.0 failed 4 times, most recent failure: Lost task 0.3 in stage 107.0 (TID 3633, ip-10-0-1-254.ec2.internal, executor 3): com.mongodb.spark.exceptions.MongoTypeConversionException:
 Cannot cast OBJECT_ID into a ArrayType(StructType(StructField(oid,StringType,true)),true) (value: BsonObjectId{value=6e8868d4b1d369a754863961})
        at com.mongodb.spark.sql.MapFunctions$.com$mongodb$spark$sql$MapFunctions$$convertToDataType(MapFunctions.scala:214)
        at com.mongodb.spark.sql.MapFunctions$$anonfun$3.apply(MapFunctions.scala:37)
        at com.mongodb.spark.sql.MapFunctions$$anonfun$3.apply(MapFunctions.scala:35)
  
