PySpark: how to get only updated and new records

Posted 2024-10-04 01:27:34


I am using PySpark 2.1, and these are my DataFrames.

Yesterday's data

1,Nagraj,Keshav,2017-11-20 00:02:39.867000000 2017-11-20 00:02:39.867000000 20171120060038

2,Raghu,HR,2017-11-20 00:02:39.867000000 2017-11-20 00:02:39.867000000 20171120060038

Today's data

1,Nagraj,K,2017-11-21 00:02:39.867000000 2017-11-21 00:02:39.867000000

2,Raghu,HR,2017-11-21 00:02:39.867000000 2017-11-20 00:02:39.867000000

3,Ramya,Govindaraju,2017-11-21 00:02:39.867000000 2017-11-20 00:02:39.867000000

Expected output

1,Nagraj,K,2017-11-21 00:02:39.867000000 2017-11-20 00:02:39.867000000 20171120060038

3,Ramya,Govindaraju,2017-11-21 00:02:39.867000000 2017-11-20 00:02:39.867000000 20171120060038

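I should not get records that are unchanged between the two DataFrames. Only the name in the first record has changed, so I should get that record, and record 3 is a new record.

I used the following logic: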

df =today_data_df.select("id").subtract(yesterdata_data_df.select("id")).toDF('d1').join(today_data_df,col('d1')==today_data_df.id).drop('d1')

The output is:

3,Ramya,Govindaraju,2017-11-21 00:02:39.867000000 2017-11-20 00:02:39.867000000 20171120060038

But I should get the output below. Please help.

1,Nagraj,K,2017-11-21 00:02:39.867000000 2017-11-20 00:02:39.867000000 20171120060038

3,Ramya,Govindaraju,2017-11-21 00:02:39.867000000 2017-11-20 00:02:39.867000000 20171120060038


1 answer
User
Reply #1 · Posted 2024-10-04 01:27:34

I am assuming there is a single name field whose value contains ','.

# Assumes an existing SparkSession named `spark` (as in the pyspark shell).
# Yesterday's snapshot
ydata = [
    (1, 'Nagraj,Keshav', '2017-11-20 00:02:39.867000000', '2017-11-20 00:02:39.867000000', 20171120060038),
    (2, 'Raghu,HR', '2017-11-20 00:02:39.867000000', '2017-11-20 00:02:39.867000000', 20171120060038),
]
yschema = ['id', 'name', 'fdate', 'tdate', 'stamp']
# Today's snapshot: id 1 has a new name, id 3 is a new record
tdata = [
    (1, 'Nagraj,K', '2017-11-21 00:02:39.867000000', '2017-11-21 00:02:39.867000000', 20171120060038),
    (2, 'Raghu,HR', '2017-11-21 00:02:39.867000000', '2017-11-20 00:02:39.867000000', 20171120060038),
    (3, 'Ramya,Govindaraju', '2017-11-21 00:02:39.867000000', '2017-11-20 00:02:39.867000000', 20171120060038),
]
ydf = spark.createDataFrame(ydata, yschema)
tdf = spark.createDataFrame(tdata, yschema)
# (id, name) pairs present today but not yesterday: changed or new records
newdf = tdf.select('id', 'name').subtract(ydf.select('id', 'name'))

# Join back to today's data to recover the remaining columns
newdf.join(tdf, newdf['id'] == tdf['id']).drop(tdf['id']).drop(tdf['name']).show()

Output:

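    +---+-----------------+--------------------+--------------------+--------------+
    | id|             name|               fdate|               tdate|         stamp|
    +---+-----------------+--------------------+--------------------+--------------+
    |  1|         Nagraj,K|2017-11-21 00:02:...|2017-11-21 00:02:...|20171120060038|
    |  3|Ramya,Govindaraju|2017-11-21 00:02:...|2017-11-20 00:02:...|20171120060038|
    +---+-----------------+--------------------+--------------------+--------------+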
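
To see why subtracting on id alone returns only record 3, while subtracting on (id, name) also returns the changed record 1, here is a rough sketch of the same set logic in plain Python. It does not use Spark at all; the helper names `new_ids_only` and `changed_or_new` are made up for illustration, and the rows are cut down to just (id, name).

```python
# Plain-Python sketch of the set logic only (illustrative; no Spark involved).
# Rows are reduced to (id, name); the helper names are hypothetical.
yesterday = [(1, "Nagraj,Keshav"), (2, "Raghu,HR")]
today = [(1, "Nagraj,K"), (2, "Raghu,HR"), (3, "Ramya,Govindaraju")]


def new_ids_only(today_rows, yesterday_rows):
    """Mimic a subtract on 'id' alone: only ids absent yesterday survive."""
    old_ids = {row_id for row_id, _ in yesterday_rows}
    return sorted({row_id for row_id, _ in today_rows if row_id not in old_ids})


def changed_or_new(today_rows, yesterday_rows):
    """Mimic a subtract on ('id', 'name'): pairs absent yesterday survive."""
    return sorted(set(today_rows) - set(yesterday_rows))


print(new_ids_only(today, yesterday))    # [3]
print(changed_or_new(today, yesterday))  # [(1, 'Nagraj,K'), (3, 'Ramya,Govindaraju')]
```

Note that only the columns included in the subtract are compared. Record 2 has a different fdate today but is still treated as unchanged, which matches the expected output in the question; if a change in some other column should count as an update, that column would need to be part of the select as well.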
