Renaming PySpark DataFrame columns based on a CSV file

Posted 2024-06-01 07:06:34


I have the following DataFrame:

+--------+---------------+--------------------+---------+
|province|           city|      infection_case|confirmed|
+--------+---------------+--------------------+---------+
|   Seoul|     Yongsan-gu|       Itaewon Clubs|      139|
|   Seoul|      Gwanak-gu|             Richway|      119|
|   Seoul|        Guro-gu| Guro-gu Call Center|       95|
|   Seoul|   Yangcheon-gu|Yangcheon Table T...|       43|
|   Seoul|      Dobong-gu|     Day Care Center|       43|
+--------+---------------+--------------------+---------+

Now I want to change the column names (the header row) based on a CSV file that looks like this:

province,any_other__name
city,any_other__name      
infection_case,any_other__name
confirmed,any_other__name   

Here is my code:

cases = spark.read.load(
    "/home/tool/Desktop/database/TEST/archive/Case.csv",
    format="csv", sep=",", inferSchema="true", header="true")
cases = cases.select('province','city','infection_case','confirmed')
cases \
  .write \
  .mode('overwrite') \
  .option('header', 'true') \
  .csv('8.csv')

3 Answers

# Define the mapping as (old_name, new_name) pairs, then
# rename every required column with withColumnRenamed.

schema = {
    'province': 'any_province__name',
    'city': 'any_city__name',
    'infection_case': 'any_infection_case__name',
    'confirmed': 'any_confirmed__name'
}

def rename_column(df=None, schema=None):
    for column in df.columns:
        # Fall back to the original name for columns missing from the mapping
        df = df.withColumnRenamed(column, schema.get(column, column))
    return df

df_final = rename_column(df=cases, schema=schema)
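Since the goal is to drive the renames from a CSV file rather than a hard-coded dict, the schema mapping above can also be loaded from that file. A minimal sketch, assuming the two-column old_name,new_name layout shown in the question (the path is a placeholder):

import csv

# Build the {old_name: new_name} mapping from the two-column CSV
with open("path/to/file.csv", newline="") as f:
    schema = {row[0].strip(): row[1].strip() for row in csv.reader(f) if row}

df_final = rename_column(df=cases, schema=schema)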

The best solution is to use the withColumnRenamed method:

# Read the old_name,new_name pairs and apply each rename in turn
for line in open("path/to/file.csv"):
    old_name, new_name = line.strip().split(",")
    cases = cases.withColumnRenamed(old_name, new_name)
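If every column is being renamed anyway, the per-column loop can be collapsed into a single pass with toDF. A minimal sketch, assuming a schema dict like the one in the first answer:

# Rename all columns at once; columns missing from the mapping keep their name
cases = cases.toDF(*[schema.get(c, c) for c in cases.columns])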

Another solution: in PySpark, use selectExpr() with the "as" keyword to rename a column from "old_name" to "new_name":

cases = cases.selectExpr("province as names1", "city as names2",
                         "infection_case as names3", "confirmed as names4")
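The selectExpr arguments can also be generated from the mapping CSV instead of being typed by hand. A minimal sketch, assuming the same old_name,new_name file (the path is a placeholder):

# Build "old as new" expressions from the mapping file, skipping blank lines
with open("path/to/file.csv") as f:
    exprs = [f"{old} as {new}"
             for old, new in (line.strip().split(",") for line in f if line.strip())]

cases = cases.selectExpr(*exprs)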
