How do I convert a float to IntegerType using StructType in PySpark?

df = pd.read_csv('Amazon_Responded_Oct05.csv', error_bad_lines=False)
df.head()
>>>>
   user_id_str  user_followers_count  text_
0  143515471.0  1503                  @AmazonHelp Can you please DM me? A product I ...
1  85741735.0   149569                @SeanEPanjab I'm sorry, we're unable to DM you...
2  143515471.0  1503                  @AmazonHelp It was purchased on...
3  143515471.0  1503                  @AmazonHelp I am following you now, if it help...
4  85741735.0   149569                @SeanEPanjab Please give us a call/chat so we ...

data_schema = [StructField('user_followers_count', IntegerType(), True),
               StructField('user_id_str', StringType(), True),
               StructField('text', StringType(), True)]
final_struc = StructType(fields=data_schema)
data = spark.createDataFrame(df, schema=final_struc)
>>>>
TypeError: field user_followers_count: IntegerType can not accept object 143515471.0 in type <class 'float'>

df.astype({'user_id_str': 'int', 'user_followers_count': 'int', 'text_': 'str'}).dtypes
df.head(1)
>>>>
   user_id_str  user_followers_count  text_
0  143515471.0  1503                  @AmazonHelp Can you please DM me? A product I ...

1 Answer

User

#1 · Posted 2024-10-01 11:35:21

To convert a pandas DataFrame to a PySpark DataFrame, try the following:

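from pyspark.sql import Row, SparkSession
import pandas as pd
from pyspark.sql.types import StructField, StructType, StringType, IntegerType

# Reuse the active Spark session, or start one if none exists
spark = SparkSession.builder.getOrCreate()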

#create a sample pandas dataframe
data = {'a':['hello', 'hi', 'world'], 'b':[5.0, 6.4, 9.7], 'c':[1,2,3]}
df = pd.DataFrame(data)
'''
    a       b       c
0   hello   5.0     1
1   hi      6.4     2
2   world   9.7     3
'''

# Cast column 'b' from float to int (the fractional part is dropped: 6.4 -> 6, 9.7 -> 9)
df = df.astype({'b': 'int'})
df
'''
    a       b       c
0   hello   5       1
1   hi      6       2
2   world   9       3
'''

# Prepare the schema
fields = [StructField('a', StringType(), True),
          StructField('b', IntegerType(), True),
          StructField('c', IntegerType(), True)]
schema = StructType(fields)


# Convert to a PySpark DataFrame
rows = [Row(**record) for record in df.to_dict(orient='records')]
# [Row(a='hello', b=5, c=1), Row(a='hi', b=6, c=2), Row(a='world', b=9, c=3)]
df_sp = spark.createDataFrame(rows, schema)
df_sp.show()
# +-----+---+---+
# |    a|  b|  c|
# +-----+---+---+
# |hello|  5|  1|
# |   hi|  6|  2|
# |world|  9|  3|
# +-----+---+---+
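
For the question's own data, the error message hints at the cause: the value 143515471.0 belongs to user_id_str, yet it is reported against user_followers_count. That is consistent with Spark matching the pandas columns to the schema fields by position rather than by name. Below is a minimal pandas-only sketch of preparing the frame before handing it to Spark. It assumes the columns contain no missing values, and the two-row sample frame and the names schema_order and prepared are made up for illustration.

```python
import pandas as pd

# Hypothetical two-row sample mirroring the rows shown in the question
df = pd.DataFrame({
    'user_id_str': [143515471.0, 85741735.0],
    'user_followers_count': [1503, 149569],
    'text_': ['@AmazonHelp Can you please DM me?',
              "@SeanEPanjab I'm sorry, we're unable to DM you"],
})

# Field names in the same order as the question's StructType
schema_order = ['user_followers_count', 'user_id_str', 'text']

# Rename text_ so that it matches the schema's field name
prepared = df.rename(columns={'text_': 'text'})

# astype returns a new object, so the result must be assigned.
# Assumption: no NaN values, otherwise the integer cast raises an error.
prepared['user_followers_count'] = prepared['user_followers_count'].astype('int64')

# Go float -> int -> str so the ID reads '143515471' rather than '143515471.0'
prepared['user_id_str'] = prepared['user_id_str'].astype('int64').astype(str)

# Put the columns in the same order as the schema fields
prepared = prepared[schema_order]

print(prepared.dtypes)
```

If those assumptions hold, spark.createDataFrame(prepared, schema=final_struc) should accept the frame. Note that the question's astype call discards its result, which is why df.head(1) still shows the float values.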
