How do I form an RDD of Vectors to pass to PySpark's correlation function?



I am trying to perform a pairwise correlation in PySpark. I read an input file and build a DataFrame from it. To pass it to PySpark's correlation function, I now need to convert it into an RDD of Vectors. Here is my current code:

    from pyspark.sql.types import StructType, StructField, StringType

    input = sc.textFile('File1.csv')
    header = input.first()  # extract header
    data = input.filter(lambda x: x != header)  # drop the header row
    parsedInput = data.map(lambda l: l.split(","))  # split each line into fields

    # define schema: all twelve columns as strings
    schemaString = "col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 col11 col12"
    fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
    schema = StructType(fields)

    df_i = sqlContext.createDataFrame(parsedInput, schema)

Now, according to the PySpark documentation (this page), this is how the correlation is computed:

    from pyspark.mllib.stat import Statistics

    # corr() takes an RDD of Vectors and returns the pairwise correlation matrix
    print(Statistics.corr(data, method="pearson"))

How do I convert my DataFrame df_i into an RDD of Vectors so that I can pass it to corr()?

Also, if there is a better way than my current approach to read the input file and run a pairwise correlation on it with PySpark, please show it with an example.

Update: here is a sample of my input data:

col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12
Cameroon,15 - 24 years old,8,160,408,3387232,no,no,0,82.7,116,0.712931034
Cameroon,15 - 24 years old,8,90,408,3683931,no,yes,39,94.8,89,1.065168539
Cameroon,15 - 24 years old,8,104,408,3663917,no,no,0,183.6,133,1.380451128
Cameroon,15 - 24 years old,8,96,408,3292045,no,no,0,144,102,1.411764706
Cameroon,25 - 39 years old,8,126,408,3399798,yes,no,0,197.6,126,1.568253968
Cameroon,15 - 24 years old,8,146,408,3483581,no,no,0,109,69,1.579710145
Cameroon,15 - 24 years old,8,34,408,3396446,no,no,0,128.8,80,1.61
Cameroon,15 - 24 years old,8,93,408,3607246,no,yes,42,166.9,101,1.652475248
Cameroon,15 - 24 years old,8,42,408,3577060,no,no,0,146.3,84,1.741666667
Cameroon,15 - 24 years old,8,57,408,3573817,no,yes,39,213,115,1.852173913
Cameroon,15 - 24 years old,8,94,408,3444022,no,no,0,207,109,1.899082569

1 Answer

Just do this:

    from pyspark.mllib.linalg import Vectors

    result = (df_i
        .rdd                                                  # convert the DataFrame to an RDD of Rows
        .map(lambda row: Vectors.dense([item for item in row])))

Here I am assuming that you want every column value from each Row of the DataFrame (hence the [item for item in row] comprehension).
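One caveat: Vectors.dense only accepts values that can be converted to numbers, and in the sample data col1, col2, col7 and col8 are strings (the hand-built schema also declares every column as StringType). A minimal sketch, assuming the remaining eight columns are the ones you actually want to correlate, would select them, cast the values to float, and then run the pairwise correlation:

    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.stat import Statistics

    # Assumed choice of columns: only those whose sample values are numeric
    numeric_cols = ["col3", "col4", "col5", "col6", "col9", "col10", "col11", "col12"]

    vectors = (df_i
        .select(*numeric_cols)                     # drop the string columns
        .rdd                                       # RDD of Rows
        .map(lambda row: Vectors.dense([float(item) for item in row])))  # cast CSV string values to floats

    # pairwise Pearson correlation matrix over the selected columns
    print(Statistics.corr(vectors, method="pearson"))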

A related example of an RDD of Vectors is here, and the documentation for .rdd is here.
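As for the last part of the question (a simpler way to read the file): a possible alternative, assuming the databricks spark-csv package is available for this Spark 1.x setup (or Spark 2.x, where CSV support is built in), is to let the CSV reader handle the header and the type inference instead of building the schema by hand:

    # Spark 1.x with the spark-csv package
    df_i = (sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load("File1.csv"))

    # Spark 2.x and later: CSV support is built into the DataFrame reader
    # df_i = spark.read.csv("File1.csv", header=True, inferSchema=True)

With inferSchema the numeric columns come back as numeric types, and the same select / Vectors.dense conversion shown above still applies.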
