How to split a PySpark DataFrame into two by rows

+--------------------+----------+
|              userid| eventdate|
+--------------------+----------+
|00518b128fc9459d9...|2017-10-09|
|00976c0b7f2c4c2ca...|2017-12-16|
|00a60fb81aa74f35a...|2017-12-04|
|00f9f7234e2c4bf78...|2017-05-09|
|0146fe6ad7a243c3b...|2017-11-21|
|016567f169c145ddb...|2017-10-16|
|01ccd278777946cb8...|2017-07-05|

3 answers

User

Answer 1 · edited 2024-05-17 03:21:17

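Spark DataFrames cannot be indexed the way you wrote it. You can use the head method to take the top n rows. This returns a list of Row objects, not a DataFrame, so you need to convert the list back into a DataFrame and then use subtract against the original DataFrame to get the remaining rows.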

# Take the top 100 rows and convert them back to a DataFrame.
# Pass the schema explicitly so Spark does not have to infer it from the rows.
df1 = sqlContext.createDataFrame(df.head(100), df.schema)

# Take the remaining rows
df2 = df.subtract(df1)
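
One caveat worth checking before relying on subtract: it behaves like SQL's EXCEPT DISTINCT, so it works on distinct rows. If df contains duplicate rows, df2 loses every row equal to any row in df1 and also collapses the remaining duplicates, so df1.count() + df2.count() can be smaller than df.count(). The sketch below is a plain-Python model of that behaviour, not Spark code: lists of tuples stand in for DataFrames, and except_distinct is a hypothetical helper, not a Spark API.

```python
def except_distinct(left, right):
    """Model of EXCEPT DISTINCT: distinct rows of `left` that are absent from `right`."""
    excluded = set(right)
    seen = set()
    result = []
    for row in left:
        if row not in excluded and row not in seen:
            seen.add(row)
            result.append(row)
    return result

# Hypothetical data: 5 rows, two of which appear twice.
rows = [("a", 1), ("a", 1), ("b", 2), ("c", 3), ("c", 3)]

top = rows[:1]                     # stands in for df.head(1)
rest = except_distinct(rows, top)  # stands in for df.subtract(df1)

print(top)   # [('a', 1)]
print(rest)  # [('b', 2), ('c', 3)]
print(len(top) + len(rest), "of", len(rows), "rows survive")  # 3 of 5 rows survive
```

On Spark 2.4+, DataFrame.exceptAll keeps duplicates, but splitting on an explicit unique key is usually the safer choice when rows can repeat.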

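If you are on Spark 2.0+, you can use SparkSession instead of sqlContext. Also, if you do not care specifically about the first 100 rows and want a random split instead, you can use randomSplit as follows: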

df1,df2 = df.randomSplit([0.20, 0.80],seed=1234)

User

Answer 2 · edited 2024-05-17 03:21:17

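At first I misunderstood and thought you wanted to slice the columns. If you want to select a subset of rows, one approach is to create an index column using monotonically_increasing_id(). From the docs: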

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.

You can sort the DataFrame by this ID and subset it with limit() to make sure you get exactly the rows you want.

For example:

import pyspark.sql.functions as f
import string

# create a dummy df with 500 rows and 2 columns
N = 500
numbers = [i%26 for i in range(N)]
letters = [string.ascii_uppercase[n] for n in numbers]

df = sqlCtx.createDataFrame(
    zip(numbers, letters),
    ('numbers', 'letters')
)

# add an index column
df = df.withColumn('index', f.monotonically_increasing_id())

# sort ascending and take first 100 rows for df1
df1 = df.sort('index').limit(100)

# sort descending and take 400 rows for df2
df2 = df.sort('index', ascending=False).limit(400)

Just to confirm that this is what you wanted:

df1.count()
#100
df2.count()
#400

We can also verify that the index ranges of the two DataFrames do not overlap:

df1.select(f.min('index').alias('min'), f.max('index').alias('max')).show()
#+---+---+
#|min|max|
#+---+---+
#|  0| 99|
#+---+---+

df2.select(f.min('index').alias('min'), f.max('index').alias('max')).show()
#+---+----------+
#|min|       max|
#+---+----------+
#|100|8589934841|
#+---+----------+
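
The very large max value in df2 is expected and is not a sign of overlap. According to the Spark documentation, the current implementation of monotonically_increasing_id puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The helper below decodes the values shown above in plain Python. Note that decode_monotonic_id is a hypothetical name, and the bit layout is an implementation detail that may change between Spark versions.

```python
RECORD_BITS = 33  # lower 33 bits hold the record number within a partition

def decode_monotonic_id(generated_id):
    """Split an ID into (partition_id, record_number), assuming the documented layout."""
    partition_id = generated_id >> RECORD_BITS
    record_number = generated_id & ((1 << RECORD_BITS) - 1)
    return partition_id, record_number

print(decode_monotonic_id(99))          # (0, 99)
print(decode_monotonic_id(100))         # (0, 100)
print(decode_monotonic_id(8589934841))  # (1, 249)
```

Under that assumption, 8589934841 is simply record 249 of partition 1, which suggests the 500 rows were spread over two partitions of 250 rows each. The IDs are still ordered and unique, so sorting by index and applying limit() works as intended.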

User

Answer 3 · edited 2024-05-17 03:21:17

If I don't mind having the same rows in both DataFrames, I can use sample. For example, I have a DataFrame with 354 rows.

>>> df.count()
354

>>> df.sample(False, 0.5, 0).count()  # approx. 50%
179

>>> df.sample(False, 0.1, 0).count()  # approx. 10%
34
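
The counts are only approximately 50% and 10% because, without replacement, sample keeps each row independently with probability fraction rather than picking an exact number of rows. The following plain-Python model illustrates the idea. It is not Spark's sampler, bernoulli_sample is a hypothetical helper, and its counts will not match the numbers above.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(354))  # stands in for a 354-row DataFrame

half = bernoulli_sample(rows, 0.5, seed=0)
tenth = bernoulli_sample(rows, 0.1, seed=0)

print(len(half))   # close to 177, but not guaranteed to be exactly 177
print(len(tenth))  # close to 35

# Separate samples of the same data can share rows.
print(len(set(half) & set(tenth)))
```

Because two calls to sample draw from the same DataFrame, their results can overlap, which is why this approach only fits when shared rows across the two DataFrames are acceptable.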

Or, if I want a strict split with no rows shared between the two DataFrames, I can do this:

df1 = df.limit(100)     # 100 rows
df2 = df.subtract(df1)  # remaining rows
