PySpark: removing whitespace from n-grams

Posted 2024-09-29 19:31:07


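I am trying to generate 3-letter n-grams, but Spark's NGram inserts a space between each letter. I want to remove (or not generate) this whitespace. I could explode the array, strip the whitespace, and reassemble the array, but that would be a very expensive operation. Ideally I would also like to avoid creating a UDF, because of the performance problems with PySpark UDFs. Is there a cheaper way to do this using PySpark built-in functions?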

from pyspark.ml import Pipeline, Model, PipelineModel
from pyspark.ml.feature import Tokenizer, RegexTokenizer, StopWordsRemover, NGram
from pyspark.sql.functions import *


wordDataFrame = spark.createDataFrame([
    (0, "Hello I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic regression models are neat")
], ["id", "words"])

pipeline = Pipeline(stages=[
        RegexTokenizer(pattern="", inputCol="words", outputCol="tokens", minTokenLength=1),
        NGram(n=3, inputCol="tokens", outputCol="ngrams")
    ])

model = pipeline.fit(wordDataFrame).transform(wordDataFrame)

model.show()

The current output is:

+---+--------------------+--------------------+--------------------+
| id|               words|              tokens|              ngrams|
+---+--------------------+--------------------+--------------------+
|  0|Hello I heard abo...|[h, e, l, l, o,  ...|[h e l, e l l,   ...|
+---+--------------------+--------------------+--------------------+

But what I need is:

+---+--------------------+--------------------+--------------------+
| id|               words|              tokens|              ngrams|
+---+--------------------+--------------------+--------------------+
|  0|Hello I heard abo...|[h, e, l, l, o,  ...|[hel, ell, llo,  ...|
+---+--------------------+--------------------+--------------------+

1 answer
User
#1 · Posted 2024-09-29 19:31:07

You can do this with the higher-order function transform combined with regexp_replace (Spark 2.4+), assuming the ngrams column is an ArrayType of StringType.

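# sample dataframe
df.show()
+---+----------------+---------------+--------------+
| id|           words|         tokens|        ngrams|
+---+----------------+---------------+--------------+
|  0|Hi I heard about|[h, e, l, l, o]|[h e l, e l l]|
+---+----------------+---------------+--------------+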

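from pyspark.sql import functions as F

# Apply regexp_replace to every element of the ngrams array, removing each space.
# A plain ' ' is used as the pattern; the original "\ " is an invalid Python escape sequence.
df.withColumn("ngrams", F.expr("transform(ngrams, x -> regexp_replace(x, ' ', ''))")).show()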

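+---+----------------+---------------+----------+
| id|           words|         tokens|    ngrams|
+---+----------------+---------------+----------+
|  0|Hi I heard about|[h, e, l, l, o]|[hel, ell]|
+---+----------------+---------------+----------+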
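
For intuition, here is a minimal plain-Python sketch of what happens to each array element. It is not Spark code. The helper names ngram_like_spark and strip_spaces are hypothetical and exist only for this illustration. The sketch relies on the documented behaviour that NGram represents each n-gram as a space-delimited string of n consecutive tokens.

```python
import re


def ngram_like_spark(tokens, n=3):
    """Hypothetical stand-in for NGram: join each window of n tokens with one space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def strip_spaces(ngrams):
    """Hypothetical stand-in for transform(ngrams, x -> regexp_replace(x, ' ', ''))."""
    return [re.sub(" ", "", g) for g in ngrams]


tokens = list("hello")             # ['h', 'e', 'l', 'l', 'o']
spaced = ngram_like_spark(tokens)  # ['h e l', 'e l l', 'l l o']
joined = strip_spaces(spaced)      # ['hel', 'ell', 'llo']
print(spaced)
print(joined)

# Caveat: if a token is itself a space, it is removed together with the separators.
with_space_token = ngram_like_spark(list("o i"))  # ['o   i']
print(strip_spaces(with_space_token))             # ['oi'], not ['o i']
```

One thing to keep in mind: the replacement cannot tell a separator space from a token that is itself a space. If the tokens column contains space tokens (as the tokenizer output in the question appears to), n-grams that span a word boundary will lose that space as well.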
