PySpark: removing whitespace from n-grams

from pyspark.ml import Pipeline, Model, PipelineModel
from pyspark.ml.feature import Tokenizer, RegexTokenizer, StopWordsRemover, NGram
from pyspark.sql.functions import *

wordDataFrame = spark.createDataFrame([
    (0, "Hello I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic regression models are neat")
], ["id", "words"])

pipeline = Pipeline(stages=[
    RegexTokenizer(pattern="", inputCol="words", outputCol="tokens", minTokenLength=1),
    NGram(n=3, inputCol="tokens", outputCol="ngrams")
])

model = pipeline.fit(wordDataFrame).transform(wordDataFrame)
model.show()

Current output:

+---+--------------------+--------------------+--------------------+
| id|               words|              tokens|              ngrams|
+---+--------------------+--------------------+--------------------+
|  0|Hi I heard about ...| [h, e, l, l, o, ...|  [h e l, e l l, ...|
+---+--------------------+--------------------+--------------------+

Desired output:

+---+--------------------+--------------------+--------------------+
| id|               words|              tokens|              ngrams|
+---+--------------------+--------------------+--------------------+
|  0|Hello I heard ab ...| [h, e, l, l, o, ...| [hel, ell, llo, ...|
+---+--------------------+--------------------+--------------------+

1 answer

User

#1 · Posted 2024-09-29 19:31:07

You can do this with the higher-order function transform combined with a regex replacement (Spark 2.4+), assuming the ngrams column is an ArrayType of StringType.

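# sample dataframe
df.show()
+---+----------------+---------------+--------------+
| id|           words|         tokens|        ngrams|
+---+----------------+---------------+--------------+
|  0|Hi I heard about|[h, e, l, l, o]|[h e l, e l l]|
+---+----------------+---------------+--------------+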

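from pyspark.sql import functions as F

# transform() applies the lambda to every element of the array column (Spark 2.4+);
# regexp_replace removes the spaces that NGram puts between the joined tokens.
df.withColumn("ngrams", F.expr("transform(ngrams, x -> regexp_replace(x, ' ', ''))")).show()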

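+---+----------------+---------------+----------+
| id|           words|         tokens|    ngrams|
+---+----------------+---------------+----------+
|  0|Hi I heard about|[h, e, l, l, o]|[hel, ell]|
+---+----------------+---------------+----------+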
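
As for why the spaces are there in the first place: NGram joins each window of n consecutive tokens with a single space, so single-character tokens come out as strings like "h e l". The sketch below mimics that behaviour in plain Python, without a Spark session, and then applies the same space-stripping step as the transform expression above. The helper names space_joined_ngrams and strip_spaces are hypothetical and only for illustration; they are not part of the PySpark API.

```python
import re
from typing import List


def space_joined_ngrams(tokens: List[str], n: int) -> List[str]:
    """Mimic NGram: slide a window of n tokens and join each window with a space."""
    # Fewer than n tokens gives an empty range, hence an empty list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def strip_spaces(ngrams: List[str]) -> List[str]:
    """Plain-Python analogue of transform(ngrams, x -> regexp_replace(x, ' ', ''))."""
    return [re.sub(" ", "", ng) for ng in ngrams]


tokens = list("hello")                  # single-character tokens
raw = space_joined_ngrams(tokens, 3)    # space-joined trigrams
clean = strip_spaces(raw)               # trigrams with the separators removed
print(raw)
print(clean)
```

Note that this removes every space in each n-gram, so a token that is itself a space would disappear as well, not only the separators.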
