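I have a log file whose entries look more or less like the ones below. I want to clean them up a bit and get them into the correct form, because they are real links.
Does anyone know how to write a regular expression in py(spark) that produces the desired output?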
1:
https%3A%2F%2Fwww.btv.com%2Fnews%2Ffinland%2Fartikel%2F5174938%2Fzwemmer-zoekactie-julianadorp-kinderen-gered
Desired Output
https://www.btv.com/news/finland/artikel/5174938/zwemmer-zoekactie-julianadorp-kinderen-gered
2:
https%3A%2F%2Fwww.weather.com%2F
Desired Output
https://www.weather.com
3:
https%3A%2F%2Fwww.weather.com%2Ffinland%2Fneerslag%2Fweather%2F3uurs
Desired Output
https://www.weather.com/finland/neerslag/weather/3uurs
I have tried a few solutions, but without really understanding them:
\b\w+\b(?!\/)
from pyspark.sql.functions import regexp_extract, col
regexp_extract(column_name, regex, group_number)
regex('(.)(by)(\s+)(\w+)')
Thanks in advance.
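You can decode the strings with a URL (percent) decoding function instead of a regex, and you have to wrap it in a udf to use it with pyspark.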
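A minimal sketch of that approach, assuming the decoding function is Python's standard-library `urllib.parse.unquote` (my assumption; the function name did not survive in the answer). The helper names `decode_url` and `add_decoded_column` and the column names `url` / `url_decoded` are hypothetical. Stripping the trailing slash is also an assumption, made only so that sample 2 matches its desired output.

```python
from urllib.parse import unquote


def decode_url(encoded):
    """Percent-decode a URL string; None (a null cell) passes through."""
    if encoded is None:
        return None
    # rstrip("/") mirrors sample 2 in the question, where the trailing
    # slash is dropped. Remove it if the slash should be kept.
    return unquote(encoded).rstrip("/")


def add_decoded_column(df, source_col="url", target_col="url_decoded"):
    """Return df with a decoded copy of source_col (column names are hypothetical)."""
    # Imported here so decode_url can be tried without a Spark installation.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    decode_udf = udf(decode_url, StringType())
    return df.withColumn(target_col, decode_udf(col(source_col)))


# Plain-Python check on sample 2 from the question.
print(decode_url("https%3A%2F%2Fwww.weather.com%2F"))
```

With a DataFrame `df` that holds the encoded strings in a column named `url`, the call would be `add_decoded_column(df)`. A Python udf runs row by row outside the JVM, so it may be slow on large logs.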
Output: