How do I split a string column into two separate columns, one for the uppercase part and one for the lowercase part, in PySpark/Python/SQL?

Input values:

MEM-BEN-BTN-CLK-entertainment-audible
MEM-BEN-LOC-MODAL-LOCATION-INPUT-Birmingham, AL, USA
MEM-BEN-BTN-CLK-entertainment-games
MEM-BEN-BTN-CLK-healthandwellness-love-and-meaning-after-50
MEM-BEN-BTN-LRN-learn-more-aarp-travel-center-powered-by-expedia-10083
MEM-BEN-BTN-LRN-learn-more-embassy-suites-by-hilton-1019

Desired lowercase part:

entertainment-audible
Birmingham, AL, USA
entertainment-games
healthandwellness-love-and-meaning-after-50
learn-more-aarp-travel-center-powered-by-expedia-10083
learn-more-embassy-suites-by-hilton-1019

2 Answers

User

Answer 1 · Edited 2024-10-01 00:15:34

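Split text on a hyphen that is followed by lowercase letters, or on a hyphen followed by a word that starts with a capital letter and continues in lowercase (such as Birmingham).

After the split, take the first element of the resulting list; that is the uppercase part.

Once we have the uppercase part, remove it from the full text to keep the lowercase part.

Code below, and happy coding.

Data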

data=[
  (1,"MEM-BEN-BTN-CLK-healthandwellness-love-and-meaning-after-50"),
  (2,"MEM-BEN-LOC-MODAL-LOCATION-INPUT-Birmingham, AL, USA")
  ]
df=spark.createDataFrame(data, ['id','text'])
df.show(truncate=False)

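Code

from pyspark.sql import functions as F

# 'upper': everything before the first hyphen that starts the lowercase or capitalized part.
# 'lower': the text with that prefix and its trailing hyphen removed, so no leading '-' remains.
(df.withColumn('upper', F.split('text', '\\-(?=[a-z]+)|(\\-[A-Z][a-z]+)')[0])
   .withColumn('lower', F.expr("regexp_replace(text, concat('^', upper, '-'), '')"))
   .show(truncate=False))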
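
The split-then-strip idea can also be checked without a Spark session. The following is a minimal sketch in plain Python using the standard re module; the helper name split_upper_lower and the simplified lookahead-only pattern are assumptions made for illustration, not part of the answer above.

```python
import re

# A hyphen followed by a lowercase letter, or by a capitalized word
# (one uppercase letter then a lowercase letter, e.g. "Birmingham").
# Lookaheads only, so nothing but the hyphen itself is consumed.
BOUNDARY = re.compile(r"-(?=[a-z]|[A-Z][a-z])")

def split_upper_lower(text):
    """Return (upper, lower); lower is None when no boundary is found."""
    parts = BOUNDARY.split(text, maxsplit=1)
    if len(parts) == 1:
        return parts[0], None
    return parts[0], parts[1]

print(split_upper_lower("MEM-BEN-BTN-CLK-healthandwellness-love-and-meaning-after-50"))
print(split_upper_lower("MEM-BEN-LOC-MODAL-LOCATION-INPUT-Birmingham, AL, USA"))
```

Splitting only once (maxsplit=1) returns both halves directly, so no second pass is needed to strip the prefix.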

User

Answer 2 · Edited 2024-10-01 00:15:34

I used a regex to split the strings. You can access the two groups with the match object's group(x) method. More information here: https://docs.python.org/3/library/re.html

import re

strings = ["MEM-BEN-BTN-CLK-entertainment-audible",
    "MEM-BEN-LOC-MODAL-LOCATION-INPUT-Birmingham, AL, USA",
    "MEM-BEN-BTN-CLK-entertainment-games",
    "MEM-BEN-BTN-CLK-healthandwellness-love-and-meaning-after-50",
    "MEM-BEN-BTN-LRN-learn-more-aarp-travel-center-powered-by-expedia-10083",
    "MEM-BEN-BTN-LRN-learn-more-embassy-suites-by-hilton-1019"]

# Raw string so that "\-" is not treated as an invalid string escape.
regex = r"(?P<Click_Upper>[A-Z\-]+)-(?P<Click_Lower>.*)"

for string in strings:
    print(re.match(regex,string).groups())

Here is the output:

('MEM-BEN-BTN-CLK', 'entertainment-audible')
('MEM-BEN-LOC-MODAL-LOCATION-INPUT', 'Birmingham, AL, USA')
('MEM-BEN-BTN-CLK', 'entertainment-games')
('MEM-BEN-BTN-CLK', 'healthandwellness-love-and-meaning-after-50')
('MEM-BEN-BTN-LRN', 'learn-more-aarp-travel-center-powered-by-expedia-10083')
('MEM-BEN-BTN-LRN', 'learn-more-embassy-suites-by-hilton-1019')
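
Because the pattern uses named groups, the two parts can also be read by name instead of by position. The sketch below assumes a hypothetical helper called split_click and adds a guard for strings that do not match, since re.match returns None in that case.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
PATTERN = re.compile(r"(?P<Click_Upper>[A-Z\-]+)-(?P<Click_Lower>.*)")

def split_click(value):
    """Return {'Click_Upper': ..., 'Click_Lower': ...}, or None if there is no match."""
    match = PATTERN.match(value)
    return match.groupdict() if match else None

parts = split_click("MEM-BEN-BTN-CLK-entertainment-audible")
print(parts["Click_Upper"])   # MEM-BEN-BTN-CLK
print(parts["Click_Lower"])   # entertainment-audible

# A string with no uppercase prefix does not match at all.
print(split_click("no-uppercase-prefix"))   # None
```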
