Split a string column into two separate columns by its uppercase and lowercase parts in PySpark/Python/SQL?

Posted 2024-10-01 00:15:34

I have the following data in the Click column:

    MEM-BEN-BTN-CLK-entertainment-audible
    MEM-BEN-LOC-MODAL-LOCATION-INPUT-Birmingham, AL, USA
    MEM-BEN-BTN-CLK-entertainment-games
    MEM-BEN-BTN-CLK-healthandwellness-love-and-meaning-after-50
    MEM-BEN-BTN-LRN-learn-more-aarp-travel-center-powered-by-expedia-10083
    MEM-BEN-BTN-LRN-learn-more-embassy-suites-by-hilton-1019

I want to split the Click column into two columns, Click_Upper and Click_Lower.

Click_Upper (keeps the all-uppercase part)

    MEM-BEN-BTN-CLK
    MEM-BEN-LOC-MODAL-LOCATION-INPUT
    MEM-BEN-BTN-CLK
    MEM-BEN-BTN-CLK
    MEM-BEN-BTN-LRN
    MEM-BEN-BTN-LRN

Click_Lower (keeps the remaining lowercase part)

    entertainment-audible
    Birmingham, AL, USA
    entertainment-games
    healthandwellness-love-and-meaning-after-50
    learn-more-aarp-travel-center-powered-by-expedia-10083
    learn-more-embassy-suites-by-hilton-1019

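I tried the split() function, but the delimiter (-) occurs several times and the strings have different lengths, so the code did not work for me. I also tried something else, but it broke.

Any guidance or help would be much appreciated.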


2 answers

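Split `text` on a `-` that is followed by a lowercase letter, or on a `-` followed by a string that starts with a capital letter and continues with lowercase letters.

After the split, take the first element of the resulting list; that gives `upper`.

Once we have `upper`, remove it from the full text to keep `lower`.

Code below; happy coding.

Data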

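    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # reuse the active session if there is one

    data = [
        (1, "MEM-BEN-BTN-CLK-healthandwellness-love-and-meaning-after-50"),
        (2, "MEM-BEN-LOC-MODAL-LOCATION-INPUT-Birmingham, AL, USA"),
    ]
    df = spark.createDataFrame(data, ['id', 'text'])
    df.show(truncate=False)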

Code

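    from pyspark.sql import functions as F

    # upper: everything before the first "-" that starts the lowercase/mixed-case part
    df = df.withColumn('upper', F.split('text', r'-(?=[a-z])|-[A-Z][a-z]+')[0])
    # lower: the rest of text after upper and the "-" that follows it
    df.withColumn('lower', F.expr("substring(text, length(upper) + 2)")).show(truncate=False)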
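The split-then-trim steps described in this answer can be checked without a Spark session. The following is only a plain-Python sketch of the same idea, not the answer's own code: `split_click` is a hypothetical helper name, and it assumes the lowercase part starts either with a lowercase letter or with a capitalized word such as `Birmingham`.

```python
import re

# A "-" followed by a lowercase letter, or a "-" plus a capitalized word
# such as "-Birmingham" (the same two cases the answer describes).
PATTERN = re.compile(r"-(?=[a-z])|-[A-Z][a-z]+")


def split_click(text):
    """Return (upper, lower) for one click string (hypothetical helper)."""
    # The first piece of the split is the all-uppercase prefix.
    upper = PATTERN.split(text, maxsplit=1)[0]
    # Skip the prefix and the "-" after it to get the remainder.
    lower = text[len(upper) + 1:]
    return upper, lower


rows = [
    "MEM-BEN-BTN-CLK-healthandwellness-love-and-meaning-after-50",
    "MEM-BEN-LOC-MODAL-LOCATION-INPUT-Birmingham, AL, USA",
]
for row in rows:
    print(split_click(row))
```

If no separator matches, `upper` is the whole string and `lower` is empty. A remainder that begins with a digit is not covered by this pattern and would need an extra alternative.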

I used a regex to split the string. You can access the two groups with the match object's group(x) method. More information here: https://docs.python.org/3/library/re.html

    import re

    strings = ["MEM-BEN-BTN-CLK-entertainment-audible",
        "MEM-BEN-LOC-MODAL-LOCATION-INPUT-Birmingham, AL, USA",
        "MEM-BEN-BTN-CLK-entertainment-games",
        "MEM-BEN-BTN-CLK-healthandwellness-love-and-meaning-after-50",
        "MEM-BEN-BTN-LRN-learn-more-aarp-travel-center-powered-by-expedia-10083",
        "MEM-BEN-BTN-LRN-learn-more-embassy-suites-by-hilton-1019"]

    # Raw string so the backslash reaches the regex engine unchanged.
    # Click_Upper: capitals and hyphens; Click_Lower: everything after the next "-".
    regex = r"(?P<Click_Upper>[A-Z\-]+)-(?P<Click_Lower>.*)"

    for string in strings:
        print(re.match(regex, string).groups())

Here is the output:

    ('MEM-BEN-BTN-CLK', 'entertainment-audible')
    ('MEM-BEN-LOC-MODAL-LOCATION-INPUT', 'Birmingham, AL, USA')
    ('MEM-BEN-BTN-CLK', 'entertainment-games')
    ('MEM-BEN-BTN-CLK', 'healthandwellness-love-and-meaning-after-50')
    ('MEM-BEN-BTN-LRN', 'learn-more-aarp-travel-center-powered-by-expedia-10083')
    ('MEM-BEN-BTN-LRN', 'learn-more-embassy-suites-by-hilton-1019')
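One thing the loop above does not handle is a value with no uppercase prefix: `re.match` returns `None` in that case, and calling `.groups()` on it raises `AttributeError`. A minimal sketch of a guarded version follows. `parse_click` is a hypothetical helper name, and returning `None` for the missing prefix is an assumption about the desired behaviour, not something the question specifies.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
CLICK_RE = re.compile(r"(?P<Click_Upper>[A-Z-]+)-(?P<Click_Lower>.*)")


def parse_click(value):
    """Return a dict with Click_Upper and Click_Lower (hypothetical helper)."""
    match = CLICK_RE.match(value)
    if match is None:
        # Assumed fallback: no uppercase prefix, keep the whole value as lower.
        return {"Click_Upper": None, "Click_Lower": value}
    # groupdict() maps each named group to the text it captured.
    return match.groupdict()


print(parse_click("MEM-BEN-BTN-LRN-learn-more-embassy-suites-by-hilton-1019"))
print(parse_click("no-uppercase-prefix"))
```

In PySpark the analogous call would be `F.regexp_extract('text', pattern, 1)` (and index `2` for the second group). Java regular expressions do not accept underscores in group names, so there the groups would need to be unnamed or renamed.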
