Python pandas: extracting hyphenated words from cells that contain phrases

>>> new
            part1        part2
0  Trail 1 Yellow        Green
1        Kim Jong  il was here
2             NaN          NaN
3          methyl       butane
4         Derp da    derp derp
5             Pok        e-mon

>>> new
    part1   part2
0  Yellow   Green
1    Jong      il
2     NaN     NaN
3  methyl  butane
4      da    derp
5     NaN     NaN

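2 answers

User

Answer 1 · edited 2024-09-30 18:33:33

Given the spec, I can't see where your first NaN, NaN row comes from. Perhaps it is a typo in your example? In any case, here is one possible solution.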

import re

import numpy as np
import pandas as pd

# Return every word in the phrase that contains at least one hyphen
def split_phrase(phrase):
    return re.findall(r'(\w+(?:-\w+)+)', phrase)

# Collect all hyphenated words (df holds the text in its 'Phrases' column);
# a flattening comprehension is used because sum() over lists needs a [] start value
words_with_hyphens = [word for words in df.Phrases.apply(split_phrase) for word in words]
# Split each word on its hyphens
split_words = [word.split('-') for word in words_with_hyphens]
# Keep words with exactly two parts; otherwise use (NaN, NaN)
new_data = [(ws[0], ws[1]) if len(ws) == 2 else (np.nan, np.nan) for ws in split_words]
# Create the new DataFrame
pd.DataFrame(new_data, columns=['part1', 'part2'])

#  part1   | part2
#         
# 0 Yellow | Green
# 1 Jong   | il
# 2 methyl | butane
# 3 da     | derp
# 4 NaN    | NaN
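
To see what `split_phrase` picks up on its own, here is a minimal sketch using only the standard library. The input column is not shown in the question, so the sample phrases below are assumptions reconstructed from its outputs, and the hyphen-free phrase is purely hypothetical.

```python
import re

def split_phrase(phrase):
    # Same pattern as above: a word followed by one or more "-word" groups
    return re.findall(r'(\w+(?:-\w+)+)', phrase)

# Assumed sample phrases (not taken verbatim from the question)
phrases = [
    'Trail 1 Yellow-Green',
    'Kim Jong-il was here',
    'no hyphen in this one',   # hypothetical phrase without a hyphen
    'methyl-butane',
    'Derp da-derp derp',
    'Pok-e-mon',
]

found = [split_phrase(p) for p in phrases]
for phrase, words in zip(phrases, found):
    print(repr(phrase), '->', words)
```

Note that a phrase without any hyphen contributes nothing to the result, so this approach yields one row per hyphenated word rather than one row per input cell. A word such as `Pok-e-mon` is found as a whole and only becomes a NaN row later, when it splits into three parts.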

User

Answer 2 · edited 2024-09-30 18:33:33

You can use the following regular expression:

(?:[^-\w]|^)(?P<part1>[a-zA-Z]+)-(?P<part2>[a-zA-Z]+)(?:[^-\w]|$)

(?:               # non-capturing group
    [^-\w]|^      # a character that is neither a hyphen nor a word character, or the start of the string
)
(?P<part1>
    [a-zA-Z]+     # one or more letters
)-(?P<part2>
    [a-zA-Z]+     # one or more letters
)
(?:[^-\w]|$)      # a character that is neither a hyphen nor a word character, or the end of the string

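Your first problem is that nothing stops . from matching spaces. [a-zA-Z] matches letters only, so a match cannot spill from one word into the next.
For the Pok-e-mon case, you also need to check that there is no hyphen immediately before or after the match.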

See the demo here.
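
One way to apply this pattern to a whole column is `Series.str.extract`, which returns one column per named group and NaN for rows without a match. The following is a sketch under stated assumptions, not necessarily what the answer's author had in mind: the `Phrases` column name is borrowed from the first answer, and the sample data is reconstructed from the question's outputs (the hyphen-free phrase is hypothetical).

```python
import pandas as pd

# Assumed sample data: the question's input column is not shown
df = pd.DataFrame({'Phrases': [
    'Trail 1 Yellow-Green',
    'Kim Jong-il was here',
    'no hyphen in this one',   # hypothetical phrase without a hyphen
    'methyl-butane',
    'Derp da-derp derp',
    'Pok-e-mon',
]})

pattern = r'(?:[^-\w]|^)(?P<part1>[a-zA-Z]+)-(?P<part2>[a-zA-Z]+)(?:[^-\w]|$)'

# Only the named groups become columns; the non-capturing groups are ignored.
# Rows where the pattern does not match are filled with NaN.
new = df['Phrases'].str.extract(pattern)
print(new)
```

Because unmatched rows are kept as NaN, the result stays aligned with the original index, which matches the six-row shape of the desired output in the question.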
