Python pandas: extracting hyphenated words from cells that contain phrases

Published 2024-09-30 18:33:33


I have a DataFrame that contains phrases. I want to extract only the compound words joined by a hyphen and put them in another DataFrame.

df=pd.DataFrame({'Phrases': ['Trail 1 Yellow-Green','Kim Jong-il was here', 'President Barack Obama', 'methyl-butane', 'Derp da-derp derp', 'Pok-e-mon'],})

This is what I have so far:

[code block missing from the source page]

Result:

>>> new
            part1        part2
0  Trail 1 Yellow        Green
1        Kim Jong  il was here
2             NaN          NaN
3          methyl       butane
4         Derp da    derp derp
5             Pok        e-mon

What I want is only the hyphenated word itself, like this (note that Pok-e-mon shows up as NaN because it has two hyphens):

>>> new
            part1        part2
0          Yellow        Green
1             Jong          il
2             NaN          NaN
3          methyl       butane
4              da         derp
5             NaN          NaN

2 answers

Given the specification, I don't see where your first NaN, NaN row comes from. Perhaps it is a typo in your example? Either way, here is a possible solution.

import re

import numpy as np
import pandas as pd

# return every word in the phrase that contains at least one hyphen
def split_phrase(phrase):
    return re.findall(r'(\w+(?:-\w+)+)', phrase)

# get all hyphenated words, flattening the per-row lists into one list
# (the [] start value is required: sum() starts from 0, and 0 + list raises TypeError)
words_with_hyphens = sum(df.Phrases.apply(split_phrase).values, [])
# split each word into its parts
split_words = [word.split('-') for word in words_with_hyphens]
# keep words with exactly two parts, else use (NaN, NaN)
new_data = [(ws[0], ws[1]) if len(ws) == 2 else (np.nan, np.nan) for ws in split_words]
# create the new DataFrame
pd.DataFrame(new_data, columns=['part1', 'part2'])

#  part1   | part2
#         
# 0 Yellow | Green
# 1 Jong   | il
# 2 methyl | butane
# 3 da     | derp
# 4 NaN    | NaN
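The result above has five rows, not six: 'President Barack Obama' contains no hyphenated word, so it contributes nothing to the flattened list. That missing row is the NaN, NaN row in the question's desired output. A minimal sketch of a row-aligned variant follows, assuming one output row per phrase is wanted. `split_row` is a hypothetical helper name, and taking the first two-part word in each phrase is an assumption, not something the question specifies.

```python
import math
import re

# Sample phrases copied from the question.
phrases = [
    'Trail 1 Yellow-Green',
    'Kim Jong-il was here',
    'President Barack Obama',
    'methyl-butane',
    'Derp da-derp derp',
    'Pok-e-mon',
]

# A word made of word characters joined by one or more hyphens.
HYPHENATED = re.compile(r'\w+(?:-\w+)+')


def split_row(phrase):
    """Hypothetical helper: return (part1, part2) for the first hyphenated
    word with exactly two parts, or (nan, nan) if the phrase has none."""
    for word in HYPHENATED.findall(phrase):
        parts = word.split('-')
        if len(parts) == 2:
            return parts[0], parts[1]
    return math.nan, math.nan


# One tuple per phrase, in the original order.
rows = [split_row(p) for p in phrases]
```

Passing `rows` to `pd.DataFrame(rows, columns=['part1', 'part2'])` should then give one row per phrase, so the result lines up with the original index.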

You can use the following regular expression:

(?:[^-\w]|^)(?P<part1>[a-zA-Z]+)-(?P<part2>[a-zA-Z]+)(?:[^-\w]|$)

(?:               # non-capturing group
    [^-\w]|^      # a character that is neither a hyphen nor a word character, or the beginning of the string
)
(?P<part1>
    [a-zA-Z]+     # one or more letters
)-(?P<part2>
    [a-zA-Z]+     # one or more letters
)
(?:[^-\w]|$)      # a character that is neither a hyphen nor a word character, or the end of the string
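  • Your first problem is that nothing stops . from consuming spaces. [a-zA-Z] matches letters only, which prevents the match from running from one word into the next.
  • For the Pok-e-mon case, you need to check that there is no hyphen immediately before or after the match.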
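To show how this pattern could be applied to the question's data, here is a minimal sketch using `Series.str.extract`, which returns one column per named group and NaN where the pattern does not match. The variable names `pattern` and `new` are illustrative choices, and the pattern is the one given above, unchanged.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({'Phrases': [
    'Trail 1 Yellow-Green',
    'Kim Jong-il was here',
    'President Barack Obama',
    'methyl-butane',
    'Derp da-derp derp',
    'Pok-e-mon',
]})

# Letters-only parts, with guards on both sides so that a word with a
# further hyphen next to the match (e.g. Pok-e-mon) is rejected.
pattern = r'(?:[^-\w]|^)(?P<part1>[a-zA-Z]+)-(?P<part2>[a-zA-Z]+)(?:[^-\w]|$)'

# The named groups become the columns part1 and part2;
# rows without a match are filled with NaN.
new = df['Phrases'].str.extract(pattern)
print(new)
```

For the sample data this should leave rows 2 ('President Barack Obama') and 5 ('Pok-e-mon') as NaN in both columns, while the result keeps the same index as `df`.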

