How do I extract only whole words from a string in Python?

Posted 2024-06-28 20:27:57


I want to extract only whole words from a string.

I have this df:

                     Students  Age
0           Boston Terry Emma   23
1      Tommy Julien Cambridge   20
2                      London   21
3                New York Liu   30
4  Anna-Madrid+       Pauline   26
5         Mozart    Cambridge   27
6             Gigi Tokyo Lily   18
7      Paris Diane Marie Dive   22

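I want to extract whole words from the string, not fragments of them. For example, the entry 'iu' in liked_names should match only a standalone word iu, not the "iu" inside Liu, because Liu is not iu.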

cities = ['Boston', 'Cambridge', 'Bruxelles', 'New York', 'London', 'Amsterdam', 'Madrid', 'Tokyo', 'Paris']
liked_names = ['Emma', 'Pauline', 'Tommy Julien', 'iu']

Desired df:

                     Students  Age     Cities   Liked Names
0           Boston Terry Emma   23     Boston          Emma
1      Tommy Julien Cambridge   20  Cambridge  Tommy Julien
2                      London   21     London           NaN
3                New York Liu   30   New York           NaN
4  Anna-Madrid+       Pauline   26     Madrid       Pauline
5         Mozart    Cambridge   27  Cambridge           NaN
6             Gigi Tokyo Lily   18      Tokyo           NaN
7      Paris Diane Marie Dive   22      Paris           NaN

I tried the following code:

pat = f'({"|".join(cities)})'
df['Cities'] = df['Students'].str.extract(pat, expand=False)
pat = f'({"|".join(liked_names)})'
df['Liked Names'] = df['Students'].str.extract(pat, expand=False)

My code works for the cities; I only need to fix the problem with "Liked Names".

How can I do this? Thanks a lot.


3 answers

You can add an extra check that verifies the matched name actually appears as whole words in the Students column:

import numpy as np

def check(row):
    if row['Liked Names'] == row['Liked Names']:
        # If `Liked Names` is not nan

        # Get all possible names
        patterns = row['Students'].split(' ')

        # If matched `Liked Names` in `Students`
        isAllMatched = all([name in patterns for name in row['Liked Names'].split(' ')])

        if not isAllMatched:
            return np.nan
        else:
            return row['Liked Names']
    else:
        # If `Liked Names` is nan, still return nan
        return np.nan

df['Liked Names'] = df.apply(check, axis=1)
# print(df)

                     Students  Age     Cities   Liked Names
0           Boston Terry Emma   23     Boston          Emma
1      Tommy Julien Cambridge   20  Cambridge  Tommy Julien
2                      London   21     London           NaN
3                New York Liu   30   New York           NaN
4  Anna-Madrid+       Pauline   26     Madrid       Pauline
5         Mozart    Cambridge   27  Cambridge           NaN
6             Gigi Tokyo Lily   18      Tokyo           NaN
7      Paris Diane Marie Dive   22      Paris           NaN

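I think what you are looking for is word boundaries. In a regular expression they are written \b. An ugly (though workable) solution is to modify the liked_names list so that each entry includes word boundaries, and then run your code: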

import pandas as pd

l = [
    ["Boston Terry Emma", 23],
    ["Tommy Julien Cambridge", 20],
    ["London", 21],
    ["New York Liu", 30],
    ["Anna-Madrid+       Pauline", 26],
    ["Mozart    Cambridge", 27],
    ["Gigi Tokyo Lily", 18],
    ["Paris Diane Marie Dive", 22],
]

cities = [
    "Boston",
    "Cambridge",
    "Bruxelles",
    "New York",
    "London",
    "Amsterdam",
    "Madrid",
    "Tokyo",
    "Paris",
]
liked_names = ["Emma", "Pauline", "Tommy Julien", "iu"]
# here we modify the liked_names to include word boundaries.
liked_names = [r"\b" + n + r"\b" for n in liked_names]
df = pd.DataFrame(l, columns=["Students", "Age"])

pat = f'({"|".join(cities)})'
df["Cities"] = df["Students"].str.extract(pat, expand=False)
pat = f'({"|".join(liked_names)})'
df["Liked Names"] = df["Students"].str.extract(pat, expand=False)

print(df)

A better solution is to include the word boundaries when building the regular expression, instead of modifying the list.
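As a minimal sketch of that idea, the boundaries can be placed once around the whole alternation, leaving liked_names untouched. The use of re.escape and the names alternation and pat are assumptions added for illustration, not part of the original answer; the data is the df from the question.

```python
import re

import pandas as pd

df = pd.DataFrame(
    {
        "Students": [
            "Boston Terry Emma",
            "Tommy Julien Cambridge",
            "London",
            "New York Liu",
            "Anna-Madrid+       Pauline",
            "Mozart    Cambridge",
            "Gigi Tokyo Lily",
            "Paris Diane Marie Dive",
        ],
        "Age": [23, 20, 21, 30, 26, 27, 18, 22],
    }
)
liked_names = ["Emma", "Pauline", "Tommy Julien", "iu"]

# Wrap the whole alternation in \b ... \b instead of editing the list.
# re.escape (an addition here) guards against names that contain
# regex metacharacters such as "+" or ".".
alternation = "|".join(re.escape(name) for name in liked_names)
pat = rf"\b({alternation})\b"

# expand=False returns a Series, so it can be assigned to one column.
df["Liked Names"] = df["Students"].str.extract(pat, expand=False)
print(df)
```

With this pattern "iu" no longer matches inside "Liu", because there is no word boundary between "L" and "i".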

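I first tried \s, which matches a whitespace character, but it does not work when the name is at the end of the string, where no whitespace follows it. That is why \b is the solution. See https://regular-expressions.mobi/wordboundaries.html?wlr=1 for more details.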
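The difference between \s and \b can be checked with the standard re module alone. This is a small illustration on two strings taken from the question's data; the variable names are hypothetical.

```python
import re

text = "Boston Terry Emma"

# \s must consume a real whitespace character, so this fails:
# nothing follows "Emma" at the end of the string.
with_space = re.search(r"\sEmma\s", text)

# \b is zero-width, so it also matches at the start or end of the string.
with_boundary = re.search(r"\bEmma\b", text)

# \b still rejects a fragment inside a longer word.
partial = re.search(r"\biu\b", "New York Liu")

print(with_space, with_boundary, partial)
```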

You can try the following regular expression, which expands each match to the whole word that contains the name:

liked_names = ["Emma", "Pauline", "Tommy Julien", "iu"]

pat = (
    "(" + "|".join(r"[a-zA-Z]*{}[a-zA-Z]*".format(n) for n in liked_names) + ")"
)

df["Liked Names"] = df["Students"].str.extract(pat)
print(df)

Prints:

                     Students  Age   Liked Names
0           Boston Terry Emma   23          Emma
1      Tommy Julien Cambridge   20  Tommy Julien
2                      London   21           NaN
3                New York Liu   30           Liu
4  Anna-Madrid+       Pauline   26       Pauline
5         Mozart    Cambridge   27           NaN
6             Gigi Tokyo Lily   18           NaN
7      Paris Diane Marie Dive   22           NaN
