How to extract the substring of a column's strings that matches a string in a Python list

Posted 2024-10-02 02:31:08


I have a DataFrame that looks like this:

     col 1                                     col 2
0       59       538 Walton Avenue, Chester, FY6 7NP
1       62 42 Chesterton Road, Peterborough, FR7 2NY
2      179       3 Wallbridge Street, Essex, 4HG 3HT
3      180     6 Stevenage Avenue, Coventry, 7PY 9NP

and a list like this:

[Stevenage, Essex, Coventry, Chester]

Following the solution in "How to check if Pandas rows contain any full string or substring of a list?", I used:

city_list = list(cities["name"])
df["col3"] = np.where(df["col2"].str.contains('|'.join(city_list)), df["col2"], '')

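This finds the rows where the string in col2 matches a string in the list, but col3 ends up as a copy of col2. I want col3 to hold the matching value from the list instead of repeating col2, like this: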

     col 1                                     col 2     col3
0       59       538 Walton Avenue, Chester, FY6 7NP  Chester 
1       62 42 Chesterton Road, Peterborough, FR7 2NY 
2      179       3 Wallbridge Street, Essex, 4HG 3HT    Essex
3      180     6 Stevenage Avenue, Coventry, 7PY 9NP Coventry

I tried:

pat = "|".join(cities.name)
df.insert(0, "name", df["col2"].str.extract('(' + pat + ')', expand = False))

but this raised an error reporting 456 items where 1 was expected.
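One possible explanation, offered as an assumption rather than a diagnosis: `str.extract` returns a single Series only when the pattern has exactly one capture group, so if some city names contain parentheses, the joined pattern gains extra groups and the result becomes a multi-column DataFrame. The sketch below escapes each name with `re.escape`, keeps one capture group, and adds `\b` word boundaries so that `Chester` does not match inside `Chesterton`. The sample data comes from the question; `first_match` is a hypothetical name, and keeping the last match per row is a heuristic that assumes the city follows the street.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "col 1": [59, 62, 179, 180],
    "col 2": [
        "538 Walton Avenue, Chester, FY6 7NP",
        "42 Chesterton Road, Peterborough, FR7 2NY",
        "3 Wallbridge Street, Essex, 4HG 3HT",
        "6 Stevenage Avenue, Coventry, 7PY 9NP",
    ],
})
city_list = ["Stevenage", "Essex", "Coventry", "Chester"]

# Escape every name so characters like "(" or "." are matched literally,
# and try longer names first so a short name cannot shadow a longer one.
escaped = [re.escape(c) for c in sorted(city_list, key=len, reverse=True)]

# Exactly one capture group. The \b boundaries assume that names start
# and end with a letter or digit.
pattern = r"\b(" + "|".join(escaped) + r")\b"

# str.extract keeps the leftmost match, which for the last row is the
# street name "Stevenage" rather than the city.
first_match = df["col 2"].str.extract(pattern, expand=False)

# Assumption: the city comes after the street, so keep the last match.
# Rows without any match yield NaN, which fillna turns into "".
df["col3"] = df["col 2"].str.findall(pattern).str[-1].fillna("")
```

With the sample data this fills col3 with Chester, an empty string, Essex and Coventry. An alternation of roughly 170,000 names may be slow, so it is worth timing on a sample before running it on the full frame.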

I also tried:

df["col2"] = df["col2"].apply(lambda x: difflib.get_close_matches(x, cities["name"])[0])
df.merge(cities)

but this failed with a "list index out of range" error.
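The index error is consistent with how `difflib.get_close_matches` behaves: it returns an empty list when no candidate reaches the cutoff, and a whole address is far longer than any city name, so its similarity ratio stays well below the default cutoff of 0.6. A minimal sketch of a guarded variant follows; it compares each comma-separated part of the address instead of the whole string. The helper name `closest_city`, the stricter cutoff of 0.8, and the assumption that parts are comma-separated are illustrative choices, not part of the question.

```python
import difflib

city_list = ["Stevenage", "Essex", "Coventry", "Chester"]
addresses = [
    "538 Walton Avenue, Chester, FY6 7NP",
    "42 Chesterton Road, Peterborough, FR7 2NY",
    "3 Wallbridge Street, Essex, 4HG 3HT",
    "6 Stevenage Avenue, Coventry, 7PY 9NP",
]

def closest_city(address, choices, cutoff=0.8):
    """Return the first close match among the address parts, or ""."""
    for part in address.split(","):
        # An empty list means nothing was close enough, so never index
        # the result without checking it first.
        matches = difflib.get_close_matches(part.strip(), choices, n=1, cutoff=cutoff)
        if matches:
            return matches[0]
    return ""

results = [closest_city(a, city_list) for a in addresses]
```

The cutoff of 0.8 is deliberate: at the default 0.6 the part "6 Stevenage Avenue" would already count as close to "Stevenage". Fuzzy matching every row against about 170,000 names is likely to be slow, so this is better suited to the rows that an exact lookup leaves unmatched.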

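Is there a way to do this? df1 has about 160,000 rows, the addresses in col2 come from different countries and so follow no standard format, and the city list has about 170,000 entries.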

Thanks, everyone.


3 Answers

You could rely on a helper function like this:

import pandas as pd

df = pd.DataFrame({'col 1': [59, 62, 179, 180],
                   'col 2': ['538 Walton Avenue, Chester, FY6 7NP',
                             '42 Chesterton Road, Peterborough, FR7 2NY',
                             '3 Wallbridge Street, Essex, 4HG 3HT',
                             '6 Stevenage Avenue, Coventry, 7PY 9NP'
                             ]})

def aux_func(x):

    # split by comma and select the interesting part ([1])
    x = x.split(',')
    x = x[1]

    aux_list = ['Stevenage', 'Essex', 'Coventry', 'Chester']
    for v in aux_list:
        if v in x:
            return v
    return ""

df['col 3'] = [aux_func(name) for name in df['col 2']]
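The helper above scans the whole city list for every row, which may become expensive with around 170,000 cities. As a rough sketch of one alternative, the cities can be stored in a set and each comma-separated field checked for exact membership. This assumes the city appears as its own field in the address, which will not hold for every country's format; `find_city` and `city_set` are hypothetical names.

```python
import pandas as pd

df = pd.DataFrame({
    "col 1": [59, 62, 179, 180],
    "col 2": [
        "538 Walton Avenue, Chester, FY6 7NP",
        "42 Chesterton Road, Peterborough, FR7 2NY",
        "3 Wallbridge Street, Essex, 4HG 3HT",
        "6 Stevenage Avenue, Coventry, 7PY 9NP",
    ],
})

# A set gives constant-time membership tests on average, regardless of
# how many cities it holds.
city_set = {"Stevenage", "Essex", "Coventry", "Chester"}

def find_city(address, cities):
    """Return the first comma-separated field that is exactly a known city."""
    for part in address.split(","):
        part = part.strip()
        if part in cities:
            return part
    return ""

df["col 3"] = [find_city(address, city_set) for address in df["col 2"]]
```

Because the comparison is exact, "6 Stevenage Avenue" and "42 Chesterton Road" do not match, so the sample rows give Chester, an empty string, Essex and Coventry.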

You can do it like this:

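import difflib

city_list = ["Stevenage", "Essex", "Coventry", "Chester"]

def get_match(row):
    # Split the address into tokens; adapt this preprocessing to your data
    col_2 = row["col 2"].replace(",", " ").split()
    for c in city_list:
        # get_close_matches(word, possibilities): the city is the word and
        # the address tokens are the candidates. With the default cutoff of
        # 0.6, a similar token such as "Chesterton" also matches "Chester".
        if difflib.get_close_matches(c, col_2):
            return c
    return ""

df["col 3"] = df.apply(get_match, axis=1)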

Take a look at str.contains, which tests whether a pattern matches each string in a Series:

import pandas as pd

df = pd.DataFrame([[59, '538 Walton Avenue, Chester,', 'FY6 7NP'],
                   [62, '42 Chesterton Road, Peterborough', '4HG 3HT'],
                   [179, '3 Wallbridge Street, Essex', '4HG 3HT'],
                   [180, '6 Stevenage Avenue, Coventry', '7PY 9NP']])
city_list = ["Stevenage", "Essex", "Coventry", "Chester"]
for city in city_list:
    df.loc[df[1].str.contains(city), 'match'] = city
