Splitting strings in Pandas based on a specific substring or pattern

Posted 2024-10-01 22:30:42


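Thank you for your help with my previous question. It was very useful.

Now I have another problem that I have been thinking about, which builds on my previous question. I have my cleansed input, and I want to split off the main company name and put it in a separate column, based on certain substrings or patterns.

Below is my input: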

Original_Input                                                 Cleansed_Input
Iris Diagnostics, a Division of Iris International Inc         Iris Diagnostics a Division of Iris International Inc
GINGI-PAK a division of The Belport Co., Inc.                  GINGIPAK a division of The Belport Co Inc
Plastiflex Healthcare Division of Plastiflex Group NV          Plastiflex Healthcare Division of Plastiflex Group NV
Heuer International (A division of GST Corporation Limited)    Heuer International A division of GST Corporation Limited
Arrow International, Inc. (subsidiary of Teleflex, Inc.)       Arrow International Inc subsidiary of Teleflex Inc
Filtertek, B.V. (An ITW Medical Company)                       Filtertek BV An ITW Medical Company
Fitz c/o YBI                                                   Fitz co YBI

My expected output is:

Original_Input                                                 Cleansed_Input
Iris Diagnostics, a Division of Iris International Inc         Iris Diagnostics a Division of Iris International Inc
GINGI-PAK a division of The Belport Co., Inc.                  GINGIPAK a division of The Belport Co Inc
Plastiflex Healthcare Division of Plastiflex Group NV          Plastiflex Healthcare Division of Plastiflex Group NV
Heuer International (A division of GST Corporation Limited)    Heuer International A division of GST Corporation Limited
Arrow International, Inc. (subsidiary of Teleflex, Inc.)       Arrow International Inc subsidiary of Teleflex Inc
Filtertek, B.V. (An ITW Medical Company)                       Filtertek BV An ITW Medical Company
Fitz c/o YBI                                                   Fitz co YBI

Parent_company
Iris Diagnostics
GINGIPAK 
Plastiflex Healthcare 
Heuer International
Arrow International Inc 
Filtertek BV
Fitz 

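So the string or words before "a division of", "division of", "a", "an", "subsidiary of", or "c/o" should be treated as the parent company.

The code I am using is: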

data['Parent_Company'] = re.sub('A division of','',str(data['Cleansed_Input']))

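I am not getting the desired output. I want either to blank out everything from these separators onward so that only the company name remains, or to split off the name before these separators and store it as the parent company.

Thanks in advance for your help.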


1 answer
User
#1 · posted 2024-10-01 22:30:42

You can do this with a regular expression and apply. Something like this should work:

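import re

import pandas as pd

def get_parent_company(name):
    # Separators that mark the end of the parent company name.
    keywords = ["a division of", "co", "subsidiary of", "division of", "an"]
    # Lazy prefix (group 1) followed by the first whole-word keyword.
    regex = r"(.*?)(\b{}\b)".format("\\b|\\b".join(keywords))
    match = re.search(regex, name, re.IGNORECASE)
    # Return None when the name contains no keyword.
    return match.group(1).strip() if match else None

# df is assumed to be the DataFrame holding the Cleansed_Input column from the question.
df["Parent_Company"] = df["Cleansed_Input"].apply(get_parent_company)
print(df)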

Output:

                                      Cleansed_Input           Parent_Company
0  Iris Diagnostics a Division of Iris Internatio...         Iris Diagnostics 
1          GINGIPAK a division of The Belport Co Inc                 GINGIPAK 
2  Plastiflex Healthcare Division of Plastiflex G...    Plastiflex Healthcare 
3  Heuer International A division of GST Corporat...      Heuer International 
4  Arrow International Inc subsidiary of Teleflex...  Arrow International Inc 
5                Filtertek BV An ITW Medical Company             Filtertek BV 
6                                        Fitz co YBI                     Fitz 
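
As a possible alternative to apply, the same extraction can be sketched with the vectorized Series.str.extract. This is only an illustration, not part of the original answer: the DataFrame is rebuilt here from the Cleansed_Input values in the question, and the pattern is a slight variant that uses a non-capturing group for the keywords and consumes the whitespace before them.

```python
import re

import pandas as pd

# Cleansed_Input values copied from the question.
df = pd.DataFrame({"Cleansed_Input": [
    "Iris Diagnostics a Division of Iris International Inc",
    "GINGIPAK a division of The Belport Co Inc",
    "Plastiflex Healthcare Division of Plastiflex Group NV",
    "Heuer International A division of GST Corporation Limited",
    "Arrow International Inc subsidiary of Teleflex Inc",
    "Filtertek BV An ITW Medical Company",
    "Fitz co YBI",
]})

# Assumed variant of the answer's regex: one capture group (the lazy prefix),
# optional whitespace, then any keyword as a whole word.
pattern = r"^(.*?)\s*\b(?:a division of|co|subsidiary of|division of|an)\b"

# With a single capture group and expand=False, str.extract returns a Series.
# Rows without any keyword would come back as NaN.
df["Parent_Company"] = df["Cleansed_Input"].str.extract(
    pattern, flags=re.IGNORECASE, expand=False
)
print(df["Parent_Company"].tolist())
```

Whether this is preferable to apply depends mostly on taste; the regular expression does the same work either way.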

Explanation

The final regex built in get_parent_company looks like this:

(.*?)(\ba division of\b|\bco\b|\bsubsidiary of\b|\bdivision of\b|\ban\b)

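(.*?) is the capture group we want. It matches any characters (.*), but as few of them as possible (?). This is required so that the match stops at the first occurrence of a keyword. Otherwise our match for

GINGIPAK a division of The Belport Co Inc

would be

GINGIPAK a division of The Belport

because the last keyword match is Co, which is also one of our keywords, but we want to stop at the first one, a division of.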
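
The difference between the lazy and the greedy prefix can be checked directly. The snippet below is a small sketch that reuses the keyword alternation from the answer and only swaps (.*?) for (.*); the variable names are illustrative.

```python
import re

# Keyword alternation copied from the answer's final regex.
KEYWORDS = r"(\ba division of\b|\bco\b|\bsubsidiary of\b|\bdivision of\b|\ban\b)"
name = "GINGIPAK a division of The Belport Co Inc"

# Lazy prefix: stops at the first keyword ("a division of").
lazy = re.search(r"(.*?)" + KEYWORDS, name, re.IGNORECASE)
# Greedy prefix: backtracks from the end and stops at the last keyword ("Co").
greedy = re.search(r"(.*)" + KEYWORDS, name, re.IGNORECASE)

print(lazy.group(1).strip())    # GINGIPAK
print(greedy.group(1).strip())  # GINGIPAK a division of The Belport
```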

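The rest is our keywords joined with OR (|) so that any one of them matches. We put \b before and after each keyword so that it only matches as a whole word. Otherwise a keyword could match inside a longer word; for example, a match on the co inside Corporation would turn

Heuer International A division of GST Corporation Limited

into

Heuer International A division of GST 

because Corporation contains co, but we only want to match co as a whole word.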
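
The effect of \b can be illustrated with a short sketch. The first two searches use text from the question; the company name "Scott Medical division of Example Corp" is hypothetical and was chosen only because "Scott" happens to contain "co".

```python
import re

keywords = ["a division of", "co", "subsidiary of", "division of", "an"]
# Same construction as in the answer, with word boundaries around each keyword.
with_boundaries = r"(.*?)(\b{}\b)".format("\\b|\\b".join(keywords))
# Variant without word boundaries, for comparison only.
without_boundaries = r"(.*?)({})".format("|".join(keywords))

# "co" as a bare substring is found inside "Corporation" ...
inside_word = re.search("co", "GST Corporation Limited", re.IGNORECASE)
# ... but not when it has to be a whole word.
whole_word = re.search(r"\bco\b", "GST Corporation Limited", re.IGNORECASE)
print(inside_word is not None, whole_word is None)  # True True

# Hypothetical company name, not taken from the question's data.
name = "Scott Medical division of Example Corp"
strict = re.search(with_boundaries, name, re.IGNORECASE).group(1).strip()
loose = re.search(without_boundaries, name, re.IGNORECASE).group(1).strip()
print(strict)  # Scott Medical
print(loose)   # S
```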

Finally, we take the first capture group of the match, match.group(1).strip(), which is (.*?), and strip the trailing whitespace.
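
As a side note on why the attempt in the question does not produce per-row results: calling str() on a whole column yields a single string (the printed form of the Series), so re.sub returns one string that is then assigned to every row. The sketch below reproduces this with two rows taken from the question; the variable names are illustrative.

```python
import re

import pandas as pd

# Two Cleansed_Input values copied from the question.
data = pd.DataFrame({"Cleansed_Input": [
    "Heuer International A division of GST Corporation Limited",
    "Fitz co YBI",
]})

# str() on a Series gives one string for the whole column, not one per row.
as_text = str(data["Cleansed_Input"])
print(type(as_text).__name__)  # str

# re.sub returns a single string, which pandas broadcasts to every row.
data["Parent_Company"] = re.sub("A division of", "", as_text)
print(data["Parent_Company"].nunique())  # 1
```

Even applied row by row, re.sub would only delete the separator and keep the text after it, which is why the answer captures the text before the separator instead.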
