String pattern matching and indexing between two columns

Posted 2024-06-25 06:06:46


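I have a data frame with two text columns. The values in one column (say column B) are essentially substrings of the full strings in the other column (say column A). I want to find the pattern in each row and check for trends in the position of the substring, or in the leading characters of the column A strings. So I want to generate three columns: one with the position of the substring, and two more with the characters that precede and follow it.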

Here is what the dataframe looks like:

| Col A       | Col B |
-----------------------
| AGHXXXJ002  | XXX   |
| AGHGHJJ002  | GHJ   |
| ABCRTGHP001 | RTGH  |
| ABCDFFP01   | DFF   |
| ABCXGHJD09  | XGH   |

Now, based on the pattern above, I want to generate three columns:

| Col A       | Col B | Position | Preceding Chars | Following Chars |
-----------------------------------------------------------------------
| AGHXXXJ002  | XXX   | [3, 5]   | AGH             | J002            |
| AGHGHJJ002  | GHJ   | [3, 5]   | AGH             | J002            |
| ABCRTGHP001 | RTGH  | [3, 6]   | ABC             | P001            |
| ABCDFFP01   | DFF   | [3, 5]   | ABC             | P01             |
| ABCXGHJD09  | XGH   | [3, 5]   | ABC             | JD09            |
| HGMQQUTV01  | HGM   | [0, 2]   | NaN             | QQUTV01         |
| GBHUJJS099  | BHU   | [1, 3]   | G               | JJS099          |

(In the first row, Position is [3, 5] because XXX starts at index 3 and ends at index 5.)

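This is the output I want. I tried using a for loop and removing the substring, but it never ran properly, so I deleted the code. So far I have been doing this by hand, but with more than 50,000 rows that is not feasible. Also, the "Position" column could be split into two separate columns: "Start Position" and "End Position".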


3 answers

Because we are dealing with row-wise operations on strings, there is no straightforward vectorized way to do this.

Let's use np.char.find (the NumPy counterpart of str.find) together with str.split to build the new columns:

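import numpy as np
import pandas as pd

# Note: I've removed the spaces in your column names ("Col A" -> "ColA").
# df is assumed to already hold the question's data in ColA and ColB.

# Split each ColA value around the first occurrence of its ColB substring.
s = pd.DataFrame(df.apply(lambda x: x['ColA'].split(x['ColB'], 1), axis=1).tolist())
# Start index of the substring in each row (-1 if it is not found).
idx = df.apply(lambda x: int(np.char.find(x['ColA'], x['ColB'])), axis=1)

# Pair each start index with the inclusive end index.
pos = zip(idx.values, (idx - 1 + df["ColB"].str.len()).values)

df["Position"] = list(pos)
df['Preceding Chars'], df['Following Chars'] = s[0], s[1]

print(df)

        ColA  ColB Position  Preceding Chars Following Chars
0   AGHXXXJ002   XXX   (3, 5)              AGH            J002
1   AGHGHJJ002   GHJ   (3, 5)              AGH            J002
2  ABCRTGHP001  RTGH   (3, 6)              ABC            P001
3    ABCDFFP01   DFF   (3, 5)              ABC             P01
4   ABCXGHJD09   XGH   (3, 5)              ABC            JD09
5   HGMQQUTV01   HGM   (0, 2)                          QQUTV01
6   GBHUJJS099   BHU   (1, 3)                G          JJS099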
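
The question also mentions that "Position" could be split into separate start and end columns. Here is a minimal sketch of one way to do that, assuming "Position" holds (start, end) tuples as in the output above. The small frame and the column names "Start Position" and "End Position" are my own choices for illustration.

```python
import pandas as pd

# Small stand-in for the frame built above; Position holds (start, end) tuples.
df = pd.DataFrame({
    "ColA": ["AGHXXXJ002", "ABCRTGHP001", "HGMQQUTV01"],
    "ColB": ["XXX", "RTGH", "HGM"],
    "Position": [(3, 5), (3, 6), (0, 2)],
})

# Expand each tuple into two columns. The new column names are illustrative,
# not something the original answer defines.
df[["Start Position", "End Position"]] = pd.DataFrame(
    df["Position"].tolist(), index=df.index
)

print(df)
```

Passing index=df.index keeps the new columns aligned with the original rows even if the frame does not use a default 0..n-1 index.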
import pandas as pd

# Prepare test data
dct = {'Col A': {0: 'AGHXXXJ002',
  1: 'AGHGHJJ002',
  2: 'ABCRTGHP001',
  3: 'ABCDFFP01',
  4: 'ABCXGHJD09'},
 'Col B': {0: 'XXX', 1: 'GHJ', 2: 'RTGH', 3: 'DFF', 4: 'XGH'}}

df = pd.DataFrame.from_dict(dct)

# Split each 'Col A' value around the first occurrence of its 'Col B' substring,
# e.g. 'AGHXXXJ002'.split('XXX', 1) -> ['AGH', 'J002']
tmp_lst = [a.split(b, 1) for a, b in zip(df['Col A'], df['Col B'])]
df['Preceding Chars'] = [c[0] for c in tmp_lst]   # ['AGH', 'J002'][0] -> 'AGH'
df['Following Chars'] = [c[1] for c in tmp_lst]   # ['AGH', 'J002'][1] -> 'J002'
# Start index = length of the prefix; inclusive end = start + len(substring) - 1
df['Position'] = [[len(p), len(p) + len(b) - 1] for p, b in zip(df['Preceding Chars'], df['Col B'])]

df
Out[1]:

    Col A       Col B   Preceding Chars Following Chars Position
0   AGHXXXJ002  XXX     AGH             J002            [3, 5]
1   AGHGHJJ002  GHJ     AGH             J002            [3, 5]
2   ABCRTGHP001 RTGH    ABC             P001            [3, 6]
3   ABCDFFP01   DFF     ABC             P01             [3, 5]
4   ABCXGHJD09  XGH     ABC             JD09            [3, 5]
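
The split-based approaches above assume that every "Col B" value occurs in its "Col A" string. As a hedged alternative, str.partition always returns exactly three parts and cuts at the first occurrence, which makes the "not found" case easy to detect. The sketch below assumes "Col B" is never an empty string (str.partition raises ValueError for an empty separator). The helper name locate, the extra test rows, and the choice of None for rows without a match are all my own assumptions.

```python
import pandas as pd

def locate(text, sub):
    """Return (start, inclusive_end, preceding, following) for the first match.

    Hypothetical helper for illustration; every slot is None when `sub`
    does not occur in `text` (an assumption about the desired behaviour).
    """
    before, found, after = text.partition(sub)
    if not found:  # partition leaves the separator slot empty when nothing matched
        return (None, None, None, None)
    start = len(before)
    return (start, start + len(sub) - 1, before, after)

# Illustrative data: a repeated substring (row 2) and a missing one (row 3).
df = pd.DataFrame({
    "Col A": ["AGHXXXJ002", "GBHUJJS099", "ABCABCABC", "ABCDFFP01"],
    "Col B": ["XXX", "BHU", "ABC", "ZZZ"],
})

rows = [locate(a, b) for a, b in zip(df["Col A"], df["Col B"])]
out = pd.DataFrame(
    rows,
    columns=["Start", "End", "Preceding Chars", "Following Chars"],
    index=df.index,
)
df = df.join(out)
print(df)
```

A plain list comprehension over zip is used here instead of DataFrame.apply because it avoids building a Series object for every row, which tends to matter at the 50,000-row scale mentioned in the question.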

Maybe this will help you:

>>> import re
>>> import pandas

>>> df = pandas.DataFrame([["AGHXXXJ002", "XXX"], ["AGHGHJJ002", "GHJ"], ["ABCRTGHP001", "RTGH"], ["ABCDFFP01", "DFF"], ["ABCXGHJD09", "XGH"]], columns=["Col A", "Col B"])
>>> df
         Col A Col B
0   AGHXXXJ002   XXX
1   AGHGHJJ002   GHJ
2  ABCRTGHP001  RTGH
3    ABCDFFP01   DFF
4   ABCXGHJD09   XGH

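>>> def get_position(row):
...     # re.escape makes "Col B" match literally, even if it contains regex metacharacters
...     match = re.search(re.escape(row["Col B"]), row["Col A"])
...     if match:
...         return match.span()  # (start, end), where end is exclusive
...     return (-1, -1)
... 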
>>> df["Position"] = df.apply(get_position, axis=1)
>>> df
         Col A Col B Position
0   AGHXXXJ002   XXX   (3, 6)
1   AGHGHJJ002   GHJ   (3, 6)
2  ABCRTGHP001  RTGH   (3, 7)
3    ABCDFFP01   DFF   (3, 6)
4   ABCXGHJD09   XGH   (3, 6)

>>> def get_preceding(row):
...     if row["Position"][0] == -1:
...             return ""
...     return row["Col A"][:row["Position"][0]]
... 
>>> df["Preceding Chars"] = df.apply(get_preceding, axis=1)
>>> df
         Col A Col B Position Preceding Chars
0   AGHXXXJ002   XXX   (3, 6)             AGH
1   AGHGHJJ002   GHJ   (3, 6)             AGH
2  ABCRTGHP001  RTGH   (3, 7)             ABC
3    ABCDFFP01   DFF   (3, 6)             ABC
4   ABCXGHJD09   XGH   (3, 6)             ABC

>>> def get_following(row):
...     if row["Position"][1] == -1:
...             return ""
...     return row["Col A"][row["Position"][1]:]
... 
>>> df["Following Chars"] = df.apply(get_following, axis=1)
>>> df
         Col A Col B Position Preceding Chars Following Chars
0   AGHXXXJ002   XXX   (3, 6)             AGH            J002
1   AGHGHJJ002   GHJ   (3, 6)             AGH            J002
2  ABCRTGHP001  RTGH   (3, 7)             ABC            P001
3    ABCDFFP01   DFF   (3, 6)             ABC             P01
4   ABCXGHJD09   XGH   (3, 6)             ABC            JD09
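
Note that match.span() reports an exclusive end, so the positions above read (3, 6) rather than the [3, 5] shown in the question. If the inclusive form is wanted, a small adjustment is enough. The sketch below is only one possible way to do it; the helper name inclusive_span and the row with no match are hypothetical additions for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame(
    [["AGHXXXJ002", "XXX"], ["ABCRTGHP001", "RTGH"], ["ABCDFFP01", "ZZZ"]],
    columns=["Col A", "Col B"],
)

def inclusive_span(text, sub):
    """Hypothetical helper: [start, end] with an inclusive end, or [-1, -1]."""
    match = re.search(re.escape(sub), text)
    if match is None:
        return [-1, -1]
    start, end = match.span()  # end is one past the last matched character
    return [start, end - 1]

df["Position"] = [inclusive_span(a, b) for a, b in zip(df["Col A"], df["Col B"])]
print(df)
```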
