Regex for cleaning up names

Posted 2024-09-30 01:24:29


I have two DataFrames of names. The DataFrames are long, but I am using the top 3 rows of each as examples.

First list name examples: 
JOSEPH W. JOHN
MIMI N. ALFORD
WANG E. Li

Second list name examples:
AAMIR, DENNIS M
MAHAMMED, LINDA X
ABAD, FARLEY J

I need to extract the first name from both DataFrames. How can I extract them with a single regular expression?

The result should be:
list 1
JOSEPH
MIMI
WANG

list 2
DENNIS
LINDA
FARLEY

My current code is re.search(r'(?<=,)\w+', df['name']), but it does not return the correct names. Alternatively, could this be done in Python with two separate regular expressions?


2 Answers

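It seems that what you are looking for is the first sequence of word characters that has no comma anywhere after it, rather than the sequence that has a comma directly before it. So what you need is not a positive lookbehind assertion but a negative lookahead assertion.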
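To see why the original attempt finds nothing, here is a minimal sketch that uses only the standard `re` module on plain strings. The two sample names are copied from the question and stand in for values of `df['name']`. The lookbehind `(?<=,)` requires the match to start directly after the comma, but in the data the comma is followed by a space, and names without a comma can never match at all. Note also that `re.search` expects a single string, so passing the whole column `df['name']` raises a `TypeError`.

```python
import re

# Sample names copied from the question (stand-ins for values in df['name']).
names = ["JOSEPH W. JOHN", "AAMIR, DENNIS M"]

original = re.compile(r"(?<=,)\w+")  # pattern from the question
for name in names:
    # Neither string has a word character immediately after a comma,
    # so both searches return None.
    print(name, "->", original.search(name))

# Allowing for the space after the comma handles the second format only;
# names without a comma still do not match.
with_space = re.compile(r"(?<=, )\w+")
print(with_space.search("AAMIR, DENNIS M").group())  # DENNIS
print(with_space.search("JOSEPH W. JOHN"))           # None
```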

Try using the following as your regex:

r'\w+(?!.*,)'

Apply it as follows:

import re
df['name'].apply(lambda name: re.search(r'\w+(?!.*,)', name).group())

Applying the above to this example DataFrame:

                name   foo
0     JOSEPH W. JOHN     1
1     MIMI N. ALFORD     3
2         WANG E. Li     3
3    AAMIR, DENNIS M     3
4  MAHAMMED, LINDA X     3
5     ABAD, FARLEY J     3

gives:

0    JOSEPH
1      MIMI
2      WANG
3    DENNIS
4     LINDA
5    FARLEY
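The same pattern can be checked without pandas. The sketch below is a minimal illustration, not part of the original answer: `first_name` is a hypothetical helper name, and it returns `None` when nothing matches instead of raising `AttributeError`, which is what `.group()` on a failed search would do (for example on an empty string).

```python
import re

# Negative lookahead: a run of word characters with no comma anywhere after it.
FIRST_NAME = re.compile(r"\w+(?!.*,)")

def first_name(name):
    """Return the first name, or None if nothing matches (hypothetical helper)."""
    match = FIRST_NAME.search(name)
    return match.group() if match else None

# The six sample names from the example DataFrame above.
names = [
    "JOSEPH W. JOHN", "MIMI N. ALFORD", "WANG E. Li",
    "AAMIR, DENNIS M", "MAHAMMED, LINDA X", "ABAD, FARLEY J",
]
extracted = [first_name(n) for n in names]
print(extracted)
# ['JOSEPH', 'MIMI', 'WANG', 'DENNIS', 'LINDA', 'FARLEY']
```

With pandas, the helper could be passed to the column directly, as in `df['name'].apply(first_name)`.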

Use

df['First Name'] = df['name'].str.extract(r'(?:(?<=^(?!.*,))|(?<=, ))([A-Z]+)', expand=False)

Explanation

                                        
  (?:                      group, but do not capture:
    (?<=                     look behind to see if there is:
      ^                        the beginning of the string
      (?!                      look ahead to see if there is not:
        .*                       any character except \n (0 or more
                                 times (matching the most amount
                                 possible))
        ,                        ','
      )                        end of look-ahead
    )                        end of look-behind
   |                        OR
    (?<=                     look behind to see if there is:
      ,                        ', ' (a comma followed by a space)
    )                        end of look-behind
  )                        end of grouping
  (                        group and capture to \1:
    [A-Z]+                   any character of: 'A' to 'Z' (1 or more
                             times (matching the most amount
                             possible))
  )                        end of \1
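As a quick check of the breakdown above, the sketch below runs the same pattern through the standard `re` module on the six sample names from this thread. The list and variable names are illustrative only. It also shows one consequence of using `[A-Z]+`: only uppercase ASCII letters are captured, so a mixed-case first name is cut off after its first letter, whereas a `\w+`-based pattern would keep the whole word.

```python
import re

# Same pattern as in the answer, compiled with the standard library.
PATTERN = re.compile(r"(?:(?<=^(?!.*,))|(?<=, ))([A-Z]+)")

# Sample names from this thread.
names = [
    "JOSEPH W. JOHN", "MIMI N. ALFORD", "WANG E. Li",
    "AAMIR, DENNIS M", "MAHAMMED, LINDA X", "ABAD, FARLEY J",
]

results = []
for name in names:
    match = PATTERN.search(name)
    # group(1) is the captured run of uppercase letters.
    results.append(match.group(1) if match else None)
print(results)
# ['JOSEPH', 'MIMI', 'WANG', 'DENNIS', 'LINDA', 'FARLEY']

# Made-up mixed-case input: [A-Z]+ stops at the first lowercase letter.
mixed = PATTERN.search("Li, Wang E").group(1)
print(mixed)  # W
```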
