Removing words that match a regex pattern from a DataFrame in Python

import pandas as pd

data = {'Random': ['helloooo', 'hahaha', 'kebab', 'shsh', 'title', 'miss', 'were', 'laptop', 'welcome', 'pencil']}
df = pd.DataFrame(data)
# Drop every row in which any cell matches the pattern
df = df.loc[~df.agg(lambda x: x.str.contains(r"([a-z])+\1{1,}\b"), axis=1).any(axis=1)].reset_index(drop=True)
print(df)

2 answers

User

Reply 1 · edited 2024-09-28 19:27:56

You can build a boolean mask directly with Series.str.contains, disabling the UserWarning before the operation and re-enabling it afterwards:

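import pandas as pd
import warnings

data = {'Random' : ['helloooo', 'hahaha', 'kebab', 'shsh', 'title', 'miss', 'were', 'laptop', 'welcome', 'pencil']}
df = pd.DataFrame(data)
# The message argument is a regex matched from the start of the warning text.
# The exact wording differs between pandas versions, hence the leading ".*".
warnings.filterwarnings("ignore", '.*has match groups') # Disable the warning
# Keep only the rows whose word does NOT match; assigning the filtered
# Series back to df['Random'] would leave NaN in the dropped rows instead
df = df[~df['Random'].str.contains(r"([a-z]+)[a-z]?\1")]
warnings.filterwarnings("always", '.*has match groups') # Enable the warning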

Output:

>>> df['Random'][~df['Random'].str.contains(r"([a-z]+)[a-z]?\1")]
7     laptop
8    welcome
9     pencil
Name: Random, dtype: object

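Your regex has a problem: the quantifier is placed outside the group, so the group only keeps the last letter it matched and \1 looks for the wrong repeated string. The \b word boundary is also superfluous. The pattern ([a-z]+)[a-z]?\1 matches one or more letters, then one optional letter, and then the same substring that the group captured.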

See the regex demo.
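
The difference between the two patterns can be checked without pandas. The following is a minimal sketch using only the standard re module; the variable names (words, original, fixed) are illustrative and not from the question.

```python
import re

words = ["helloooo", "hahaha", "kebab", "shsh", "title",
         "miss", "were", "laptop", "welcome", "pencil"]

# Pattern from the question: the + sits outside the group, so group 1
# only holds the last letter matched, and \b forces the match to end
# at a word boundary.
original = re.compile(r"([a-z])+\1{1,}\b")

# Pattern from the answer: the + is inside the group, so group 1 holds
# a whole substring that must repeat, with at most one letter in between.
fixed = re.compile(r"([a-z]+)[a-z]?\1")

flagged_by_original = [w for w in words if original.search(w)]
flagged_by_fixed = [w for w in words if fixed.search(w)]

print(flagged_by_original)  # ['helloooo', 'miss']
print(flagged_by_fixed)     # ['helloooo', 'hahaha', 'kebab', 'shsh', 'title', 'miss', 'were']
```

On this sample the question's pattern only catches words that end in a doubled letter, while the corrected pattern catches all seven words containing a repeated substring.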

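It is safe to disable the UserWarning here because the capturing group is deliberate: the backreference in this pattern needs it. The warning should be re-enabled afterwards so that unnecessary capturing groups elsewhere in the code are still reported.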
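
As an alternative to switching the filter off and on by hand, the suppression can be scoped with the standard library's warnings.catch_warnings context manager, which restores the previous filters on exit. This is a sketch under one assumption worth stating: it ignores every UserWarning raised inside the block rather than matching the message text, which is broader than the approach above. The names pattern, mask and filtered are illustrative.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"Random": ["helloooo", "hahaha", "kebab", "shsh", "title",
                              "miss", "were", "laptop", "welcome", "pencil"]})

pattern = r"([a-z]+)[a-z]?\1"

# Filters changed inside the with-block are restored when it exits,
# so the warning stays active for the rest of the program.
with warnings.catch_warnings():
    # Assumption: ignoring all UserWarning in this block is acceptable
    warnings.simplefilter("ignore", UserWarning)
    mask = df["Random"].str.contains(pattern)

# Keep the rows that do not match and renumber the index
filtered = df[~mask].reset_index(drop=True)
print(filtered)
```

Because the filter change is confined to the with-block, there is no second call to remember if the masking line raises an exception.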

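User

Reply 2 · edited 2024-09-28 19:27:56

IIUC, you can use a pattern like r'(\w+)(\w)?\1', i.e. one or more word characters, an optional word character, and then whatever the first group matched. Note that \w also matches digits and underscores, not only letters. This gives the correct result for the sample data: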

df[~df.Random.str.contains(r'(\w+)(\w)?\1')]
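
The two answers agree on the sample data, but the patterns are not identical. The sketch below, using only the standard re module, shows where they diverge; the test strings 'a1a' and 'x_x' are made up for illustration and do not come from the question.

```python
import re

letters_only = re.compile(r"([a-z]+)[a-z]?\1")   # pattern from reply 1
word_chars = re.compile(r"(\w+)(\w)?\1")         # pattern from reply 2

# 'a1a' and 'x_x' are hypothetical inputs chosen to expose the difference
samples = ["kebab", "a1a", "x_x", "laptop"]

letters_hits = [s for s in samples if letters_only.search(s)]
word_hits = [s for s in samples if word_chars.search(s)]

print(letters_hits)  # ['kebab']
print(word_hits)     # ['kebab', 'a1a', 'x_x']
```

If the column can contain digits or underscores, the choice between [a-z] and \w decides whether strings such as 'a1a' are dropped.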
