Python: UserWarning: This pattern has match groups. To actually get the groups, use str.ex

member_id,event_path,event_time,event_duration
30595,"2016-03-30 12:27:33",yandex.ru/,1
30595,"2016-03-30 12:31:42",yandex.ru/,0
30595,"2016-03-30 12:31:43",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:44",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:45",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:46",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:49",kinogo.co/,1
30595,"2016-03-30 12:32:11",kinogo.co/melodramy/,0

url
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_bq_phoenix
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_fly_
003\.ru\/sonyxperia
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnye_telefony_smartfony
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnye_telefony_smartfony\/brands5D5Bbr_23
1click\.ru\/sonyxperia
1click\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/chasy-motorola

import pandas as pd

# Note: error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' replaces it there.
urls = pd.read_csv('relevant_url1.csv', error_bad_lines=False)
substr = urls.url.values.tolist()
data = pd.read_csv('data_nts2.csv', error_bad_lines=False, chunksize=50000)
result = pd.DataFrame()
for i, df in enumerate(data):
    res = df[df['event_time'].str.contains('|'.join(substr), regex=True)]

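3 answers

User

#1 · Edited 2024-05-18 08:45:04

Another way to get rid of the warning is to change the regex so that it uses a non-capturing group instead of a capturing group. That is the (?:...) notation.

Thus, if the group is (url1|url2), it should be replaced with (?:url1|url2).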
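The substitution can be sketched with the standard re module alone. This is only an illustration: url1 and url2 are placeholders, and it assumes (the thread does not confirm this) that the warning is triggered by the number of capturing groups in the compiled pattern.

```python
import re

capturing = re.compile(r'(url1|url2)')        # one capturing group
non_capturing = re.compile(r'(?:url1|url2)')  # same alternation, nothing captured

# .groups is the number of capturing groups in the compiled pattern.
print(capturing.groups, non_capturing.groups)

# Both forms accept and reject exactly the same strings.
for text in ['see url1 here', 'see url2 here', 'nothing relevant']:
    assert bool(capturing.search(text)) == bool(non_capturing.search(text))
```

Since the match result is unchanged, swapping ( for (?: should be safe wherever the captured text is never read.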

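User

#2 · Edited 2024-05-18 08:45:04

At least one of the regex patterns in urls must use a capturing group. str.contains only returns True or False for each row in df['event_time']; it makes no use of the capturing group. The UserWarning is therefore alerting you that the regex contains a capturing group whose match is never used.

If you want to get rid of the UserWarning, you can find and remove the capturing groups from the regex patterns. They do not appear in the patterns you posted, but they must be present in your actual file. Look for parentheses outside of character classes.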
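One possible way to locate the offending patterns, rather than reading the file by eye, is to compile each one and check its group count. The list below is hypothetical: the first two entries are taken from the question, and the third is an invented pattern added so that there is something to find.

```python
import re

# Hypothetical pattern list standing in for the 'url' column of the real file.
patterns = [
    r'003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_fly_',  # () only inside [...]
    r'1click\.ru\/sonyxperia',
    r'example\.com\/(phones|tablets)',  # invented: a real capturing group
]

# re.compile(...).groups counts capturing groups; parentheses inside a
# character class are literal characters and are not counted.
offenders = [p for p in patterns if re.compile(p).groups > 0]
print(offenders)
```

Only the invented third pattern is reported, which matches the advice above: the parentheses inside [...] in the posted patterns are not groups.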

Alternatively, you can suppress this particular UserWarning by placing

import warnings
warnings.filterwarnings("ignore", 'This pattern has match groups')

before the call to str.contains.
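If a global filter feels too broad, the same filter can be limited to one block with warnings.catch_warnings. The sketch below does not call pandas at all; noisy_contains is a hypothetical stand-in that emits the same warning text, used only so the effect of the filter can be observed.

```python
import warnings

MESSAGE = ('This pattern has match groups. '
           'To actually get the groups, use str.extract.')

def noisy_contains():
    # Hypothetical stand-in for str.contains: emits the same warning text.
    warnings.warn(MESSAGE, UserWarning)
    return True

# Without the filter, the warning is recorded as usual.
with warnings.catch_warnings(record=True) as seen:
    warnings.simplefilter('always')
    noisy_contains()

# The filter's message argument is a regex matched against the start of the
# warning text, so the leading words are enough to select this one warning.
with warnings.catch_warnings(record=True) as silenced:
    warnings.simplefilter('always')
    warnings.filterwarnings('ignore', 'This pattern has match groups')
    noisy_contains()

print(len(seen), len(silenced))
```

The filters are restored when each with block exits, so other UserWarnings elsewhere in the program stay visible.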

Here is a simple example that demonstrates the problem (and the solution):

# import warnings
# warnings.filterwarnings("ignore", 'This pattern has match groups') # uncomment to suppress the UserWarning

import pandas as pd

df = pd.DataFrame({ 'event_time': ['gouda', 'stilton', 'gruyere']})

urls = pd.DataFrame({'url': ['g(.*)']})   # With a capturing group, there is a UserWarning
# urls = pd.DataFrame({'url': ['g.*']})   # Without a capturing group, there is no UserWarning. Uncommenting this line avoids the UserWarning.

substr = urls.url.values.tolist()
df[df['event_time'].str.contains('|'.join(substr), regex=True)]

prints

  script.py:10: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
  df[df['event_time'].str.contains('|'.join(substr), regex=True)]

Removing the capturing group from the regex pattern:

urls = pd.DataFrame({'url': ['g.*']})

avoids the UserWarning.

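User

#3 · Edited 2024-05-18 08:45:04

Since regex=True is passed, substr is treated as a regex, and in your case it contains capturing groups (strings enclosed in parentheses).

You get the warning because str.contains is of no use if you want to capture something: it only returns a boolean indicating whether the pattern is found in the string.

Obviously, you can suppress the warnings, but it is better to fix them.

Either escape the parentheses (if they are meant to match literally) or use str.extract if you really want to capture something.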
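Both options can be sketched briefly. The first Series reuses the values from the example in answer #2; the labels Series is made up for illustration.

```python
import re
import pandas as pd

s = pd.Series(['gouda', 'stilton', 'gruyere'])

# str.extract returns what the group captured (NaN where nothing matched).
captured = s.str.extract(r'g(.*)', expand=False)
print(captured.tolist())

# To match literal parentheses instead, escape them; re.escape does this
# for a whole string, so the resulting pattern has no capturing groups.
labels = pd.Series(['gouda (dutch)', 'gouda dutch'])  # hypothetical data
literal = labels.str.contains(re.escape('(dutch)'), regex=True)
print(literal.tolist())
```

With expand=False and a single group, str.extract gives back a Series rather than a DataFrame, which is usually the more convenient shape for one captured value per row.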
