Selecting the indices whose rows in another column satisfy two conditions

Posted 2024-10-04 05:32:15


I'm not sure whether this question is worded precisely enough; I hope the example below explains it better.

I have a DataFrame:

                                 links                         title
url                                                                         
https://example.com           /feed.xml                       EXAMPLE
https://example.com           /tags.html                      EXAMPLE
https://example.com           /tags.html                      EXAMPLE
https://example.com           /about                          EXAMPLE
https://example.com           /feed.xml                       EXAMPLE
https://example.com           /feed.xml                       EXAMPLE
https://example222.com        /about/                         EXAMPLE222
https://example222.com        /about/                         EXAMPLE222
https://example333.com        /atom.xml                       EXAMPLE333
https://example333.com        /archives                       EXAMPLE333
https://example333.com        /about                          EXAMPLE333
https://example333.com        /archives                       EXAMPLE333

The index is set to url, but I could also keep url as a regular column with a numeric index.

How do I select only the indices (urls) whose links column contains both the .xml and archive strings?

https://example333.com        /atom.xml                       EXAMPLE333
https://example333.com        /archives                       EXAMPLE333

but not

https://example222.com        /about/                         EXAMPLE222
https://example222.com        /about/                         EXAMPLE222

Obviously, a simple .str.contains('archive|xml') also selects rows where only one of the conditions is met.

In this example it would additionally select:

https://example.com           /feed.xml                       EXAMPLE
https://example.com           /tags.html                      EXAMPLE

which is not what I want.

Solutions with or without set_index are both fine.
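
(For reference, a minimal sketch that rebuilds this sample data as a plain DataFrame, with url as an ordinary column; the answers below assume either this layout or its set_index('url') form.)

import pandas as pd

# Rebuild the sample data from the question; url is a regular column here
df = pd.DataFrame({
    'url': ['https://example.com'] * 6
           + ['https://example222.com'] * 2
           + ['https://example333.com'] * 4,
    'links': ['/feed.xml', '/tags.html', '/tags.html', '/about', '/feed.xml', '/feed.xml',
              '/about/', '/about/',
              '/atom.xml', '/archives', '/about', '/archives'],
    'title': ['EXAMPLE'] * 6 + ['EXAMPLE222'] * 2 + ['EXAMPLE333'] * 4,
})

# df.set_index('url') reproduces the indexed layout shown above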


Tags: https, com, url, example, html, feed, tags, xml
3 Answers

If you only want the URLs that match both conditions, here is the code:

urls = df.groupby(level=0).agg(
    {'links': lambda x: sum(
        f in list(x.str.extract('(archive|xml)', expand=False))
        for f in ['archive', 'xml']
    ) == 2}
)['links']

print(urls)

Out[1]:
    https://example.com       False
    https://example222.com    False
    https://example333.com     True
    Name: links, dtype: bool

print(list(urls[urls].index))

Out[2]:
    ['https://example333.com']
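
If the matching rows are needed as well as the urls, one way (a sketch, assuming url is still the index as in the question) is to select them back with .loc:

# Pull every row belonging to a url that satisfied both conditions
matched = df.loc[urls[urls].index]
print(matched)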

Perform a groupby, then apply a custom aggregation function:

def summarize(group):
    has_xml = group['links'].str.contains(r'\.xml')
    has_archive = group['links'].str.contains('archive')

    return group[has_xml | has_archive] if has_xml.any() and has_archive.any() else None

df.groupby('url').apply(summarize).reset_index(0, drop=True)

Result:

                       url      links       title
8   https://example333.com  /atom.xml  EXAMPLE333
9   https://example333.com  /archives  EXAMPLE333
11  https://example333.com  /archives  EXAMPLE333
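
A close variant of this answer (my sketch, not part of the original) uses GroupBy.filter to keep whole groups that contain both patterns and then masks the individual rows:

# Keep groups that contain both patterns, then keep only the matching rows
both = df.groupby('url').filter(
    lambda g: g['links'].str.contains(r'\.xml').any()
              and g['links'].str.contains('archive').any()
)
print(both[both['links'].str.contains(r'archive|\.xml')])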

The first idea is to use Series.str.extract on the Series, convert each group to a set, and compare it to check whether both values exist in the group:

s = df['links'].str.extract('(archive|xml)', expand=False)
m = s.groupby(df['url']).apply(set) >= set(['xml','archive'])

Then use Series.map to map the mask back onto the original data and chain it with another condition:

df = df[df['url'].map(m) & s.notna()]
#alternative
#df = df[df['url'].map(m) & df['links'].str.contains('archive|xml')]
print (df)
                       url      links       title
8   https://example333.com  /atom.xml  EXAMPLE333
9   https://example333.com  /archives  EXAMPLE333
11  https://example333.com  /archives  EXAMPLE333

If you need unique values per url, add DataFrame.drop_duplicates:

df = df[df['url'].map(m) & s.notna()].drop_duplicates(['url','links'])
print (df)
                      url      links       title
8  https://example333.com  /atom.xml  EXAMPLE333
9  https://example333.com  /archives  EXAMPLE333
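
If only the url labels are needed rather than the rows, the unique values of the filtered frame give the same answer:

print(df['url'].unique())
# ['https://example333.com']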

Another approach is to count the matches in two helper columns, sum them per group, and test with DataFrame.all whether both columns have at least one match:

a = df['links'].str.contains('archive')
b = df['links'].str.contains('xml')

mask = df.assign(a=a, b=b).groupby('url')[['a', 'b']].transform('sum').gt(0).all(axis=1)

df = df[mask & (a | b)]
print (df)
                       url      links       title
8   https://example333.com  /atom.xml  EXAMPLE333
9   https://example333.com  /archives  EXAMPLE333
11  https://example333.com  /archives  EXAMPLE333
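
A small variant (my sketch, reusing the a and b masks defined above and assuming the original df from the question): GroupBy.transform('any') marks every row of a group True when any row in that group matched, which replaces the sum/gt(0) step:

# Same result as above, without summing the helper columns
mask2 = df.assign(a=a, b=b).groupby('url')[['a', 'b']].transform('any').all(axis=1)
print(df[mask2 & (a | b)])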
