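<p>The first idea is to use <a href="http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html" rel="nofollow noreferrer"><code>Series.str.extract</code></a> to get a <code>Series</code> of the matched keywords, convert each group to a <code>set</code>, and check whether both values are present in each group:</p>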
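<pre><code># keyword found in each link ('archive' or 'xml'), NaN if neither matches
s = df['links'].str.extract('(archive|xml)', expand=False)
# per url: True if the set of found keywords contains both 'xml' and 'archive'
m = s.groupby(df['url']).apply(set) >= set(['xml','archive'])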
</code></pre>
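<p>As a minimal, self-contained sketch of the step above: the sample frame, the example URLs, and the names <code>sample</code>, <code>kw</code>, <code>required</code> and <code>has_both</code> are hypothetical and not part of the original data. It uses an explicit subset test per group rather than comparing a <code>Series</code> of sets directly, which should give the same per-url flags.</p>

```python
import pandas as pd

# Hypothetical sample data (the original frame is not shown in full).
sample = pd.DataFrame({
    "url": ["https://a.example", "https://a.example",
            "https://b.example", "https://b.example", "https://b.example"],
    "links": ["/about", "/atom.xml", "/atom.xml", "/archives", "/contact"],
})

# Keyword found in each link; NaN where neither 'archive' nor 'xml' occurs.
kw = sample["links"].str.extract("(archive|xml)", expand=False)

# For each url, check that both keywords appear among its links.
required = {"xml", "archive"}
has_both = kw.groupby(sample["url"]).apply(
    lambda g: required.issubset(set(g.dropna()))
)

print(has_both)
```

<p>Here only the second url has both an <code>xml</code> link and an <code>archive</code> link, so only its flag is true.</p>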
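<p>Then use <a href="http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html" rel="nofollow noreferrer"><code>Series.map</code></a> to map the per-group mask back onto the original rows, and chain it with a second condition that keeps only the rows that actually matched:</p>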
<pre><code>df = df[df['url'].map(m) & s.notna()]
#alternative
#df = df[df['url'].map(m) & df['links'].str.contains('archive|xml')]
print (df)
                       url      links       title
8   https://example333.com  /atom.xml  EXAMPLE333
9   https://example333.com  /archives  EXAMPLE333
11  https://example333.com  /archives  EXAMPLE333
</code></pre>
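<p>To illustrate how <code>map</code> carries a per-group flag back to the rows, here is a small sketch. The frame <code>rows</code>, the labels <code>u1</code>/<code>u2</code>, and the hand-written <code>group_flag</code> are assumptions made for the example only.</p>

```python
import pandas as pd

# Hypothetical rows: two urls, three links.
rows = pd.DataFrame({
    "url": ["u1", "u1", "u2"],
    "links": ["/atom.xml", "/archives", "/atom.xml"],
})

# One boolean per url, indexed by url (stands in for the per-group mask).
group_flag = pd.Series({"u1": True, "u2": False})

# map looks up each row's url in group_flag, giving one boolean per row.
row_flag = rows["url"].map(group_flag)

# Boolean indexing keeps only rows whose url passed the group-level test.
kept = rows[row_flag]
print(kept)
```

<p>The result of <code>map</code> shares the index of <code>rows</code>, which is why it can be combined with other row-level conditions using <code>&</code>.</p>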
<p>If you need unique values per <code>url</code>, add <a href="http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html" rel="nofollow noreferrer"><code>DataFrame.drop_duplicates</code></a>:</p>
<pre><code>df = df[df['url'].map(m) & s.notna()].drop_duplicates(['url','links'])
print (df)
                      url      links       title
8  https://example333.com  /atom.xml  EXAMPLE333
9  https://example333.com  /archives  EXAMPLE333
</code></pre>
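<p>A short sketch of the deduplication step in isolation, using a hypothetical frame <code>rows</code> that mimics the filtered output above (the values are made up for illustration):</p>

```python
import pandas as pd

# Hypothetical filtered rows; index labels kept non-contiguous on purpose.
rows = pd.DataFrame(
    {
        "url": ["u1", "u1", "u1"],
        "links": ["/atom.xml", "/archives", "/archives"],
        "title": ["T", "T", "T"],
    },
    index=[8, 9, 11],
)

# Rows are duplicates when both 'url' and 'links' repeat; the first one is kept.
unique_rows = rows.drop_duplicates(["url", "links"])
print(unique_rows)
```

<p>By default the first occurrence of each <code>(url, links)</code> pair survives, so the later repeated <code>/archives</code> row is dropped.</p>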
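<p>Another approach is to count the matches in two helper columns and test whether both columns have at least one match per group, by comparing the summed values with <code>gt(0)</code> and <a href="http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.all.html" rel="nofollow noreferrer"><code>DataFrame.all</code></a>:</p>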
<pre><code>a = df['links'].str.contains('archive')
b = df['links'].str.contains('xml')
# select the helper columns with a list; tuple-style ['a','b'] fails in recent pandas
mask = df.assign(a=a,b=b).groupby('url')[['a','b']].transform('sum').gt(0).all(axis=1)
df = df[mask & (a | b)]
print (df)
                       url      links       title
8   https://example333.com  /atom.xml  EXAMPLE333
9   https://example333.com  /archives  EXAMPLE333
11  https://example333.com  /archives  EXAMPLE333
</code></pre>
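<p>The helper-column approach can be sketched end to end as follows. The sample frame and the names <code>sample</code>, <code>counts</code> and <code>result</code> are hypothetical; the logic follows the snippet above, with the helper columns selected by a list.</p>

```python
import pandas as pd

# Hypothetical sample data (not the original frame).
sample = pd.DataFrame({
    "url": ["u1", "u1", "u2", "u2", "u2"],
    "links": ["/about", "/atom.xml", "/atom.xml", "/archives", "/contact"],
})

# Row-level flags: does the link mention 'archive' / 'xml'?
a = sample["links"].str.contains("archive")
b = sample["links"].str.contains("xml")

# Per-url match counts, broadcast back to every row of that url.
counts = sample.assign(a=a, b=b).groupby("url")[["a", "b"]].transform("sum")

# A row's url qualifies only if both counts are positive.
mask = counts.gt(0).all(axis=1)

# Keep rows of qualifying urls that themselves match one of the keywords.
result = sample[mask & (a | b)]
print(result)
```

<p>Because <code>transform</code> returns one row per original row, <code>mask</code> already lines up with <code>sample</code> and no separate <code>map</code> step is needed.</p>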