Answers to: Select indexes where rows of another column satisfy two conditions

Select indexes where rows of another column satisfy two conditions


I'm not sure the wording of this question is precise enough, so I hope the code example explains it better.

I have a data frame:

<pre><code>                             links       title
url
https://example.com      /feed.xml     EXAMPLE
https://example.com     /tags.html     EXAMPLE
https://example.com     /tags.html     EXAMPLE
https://example.com         /about     EXAMPLE
https://example.com      /feed.xml     EXAMPLE
https://example.com      /feed.xml     EXAMPLE
https://example222.com     /about/  EXAMPLE222
https://example222.com     /about/  EXAMPLE222
https://example333.com   /atom.xml  EXAMPLE333
https://example333.com   /archives  EXAMPLE333
https://example333.com      /about  EXAMPLE333
https://example333.com   /archives  EXAMPLE333
</code></pre>

The index is set to url, but I could also treat it as a regular column with a numeric index.

How can I select only the indexes (url) whose <code>links</code> column contains both the <code>.xml</code> and the <code>archive</code> strings? That is:

<pre><code>https://example333.com   /atom.xml  EXAMPLE333
https://example333.com   /archives  EXAMPLE333
</code></pre>

but not:

<pre><code>https://example222.com     /about/  EXAMPLE222
https://example222.com     /about/  EXAMPLE222
</code></pre>

Obviously, a simple <code>.str.contains('archive|xml')</code> also selects rows when only one of the conditions is met. In this example it would also select:

<pre><code>https://example.com      /feed.xml     EXAMPLE
https://example.com     /tags.html     EXAMPLE
</code></pre>

which is not what I want.

Solutions with or without <code>set_index</code> are both fine.
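For reference, the frame above can be rebuilt as follows. This is only a sketch that transcribes the sample data by hand and assumes `url` is kept as a regular column rather than the index; the names `either` and `or_urls` are illustrative. It shows that the row-wise OR match keeps a url that satisfies only one of the two conditions:

```python
import pandas as pd

# Sample data transcribed from the question; `url` is a regular column here.
df = pd.DataFrame({
    "url": ["https://example.com"] * 6
           + ["https://example222.com"] * 2
           + ["https://example333.com"] * 4,
    "links": ["/feed.xml", "/tags.html", "/tags.html", "/about",
              "/feed.xml", "/feed.xml",
              "/about/", "/about/",
              "/atom.xml", "/archives", "/about", "/archives"],
    "title": ["EXAMPLE"] * 6 + ["EXAMPLE222"] * 2 + ["EXAMPLE333"] * 4,
})

# Row-wise OR: a row matches if it contains either substring.
either = df["links"].str.contains("archive|xml")

# Urls that have at least one matching row, whichever condition it met.
or_urls = sorted(df.loc[either, "url"].unique())
print(or_urls)
```

Here `https://example.com` is kept although none of its links contains `archive`, which is the unwanted behaviour described above.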


1 answer

Anonymous, 1 day ago

Skilled in: python, mysql, java

The first idea is to use <a href="http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html" rel="nofollow noreferrer"><code>Series.str.extract</code></a> to get a <code>Series</code> of the matched values, convert each group to a <code>set</code>, and check whether both values are present in each group. The code assumes <code>url</code> is a regular column (call <code>df.reset_index()</code> first if it is the index):

<pre><code>s = df['links'].str.extract('(archive|xml)', expand=False)
# True for each url whose set of matches contains both values
m = s.groupby(df['url']).apply(set) >= {'xml', 'archive'}
</code></pre>

Then use <a href="http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html" rel="nofollow noreferrer"><code>Series.map</code></a> to broadcast the mask to the original rows, chained with a second condition that keeps only the matching rows:

<pre><code>df = df[df['url'].map(m) & s.notna()]
# alternative
# df = df[df['url'].map(m) & df['links'].str.contains('archive|xml')]
print(df)
                       url      links       title
8   https://example333.com  /atom.xml  EXAMPLE333
9   https://example333.com  /archives  EXAMPLE333
11  https://example333.com  /archives  EXAMPLE333
</code></pre>

If you need unique values per <code>url</code>, add <a href="http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html" rel="nofollow noreferrer"><code>DataFrame.drop_duplicates</code></a>:

<pre><code>df = df[df['url'].map(m) & s.notna()].drop_duplicates(['url', 'links'])
print(df)
                      url      links       title
8  https://example333.com  /atom.xml  EXAMPLE333
9  https://example333.com  /archives  EXAMPLE333
</code></pre>

Another approach is to flag the matches in two helper columns, sum them per <code>url</code>, and use <a href="http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.all.html" rel="nofollow noreferrer"><code>DataFrame.all</code></a> to test that both sums are greater than 0:

<pre><code>a = df['links'].str.contains('archive')
b = df['links'].str.contains('xml')
# select the helper columns with a list; the tuple form ['a','b'] is not accepted by pandas 2.x
mask = df.assign(a=a, b=b).groupby('url')[['a', 'b']].transform('sum').gt(0).all(axis=1)
# keep rows of qualifying urls that match at least one of the two strings
df = df[mask & (a | b)]
print(df)
                       url      links       title
8   https://example333.com  /atom.xml  EXAMPLE333
9   https://example333.com  /archives  EXAMPLE333
11  https://example333.com  /archives  EXAMPLE333
</code></pre>
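As a self-contained way to try the idea end to end, here is a minimal sketch of a closely related variant. It is not the code from the answer above: it swaps in `groupby(...).transform('any')` on two boolean masks instead of sets or sums. The sample frame is transcribed from the question with `url` as a regular column, and the variable names are illustrative.

```python
import pandas as pd

# (url, links, title) rows transcribed from the question.
rows = (
    [("https://example.com", link, "EXAMPLE")
     for link in ["/feed.xml", "/tags.html", "/tags.html",
                  "/about", "/feed.xml", "/feed.xml"]]
    + [("https://example222.com", "/about/", "EXAMPLE222")] * 2
    + [("https://example333.com", link, "EXAMPLE333")
       for link in ["/atom.xml", "/archives", "/about", "/archives"]]
)
df = pd.DataFrame(rows, columns=["url", "links", "title"])

# Row-level flags for each condition (plain substring match, no regex).
has_xml = df["links"].str.contains("xml", regex=False)
has_archive = df["links"].str.contains("archive", regex=False)

# Broadcast to every row: does this row's url have at least one match of each kind?
group_has_xml = has_xml.groupby(df["url"]).transform("any")
group_has_archive = has_archive.groupby(df["url"]).transform("any")

# Keep the matching rows of urls that satisfy both conditions.
result = df[group_has_xml & group_has_archive & (has_xml | has_archive)]
urls = result["url"].unique().tolist()
print(result)
print(urls)
```

Dropping the final `(has_xml | has_archive)` term would keep every row of a qualifying url, including `/about`, which may be preferable when only the list of urls matters.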
