<p>If you consider any two URLs with the same netloc to be duplicates, you can parse them with <a href="https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlparse" rel="nofollow"><code>urllib.parse.urlparse</code></a>:</p>
<pre><code>from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
u = "http://www.myurlnumber1.com/foo+%bar%baz%qux"
print(urlparse(u).netloc)
</code></pre>
<p>This gives you:</p>
<pre><code>www.myurlnumber1.com
</code></pre>
<p>So, to get the unique netlocs, you can do the following:</p>
<pre><code>unique = {urlparse(u).netloc for u in urls}
</code></pre>
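<p>The one-liner above leaves <code>urls</code> undefined. A minimal runnable sketch, assuming a made-up sample list (the second hostname is invented for illustration), might look like this:</p>
<pre><code>from urllib.parse import urlparse

# Hypothetical sample data; the second host is not from the original question.
urls = [
    "http://www.myurlnumber1.com/foo+%bar%baz%qux",
    "http://www.myurlnumber1.com",
    "http://www.myurlnumber2.com/foo",
]

# A set comprehension keeps one entry per distinct netloc.
unique = {urlparse(u).netloc for u in urls}
print(sorted(unique))  # ['www.myurlnumber1.com', 'www.myurlnumber2.com']
</code></pre>
<p>Sets are unordered, so <code>sorted</code> is used here only to make the printed result stable.</p>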
<p>If you want to keep the URL scheme:</p>
<pre><code>urls = ["http://www.myurlnumber1.com/foo+%bar%baz%qux", "http://www.myurlnumber1.com"]
unique = {"{}://{}".format(u.scheme, u.netloc) for u in map(urlparse, urls)}
print(unique)
</code></pre>
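<p>This assumes every URL has a scheme, and that you do not have both http and https variants of the same netloc that you would want to treat as identical.</p>
<p>If you also want to include the path:</p>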
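<p>If that assumption does not hold, one possible workaround is to ignore the scheme and to tolerate URLs written without one. The helper <code>netloc_of</code> below is hypothetical and only a rough sketch: <code>urlparse</code> recognises a netloc only after <code>//</code>, so the helper adds it when it is missing. It would be fooled by a scheme-less URL whose path itself contains <code>//</code>.</p>
<pre><code>from urllib.parse import urlparse

def netloc_of(url):
    """Return the netloc, tolerating URLs without a scheme (hypothetical helper)."""
    # Without a leading "//", urlparse treats the whole string as a path.
    if "//" not in url:
        url = "//" + url
    return urlparse(url).netloc

# Hypothetical sample data mixing http, https and scheme-less URLs.
urls = ["http://www.url.com/foo", "https://www.url.com/bar", "www.url.com/baz"]
print({netloc_of(u) for u in urls})  # {'www.url.com'}
</code></pre>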
<pre><code>unique = {(u.netloc, u.path) for u in map(urlparse, urls)}
</code></pre>
<p>The documentation lists the available attributes:</p>
<pre><code>Attribute  Index  Value                               Value if not present
scheme     0      URL scheme specifier                scheme parameter
netloc     1      Network location part               empty string
path       2      Hierarchical path                   empty string
params     3      Parameters for last path element    empty string
query      4      Query component                     empty string
fragment   5      Fragment identifier                 empty string
username          User name                           None
password          Password                            None
hostname          Host name (lower case)              None
port              Port number as integer, if present  None
</code></pre>
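<p>To see how these attributes relate to each other, here is a small sketch using a made-up URL (the host, credentials and port are invented for illustration). Note that <code>netloc</code> is returned as written, while <code>hostname</code> is lower-cased and stripped of credentials and port:</p>
<pre><code>from urllib.parse import urlparse

# Hypothetical URL exercising every attribute in the table.
parts = urlparse("http://User:Secret@WWW.Example.com:8080/a/b;p?q=1#frag")

print(parts.scheme)    # http
print(parts.netloc)    # User:Secret@WWW.Example.com:8080
print(parts.hostname)  # www.example.com
print(parts.port)      # 8080
print(parts.path)      # /a/b
print(parts.params)    # p
print(parts.query)     # q=1
print(parts.fragment)  # frag
</code></pre>
<p>If case differences in the host should not count as distinct URLs, <code>hostname</code> may be a better key than <code>netloc</code>.</p>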
<p>You just need to use whichever parts you consider to make a URL unique:</p>
<pre><code>In [1]: from urllib.parse import urlparse
In [2]: urls = ["http://www.url.com/foo-bar", "http://www.url.com/foo-bar?t=baz", "www.url.com/baz-qux", "www.url.com/foo-bar?t=baz"]
In [3]: unique = {"".join((u.netloc, u.path)) for u in map(urlparse, urls)}
In [4]: print(unique)
{'www.url.com/baz-qux', 'www.url.com/foo-bar'}
</code></pre>
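<p>The set above holds only the joined keys, not the original URLs. If you would rather keep the first full URL seen for each key, a dictionary can do that while preserving input order (Python 3.7+). The <code>dedupe</code> function is a hypothetical helper, sketched under the same netloc-plus-path assumption:</p>
<pre><code>from urllib.parse import urlparse

def dedupe(urls):
    """Keep the first URL seen for each netloc + path key (hypothetical helper)."""
    seen = {}
    for url in urls:
        u = urlparse(url)
        # setdefault stores the URL only the first time a key appears.
        seen.setdefault(u.netloc + u.path, url)
    return list(seen.values())

urls = ["http://www.url.com/foo-bar", "http://www.url.com/foo-bar?t=baz",
        "www.url.com/baz-qux", "www.url.com/foo-bar?t=baz"]
print(dedupe(urls))  # ['http://www.url.com/foo-bar', 'www.url.com/baz-qux']
</code></pre>
<p>The scheme-less entries still collapse onto the same keys because <code>urlparse</code> puts the whole string into <code>path</code> and leaves <code>netloc</code> empty, so the concatenation comes out identical.</p>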