How to extract the top-level domain (TLD) from a URL


How can I extract the domain name from a URL, excluding any subdomains?

My initial simplistic attempt was:

'.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])

This works for http://www.foo.com, but not for http://www.foo.com.au. Is there a way to do this correctly without using special knowledge about valid TLDs (top-level domains) or country codes (since they change over time)?
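For illustration, here is a small sketch of that naive approach (Python 2 urlparse, matching the one-liner above; the naive_domain name is just for this example) and where it goes wrong:

import urlparse  # Python 2; in Python 3 use urllib.parse instead

def naive_domain(url):
    # keep only the last two dot-separated labels of the host
    return '.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])

print naive_domain("http://www.foo.com")     # foo.com -- correct
print naive_domain("http://www.foo.com.au")  # com.au  -- wrong, should be foo.com.au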

Thanks


2 Answers

Using the list of effective TLD names that someone else found on Mozilla's website:

from __future__ import with_statement
from urlparse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tld_file:
    tlds = [line.strip() for line in tld_file if line[0] not in "/\n"]

def get_domain(url, tlds):
    url_elements = urlparse(url)[1].split('.')
    # url_elements = ["abcde","co","uk"]

    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(last_i_elements) # abcde.co.uk, co.uk, uk
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:]) # *.co.uk, *.uk, *
        exception_candidate = "!" + candidate

        # match tlds: 
        if (exception_candidate in tlds):
            return ".".join(url_elements[i:]) 
        if (candidate in tlds or wildcard_candidate in tlds):
            return ".".join(url_elements[i-1:])
            # returns "abcde.co.uk"

    raise ValueError("Domain not in global list of TLDs")

print get_domain("http://abcde.co.uk", tlds)

Result:

abcde.co.uk
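For the URL from the question, the same lookup should give the full registered domain, assuming com.au appears in the downloaded effective_tld_names file:

print get_domain("http://www.foo.com.au", tlds)
# foo.com.au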

I'd appreciate it if someone could let me know which parts of the above could be rewritten in a more Pythonic way. For example, there must be a better way of iterating over the last_i_elements list, but I can't think of one. I also don't know whether ValueError is the best thing to raise. Comments?

Here is a great Python module someone wrote to solve this problem after seeing this question: https://github.com/john-kurkowski/tldextract

The module looks up TLDs in the Public Suffix List, which is maintained by Mozilla volunteers.

Quote:

tldextract on the other hand knows what all gTLDs [Generic Top-Level Domains] and ccTLDs [Country Code Top-Level Domains] look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.
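For illustration, here is a minimal usage sketch of tldextract; the attribute names below follow the project's README (an ExtractResult with subdomain, domain and suffix fields), so check the current documentation before relying on them:

import tldextract

ext = tldextract.extract("http://forums.bbc.co.uk")
# ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk')
print ext.subdomain, ext.domain, ext.suffix

# registered domain = domain + suffix
print '.'.join(part for part in (ext.domain, ext.suffix) if part)  # bbc.co.uk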
