How to extract the top-level domain (TLD) from a URL


How can I extract the domain name from a URL, excluding any subdomains?

My initial simplistic attempt was:

'.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])

This works for http://www.foo.com, but not for http://www.foo.com.au. Is there a way to do this correctly without using special knowledge about valid TLDs (top-level domains) or country codes (since they change over time)?
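For illustration, here is a small sketch of that naive approach (Python 2 urlparse, matching the one-liner above; the naive_domain name is just for this example) and where it goes wrong:

import urlparse  # Python 2; in Python 3 use urllib.parse instead

def naive_domain(url):
    # keep only the last two dot-separated labels of the host
    return '.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])

print naive_domain("http://www.foo.com")     # foo.com -- correct
print naive_domain("http://www.foo.com.au")  # com.au  -- wrong, should be foo.com.au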

Thanks


2 Answers

Using the list of effective TLD names that someone else found on Mozilla's website:

from __future__ import with_statement
from urlparse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tld_file:
    tlds = [line.strip() for line in tld_file if line[0] not in "/\n"]

def get_domain(url, tlds):
    url_elements = urlparse(url)[1].split('.')
    # url_elements = ["abcde","co","uk"]

    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(last_i_elements) # abcde.co.uk, co.uk, uk
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:]) # *.co.uk, *.uk, *
        exception_candidate = "!" + candidate

        # match tlds: 
        if (exception_candidate in tlds):
            return ".".join(url_elements[i:]) 
        if (candidate in tlds or wildcard_candidate in tlds):
            return ".".join(url_elements[i-1:])
            # returns "abcde.co.uk"

    raise ValueError("Domain not in global list of TLDs")

print get_domain("http://abcde.co.uk", tlds)

Result:

abcde.co.uk
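For the URL from the question, the same lookup should give the full registered domain, assuming com.au appears in the downloaded effective_tld_names file:

print get_domain("http://www.foo.com.au", tlds)
# foo.com.au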

I'd appreciate it if someone could let me know which parts of the above could be rewritten in a more Pythonic way. For example, there must be a better way of iterating over the last_i_elements list, but I can't think of one. I also don't know whether ValueError is the best thing to raise. Comments?

Here is a great Python module someone wrote to solve this problem after seeing this question: https://github.com/john-kurkowski/tldextract

The module looks up TLDs in the Public Suffix List, which is maintained by Mozilla volunteers.

Quote:

tldextract on the other hand knows what all gTLDs [Generic Top-Level Domains] and ccTLDs [Country Code Top-Level Domains] look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.
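For illustration, here is a minimal usage sketch of tldextract; the attribute names below follow the project's README (an ExtractResult with subdomain, domain and suffix fields), so check the current documentation before relying on them:

import tldextract

ext = tldextract.extract("http://forums.bbc.co.uk")
# ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk')
print ext.subdomain, ext.domain, ext.suffix

# registered domain = domain + suffix
print '.'.join(part for part in (ext.domain, ext.suffix) if part)  # bbc.co.uk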
