from __future__ import with_statement
from urlparse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tld_file:
    tlds = [line.strip() for line in tld_file if line[0] not in "/\n"]

def get_domain(url, tlds):
    url_elements = urlparse(url)[1].split('.')
    # url_elements = ["abcde", "co", "uk"]

    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        # i=-3: ["abcde", "co", "uk"]
        # i=-2: ["co", "uk"]
        # i=-1: ["uk"] etc.

        candidate = ".".join(last_i_elements)  # abcde.co.uk, co.uk, uk
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:])  # *.co.uk, *.uk, *
        exception_candidate = "!" + candidate

        # match tlds:
        if exception_candidate in tlds:
            return ".".join(url_elements[i:])
        if candidate in tlds or wildcard_candidate in tlds:
            return ".".join(url_elements[i-1:])
            # returns "abcde.co.uk"

    raise ValueError("Domain not in global list of TLDs")

print get_domain("http://abcde.co.uk", tlds)
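The block above is Python 2 (`urlparse` module, `print` statement). A rough Python 3 port of the same lookup, with a tiny hard-coded rule set standing in for effective_tld_names.dat (the rule set here is an assumption for illustration, not the real file):

```python
from urllib.parse import urlparse

def get_domain(url, tlds):
    # tlds: a set of Public Suffix List rules, e.g. {"uk", "co.uk", "*.jp"}
    url_elements = urlparse(url).netloc.split('.')
    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        candidate = ".".join(last_i_elements)
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:])
        exception_candidate = "!" + candidate
        if exception_candidate in tlds:
            return ".".join(url_elements[i:])
        if candidate in tlds or wildcard_candidate in tlds:
            return ".".join(url_elements[i-1:])
    raise ValueError("Domain not in global list of TLDs")

# Toy rule set standing in for the real data file:
rules = {"uk", "co.uk", "com"}
print(get_domain("http://abcde.co.uk", rules))      # abcde.co.uk
print(get_domain("http://www.example.com", rules))  # example.com
```

The logic is unchanged: walk suffixes from longest to shortest, and when a suffix (or its wildcard form) matches a rule, the registered domain is that suffix plus one more label.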
tldextract on the other hand knows what all gTLDs [Generic Top-Level Domains]
and ccTLDs [Country Code Top-Level Domains] look like
by looking up the currently living ones according to the Public Suffix
List. So, given a URL, it knows its subdomain from its domain, and its
domain from its country code.
The code above uses the data file from Mozilla's website that someone else suggested.

Result: abcde.co.uk
I'd be grateful if anyone could let me know which parts of the above could be rewritten in a more Pythonic way. For example, there must be a better way to iterate over the last_i_elements list, but I can't think of one. I also don't know whether ValueError is the best exception to raise. Comments?

Here is a nice Python module someone wrote to solve this problem after seeing this question: https://github.com/john-kurkowski/tldextract
The module looks up TLDs against the Public Suffix List, which is maintained by Mozilla volunteers.