How to read a weblog file with URLExtract and get the unique URLs in Python

60.80.94.184 anonymous Moz/2.0 (iPhone; CPU iPhone OS 9_0 like Mac OS X) Apple/65.1.90 (HTML, like Gecko) Version/12.0 Mobile/15E Safari/604.1 2012-06-22 03:43:51 - 60.80.17.54 8090 0 781 9843 SSL-tunnel - qs.rtoas.zp:80 Upstream 0 0x3 Allowed
180.81.82.170 anonymous iPad1,3/09.1.1 (16q0) 2012-06-24 04:53:57 - 90.80.97.54 8070 47 217 8440 http GET http://init-p0.pu.apple.com/bag?v=9 Upstream 200 0x400 Allowed
109.13.61.195 anonymous clo/76.119 Network/95.0.3 Dain/1.2.0 2012-06-25 09:43:54 - 190.22.19.94 8220 0 517 5057 SSL-tunnel - eree-022.opt-2.icloud-content.com:443 Upstream 0 0x8 Allowed
20.81.82.110 anonymous iPad1,1/09.1.1 (46q5) 2012-06-27 14:53:57 - 40.10.27.54 8070 47 217 8440 http GET https://qwe-pu.uoras.com/bag?v=19 Upstream 200 x00 Allowed

1 answer

User

#1 · Posted 2024-09-30 01:25:09

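With only_unique=True, urlextract returns unique URLs only within the text passed to a single find_urls call (here, the lines variable). Because you append the results of every call together, a URL that occurs on several lines ends up in the combined list several times.

If the order of the URLs does not matter and you only need them to be unique, try the following: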

from urlextract import URLExtract

log_path = "WEB_000.w3c"
extractor = URLExtract()
urls_unique = set()  # a set silently drops duplicates across lines
with open(log_path, 'r', encoding='utf-8') as f:
    for line in f:
        # Pass the current line (not an undefined "lines" variable)
        urls = extractor.find_urls(line, only_unique=True)
        urls_unique |= set(urls)
print(urls_unique)
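
If the order of first appearance does matter, a set is not enough, because it does not keep insertion order. The sketch below is one possible variant, not part of the original answer: it collects URLs as keys of a dict, which preserves insertion order in Python 3.7+. The names unique_urls_in_order and simple_find_urls are hypothetical, and the regex extractor is only a simplified stand-in so the example runs without urlextract installed.

```python
import re
from typing import Callable, Iterable, List

def unique_urls_in_order(
    lines: Iterable[str],
    find_urls: Callable[[str], List[str]],
) -> List[str]:
    """Return each URL once, in order of first appearance.

    `find_urls` is any callable that maps a line of text to a list of URLs.
    """
    seen = {}  # dict keys keep insertion order (Python 3.7+)
    for line in lines:
        for url in find_urls(line):
            seen.setdefault(url, None)
    return list(seen)

# Hypothetical stand-in extractor: matches only http:// and https:// URLs,
# so it is far less thorough than urlextract.
_URL_RE = re.compile(r"https?://\S+")

def simple_find_urls(line: str) -> List[str]:
    return _URL_RE.findall(line)

if __name__ == "__main__":
    sample = [
        "http GET http://init-p0.pu.apple.com/bag?v=9 Upstream 200 0x400 Allowed",
        "http GET https://qwe-pu.uoras.com/bag?v=19 Upstream 200 x00 Allowed",
        "http GET http://init-p0.pu.apple.com/bag?v=9 Upstream 200 0x400 Allowed",
    ]
    for url in unique_urls_in_order(sample, simple_find_urls):
        print(url)
```

To use it with the real library, pass the open file object as lines and URLExtract().find_urls as find_urls. Unlike the regex stand-in, urlextract can also match bare hostnames that have no scheme, so the two extractors will not always return the same URLs.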
