Cleaning up unwanted URLs in Python

/preferences?hl=en&someting
/preferences?hl=en&someting
/history/something
/history/something
/support?pr=something
/support?pr=something
http://www.web1.com/parameters
http://www.web1.com/parameters
http://www.web2.com/parameters
http://www.web2.com/parameters

2 answers

User

Answer 1 · edited 2024-09-27 07:25:43

re.findall returns a list of matches, which is why the printed results are wrapped in [].

import re

# sourceCode is the text to search, e.g. the page HTML as a string
links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', sourceCode)
# iterate over set(links), not links, so each URL is printed once
for link in set(links):
    print(link)

Note that I wrapped links in set() in the for loop so that only unique links are iterated, which keeps the same URL from being printed more than once.
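One caveat: iterating over a set does not keep the links in the order they were found. If order matters, a minimal sketch (assuming Python 3.7+, where dicts preserve insertion order) is to deduplicate with dict.fromkeys instead. The sample list and the helper name unique_in_order are illustrative, modelled on the duplicated output in the question.

```python
# Hypothetical sample input, modelled on the duplicated output in the question.
links = [
    "http://www.web1.com/parameters",
    "http://www.web1.com/parameters",
    "http://www.web2.com/parameters",
    "http://www.web2.com/parameters",
]


def unique_in_order(items):
    """Return items without duplicates, keeping first-seen order."""
    # dict keys are unique and (since Python 3.7) keep insertion order.
    return list(dict.fromkeys(items))


unique_links = unique_in_order(links)
for link in unique_links:
    print(link)
```

Use set() when order is irrelevant and dict.fromkeys when the output should follow the page order.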

User
Answer 2 · edited 2024-09-27 07:25:43

Try using:
import re
links = re.findall(r'href="(http.*?)"', sourceCode)
links = sorted(set(links))
for link in links:
    print(link)
This picks up only the links that start with http, removes duplicates, and sorts them.
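To see the effect on a concrete input, here is a small sketch of the same approach. The sample_html string and the function name extract_http_links are made up for illustration; a real page would be fetched or read from a file. The relative link is skipped because its href does not start with http.

```python
import re

# Hypothetical page source; not taken from the original question.
sample_html = (
    '<a href="/preferences?hl=en&something">Prefs</a>'
    '<a href="http://www.web2.com/parameters">Two</a>'
    '<a href="http://www.web1.com/parameters">One</a>'
    '<a href="http://www.web2.com/parameters">Two again</a>'
)


def extract_http_links(html):
    """Return the sorted, unique href values that start with http."""
    # The non-greedy group stops at the first closing double quote.
    return sorted(set(re.findall(r'href="(http.*?)"', html)))


result = extract_http_links(sample_html)
for link in result:
    print(link)
```

A regex is enough for simple, consistently quoted markup like this; pages with single-quoted or unquoted attributes would likely need an HTML parser instead.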
