Remove duplicate URL structures

Posted 2024-09-28 21:49:09

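I am writing a crawler, and I have a list containing a set of URLs like these:

http://somesite.com/colection/id/index.php?if=12
http://somesite.com/index.php?id=14
http://somesite.com/index.php?id=156
http://example.com/view.php?image=441
http://somesite.com/page.php?id=sas231
http://example.com/view.php?ivideo=4
http://somesite.com/page.php?id=56
http://example.com/view.php?image=1

I want to group the URLs that share the same structure after the domain and keep only the first URL of each group, the way Burp Suite does: it has a feature that removes duplicate URLs (same parameters but different values).

http://somesite.com/colection/id/index.php?if=12
http://somesite.com/index.php?id=14
http://example.com/view.php?image=441
http://somesite.com/page.php?id=sas231
http://example.com/view.php?ivideo=4

As you can see, pages that are the same but have different query string values have been removed. This is what I want to achieve. I have tried many regular expressions, but none of them worked. Can anyone help me solve this? Thanks in advance, and sorry for my poor English.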


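1 answer

User

#1 · Posted 2024-09-28 21:49:09

You can use the urlparse library (urllib.parse in Python 3) to split a URL into its components and then extract the parts you need. For example: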

>>> from urllib.parse import urlparse
>>> urlparse('http://somesite.com/page.php?id=sas231')
ParseResult(scheme='http', netloc='somesite.com', path='/page.php', params='', query='id=sas231', fragment='')

The Python 3 version of the library is documented under urllib.parse.
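
Building on the parsing step, here is a minimal sketch of how the deduplication itself might look. It assumes that two URLs have the "same structure" when they share the host, the path, and the set of query parameter names, with parameter values ignored. The helper names structure_key and dedupe_by_structure are made up for this illustration, and the sample list uses URLs like those in the question.

```python
from urllib.parse import urlparse, parse_qsl


def structure_key(url):
    """Return (host, path, sorted parameter names); parameter values are ignored."""
    parts = urlparse(url)
    # keep_blank_values=True so that "?id=" still counts as having an "id" parameter
    names = {name for name, _ in parse_qsl(parts.query, keep_blank_values=True)}
    return (parts.netloc.lower(), parts.path, tuple(sorted(names)))


def dedupe_by_structure(urls):
    """Keep the first URL seen for each structure, preserving input order."""
    seen = set()
    unique = []
    for url in urls:
        key = structure_key(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


urls = [
    "http://somesite.com/colection/id/index.php?if=12",
    "http://somesite.com/index.php?id=14",
    "http://somesite.com/index.php?id=156",
    "http://example.com/view.php?image=441",
    "http://somesite.com/page.php?id=sas231",
    "http://example.com/view.php?ivideo=4",
    "http://somesite.com/page.php?id=56",
    "http://example.com/view.php?image=1",
]

for url in dedupe_by_structure(urls):
    print(url)
```

Comparing parsed components rather than matching with a regular expression means the order of the parameters does not matter, and a URL with a different parameter name (such as ivideo versus image) is treated as a different structure. If the scheme should also distinguish URLs, parts.scheme could be added to the key.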
