Extracting the complete URL from an href using Python

    parentUrl = urlQueue.get()
    html = get_page_source(parentUrl)
    bSoup = BeautifulSoup(html, 'html.parser')
    aTags = bSoup.find_all('a', href=True)
    for aTag in aTags:
        childUrl = aTag.get('href')
        # just to check if the url is complete or not (for .com only)
        if '.com' not in childUrl:
            # this urljoin is giving invalid results as mentioned above
            childUrl = urljoin(parentUrl, childUrl)

1 Answer

User

#1 · Posted 2024-09-24 00:32:10

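I just experimented a little. In your case, pass the base URI with a trailing slash. Without it, urljoin treats the last path segment as a file name and replaces it instead of appending to it. Everything you need for this is described in the docs of urlparse.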

>>> import urlparse
>>> urlparse.urljoin('http://www.example.org/main/test','a.xml?value=basketball')
'http://www.example.org/main/a.xml?value=basketball'
>>> urlparse.urljoin('http://www.example.org/main/test/','a.xml?value=basketball')
'http://www.example.org/main/test/a.xml?value=basketball'
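
On Python 3, `urljoin` lives in `urllib.parse` rather than `urlparse`. The sketch below repeats the two calls above and adds two cases that are illustrative assumptions, not part of the original question. They suggest that `urljoin` already leaves an absolute href untouched, so the `'.com'` check in the question may not be needed at all.

```python
from urllib.parse import urljoin

base_no_slash = 'http://www.example.org/main/test'
base_slash = 'http://www.example.org/main/test/'

# Without a trailing slash, the last path segment ("test") is replaced.
joined_no_slash = urljoin(base_no_slash, 'a.xml?value=basketball')

# With a trailing slash, the href is resolved below "test/".
joined_slash = urljoin(base_slash, 'a.xml?value=basketball')

# An absolute href is returned unchanged, whatever the base is.
# (other.example.net is a made-up host for illustration.)
joined_absolute = urljoin(base_slash, 'http://other.example.net/page')

# A root-relative href replaces the whole path of the base.
joined_root = urljoin(base_slash, '/index.html')

for url in (joined_no_slash, joined_slash, joined_absolute, joined_root):
    print(url)
```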

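By the way: this is a perfect use case for factoring the URL-building code out into a separate function. Then write some unit tests to verify that it works as expected and even handles edge cases. After that, use it in your web crawler code.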
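
One possible shape for such a function and its tests is sketched below, assuming Python 3. The name `build_child_url`, the choice to return `None` for non-HTTP(S) links, and the test cases are all hypothetical choices made for illustration, not something the original code defines.

```python
import unittest
from urllib.parse import urljoin, urlparse


def build_child_url(parent_url, href):
    """Resolve href against parent_url.

    Hypothetical helper: returns the absolute URL, or None when the
    result is not an http/https link (e.g. mailto:), which a crawler
    would normally skip.
    """
    child_url = urljoin(parent_url, href.strip())
    if urlparse(child_url).scheme not in ('http', 'https'):
        return None
    return child_url


class BuildChildUrlTest(unittest.TestCase):
    BASE = 'http://www.example.org/main/test/'

    def test_relative_href_is_appended_to_base(self):
        self.assertEqual(
            build_child_url(self.BASE, 'a.xml?value=basketball'),
            'http://www.example.org/main/test/a.xml?value=basketball')

    def test_parent_relative_href(self):
        self.assertEqual(
            build_child_url(self.BASE, '../b.html'),
            'http://www.example.org/main/b.html')

    def test_absolute_href_is_kept(self):
        self.assertEqual(
            build_child_url(self.BASE, 'https://other.example.net/page'),
            'https://other.example.net/page')

    def test_non_http_link_is_skipped(self):
        self.assertIsNone(
            build_child_url(self.BASE, 'mailto:someone@example.org'))
```

Saved as a module, the tests can be run with `python -m unittest`. In the crawler loop, a call such as `build_child_url(parentUrl, aTag.get('href'))` could then stand in for the `'.com'` check.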
