Scraping with Python BeautifulSoup: getting the full URL from an href that is only a relative link

for i in soup.findAll('a', attrs={'class': 'reference internal'}):
    if "AccessAnalyzer" in i:
        print(i)
        link = i['href']
        print(link)

(output)
<a class="reference internal" href="accessanalyzer.html">AccessAnalyzer</a>
accessanalyzer.html

2 Answers

User

Answer #1 · Edited 2024-10-01 04:53:23

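After retrieving the HREF value, you have to do some additional processing.

What you need to do is take the base URL path of the source page and append the HREF value to it.

Suppose the source page is "https://example.com/stuff/source.html" and it contains a link with the HREF "foo.html". You need to take the base URL path of the source page (i.e. "https://example.com/stuff/") and append the HREF value to get "https://example.com/stuff/foo.html".

You can use the os.path.dirname function to help: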

>>> import os
>>> dir = os.path.dirname('https://example.com/stuff/source.html')
>>> dir
'https://example.com/stuff'

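Then join the two parts together (on Windows, os.path.join inserts a backslash, so this form is only reliable on POSIX systems):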

>>> os.path.join(dir, "foo.html")
'https://example.com/stuff/foo.html'
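
As an alternative to the os.path approach above, and not something the original answer used, the standard library's urllib.parse.urljoin resolves an href against the page URL directly. A minimal sketch, assuming the page URL from the example; the resolve_href helper and the extra hrefs are hypothetical names chosen for illustration:

```python
from urllib.parse import urljoin

# Assumed URL of the page the link was scraped from (taken from the example above).
page_url = "https://example.com/stuff/source.html"


def resolve_href(base_url, href):
    """Hypothetical helper: resolve an href against the URL of its source page."""
    return urljoin(base_url, href)


# Relative href: resolved against the directory of the source page.
print(resolve_href(page_url, "foo.html"))
# Parent-directory href: ".." segments are collapsed.
print(resolve_href(page_url, "../other/bar.html"))
# Root-relative href: only the scheme and host are kept from the base.
print(resolve_href(page_url, "/top.html"))
# Absolute href: returned unchanged.
print(resolve_href(page_url, "https://example.org/x.html"))
```

In the question's loop, this would amount to passing the URL the soup was fetched from together with i['href']. Unlike os.path.join, the result does not depend on the operating system.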

User

Answer #2 · Edited 2024-10-01 04:53:23

Similar to what's described here, I believe you actually need some kind of webdriver automation (Selenium, etc.) to simulate the hover and get the data.
