What is the best way to scrape this website? (Not Selenium)



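Before I begin: the TLDR is at the bottom.

I'm trying to scrape <a href="https://rarbgmirror.com/" rel="nofollow noreferrer">https://rarbgmirror.com/</a> for torrent magnet links and their torrent titles, based on a search term entered by the user. I've already worked out how to do this with BeautifulSoup and Requests using the following code:

<pre><code>from urllib.parse import urljoin
import re

from bs4 import BeautifulSoup
import requests

BASE_URL = 'https://rarbgmirror.com/'

query = input("Input a search: ")
magnets = []
titles = []

try:
    # Let requests URL-encode the search term
    response = requests.get(urljoin(BASE_URL, 'torrents.php'), params={'search': query})
    response.raise_for_status()
except requests.RequestException as err:
    raise SystemExit(f"ERROR: {err}")

soup = BeautifulSoup(response.text, 'lxml')

# Each search result links to a detail page under /torrent/
for anchor in soup.find_all('a', attrs={'href': re.compile("^/torrent/")}):
    # Hrefs are site-relative, so resolve them against the site being searched
    page_link = urljoin(BASE_URL, anchor.get('href'))
    try:
        page_response = requests.get(page_link)
        page_response.raise_for_status()
    except requests.RequestException as err:
        print(f"ERROR: {err}")
        continue

    page_soup = BeautifulSoup(page_response.content, 'lxml')
    magnet = page_soup.find('a', attrs={'href': re.compile("^magnet")})
    title = page_soup.find('h1')
    # Skip pages where either element is missing
    if magnet is None or title is None:
        continue
    magnets.append(magnet.get('href'))
    titles.append(title.get_text(strip=True))

print(titles)
print(magnets)
</code></pre>

I'm almost certain there is no error in this code, because it was originally written for <a href="https://1377x.to" rel="nofollow noreferrer">https://1377x.to</a> for the same purpose, and if you look at the HTML structure of both sites, they use the same tags for magnet links and title names. If the code is faulty, though, please point that out to me.

After some research, I found that the problem is that <a href="https://rarbgmirror.com/" rel="nofollow noreferrer">https://rarbgmirror.com/</a> uses JavaScript to load its pages dynamically. After some more research, I found that Selenium is recommended for this. Having used Selenium for a while, I found it has some drawbacks, such as:

<ul>
<li>Slow scraping speed</li>
<li>The system running the application must have the browser that Selenium drives installed (I plan to package the application with PyInstaller, so this would be a problem)</li>
</ul>

So I'm asking for an alternative to Selenium for scraping dynamically loaded web pages.

TLDR: I want an alternative to Selenium for scraping a website that is loaded dynamically with JavaScript.

PS: GitHub repo: <a href="https://github.com/eliasbenb/MagnetMagnet" rel="nofollow noreferrer">https://github.com/eliasbenb/MagnetMagnet</a>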
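The detail-page step described in the question (take the first `magnet:` link and the `<h1>` title) can be separated from the network calls so it can be checked against saved HTML. Below is a minimal sketch of that idea using only the standard library's `html.parser`, which also keeps the dependency list short when packaging. The names `MagnetPageParser` and `extract_magnet_and_title` and the sample HTML are hypothetical, not taken from the site or the repo. It only parses markup that is already present in the response, so it does not by itself handle content injected by JavaScript after the page loads.

```python
from html.parser import HTMLParser


class MagnetPageParser(HTMLParser):
    """Collects the first magnet link and the first <h1> text on a page."""

    def __init__(self):
        super().__init__()
        self.magnet = None
        self.title = None
        self._in_h1 = False
        self._h1_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.magnet is None:
            # attrs is a list of (name, value) pairs; value may be None
            href = dict(attrs).get("href") or ""
            if href.startswith("magnet:"):
                self.magnet = href
        elif tag == "h1" and self.title is None:
            self._in_h1 = True

    def handle_data(self, data):
        # Gather text inside the first <h1>, including nested tags' text
        if self._in_h1:
            self._h1_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            self.title = "".join(self._h1_parts).strip()


def extract_magnet_and_title(html):
    """Return (magnet_href, title); either is None if not found."""
    parser = MagnetPageParser()
    parser.feed(html)
    parser.close()
    return parser.magnet, parser.title


# Hypothetical sample page, standing in for a saved detail page
sample_html = """
<html><body>
  <h1> Example Torrent Title </h1>
  <a href="/other">other</a>
  <a href="magnet:?xt=urn:btih:0000000000000000000000000000000000000000">Magnet</a>
</body></html>
"""
magnet, title = extract_magnet_and_title(sample_html)
print(title)
print(magnet)
```

Running the extraction on saved responses like this is one way to tell whether the magnet link and title are actually present in the HTML that `requests` receives, or whether they only appear after scripts run in a browser.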


