Ignoring certain links when crawling with Scrapy in the Splash browser

2 answers

User

Answer 1 · edited 2024-06-01 11:21:05

Here is how I avoided honeypot links with Splash and Scrapy. I wrote the following script for this:

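splash:on_request(function(request)
    -- Drop requests to honeypot hosts; "%." matches a literal dot in Lua patterns
    if string.match(request.url, '^%l+://%w+%.example%.com') or
       string.match(request.url, '^%l+://%w+[^%w]+%a+%.example%.com') then
        request.abort()
        return
    end
    -- Drop error.js requests, which otherwise stall rendering
    if string.match(request.url, 'error%.js') then
        print("## get error while page rendering ###")
        request.abort()
    end
end)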

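If a request URL matches one of the patterns (the honeypot links), the code above drops that request; the second branch runs when error.js is requested. The second condition is very important when rendering in Splash: if you do not handle this kind of JS, the Splash rendering engine hangs and never returns control to you.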
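To check which URLs such a filter would drop before loading the script into Splash, the patterns can be approximated in Python. This is only a sketch: `should_abort` is a hypothetical helper, and the regexes assume that the dots in `example.com` and `error.js` are meant literally and that `%l`, `%w` and `%a` cover ASCII characters only.

```python
import re

# Approximate Python translations of the Lua patterns
# (assumption: the dots in the host names are literal).
HONEYPOT_PATTERNS = [
    re.compile(r"^[a-z]+://[A-Za-z0-9]+\.example\.com"),
    re.compile(r"^[a-z]+://[A-Za-z0-9]+[^A-Za-z0-9]+[A-Za-z]+\.example\.com"),
]
ERROR_SCRIPT = re.compile(r"error\.js")


def should_abort(url: str) -> bool:
    """Return True if the filter would drop a request for this URL."""
    if any(p.search(url) for p in HONEYPOT_PATTERNS):
        return True
    return ERROR_SCRIPT.search(url) is not None
```

Running a handful of known good and known bad URLs through `should_abort` is a cheap way to catch a pattern that is too broad before it silently blocks real pages.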

User
Answer 2 · edited 2024-06-01 11:21:05

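Some websites set up defences against spiders and bots, known as "honeypots". These traps usually send bots into dead ends they can never escape. When handling URLs, use a regular expression to sift out the ones that should be skipped during the crawl. You can pass every URL through this check before the crawler follows a link, or have the crawler skip any URL that does not agree with the pattern.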
import re

# .....
# Create a URL pattern here; you will have to edit this to suit your needs
pattern = re.compile(r'^www\.[\w\d]+\.(com|org|net|ng)$')

for url in urls:
    match = pattern.search(url)
    if not match:
        continue
    else:
        # perform normal crawling/scraping activities
        pass
This is one way to get around those links. Hope it helps.
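One detail worth noting: a pattern anchored on `www.` only matches a bare host name, not a full URL with a scheme and path. A minimal sketch of the same filtering idea, assuming the check should apply to the host part of each URL, could look like this. `ALLOWED_HOST` and `crawlable` are hypothetical names, and the pattern is only a placeholder to adapt.

```python
import re
from urllib.parse import urlsplit

# Hypothetical allow-list pattern for host names; adjust to your targets.
ALLOWED_HOST = re.compile(r"^www\.[\w-]+\.(com|org|net|ng)$")


def crawlable(urls):
    """Yield only the URLs whose host name matches ALLOWED_HOST."""
    for url in urls:
        # hostname is None for strings without a network location
        host = urlsplit(url).hostname or ""
        if ALLOWED_HOST.search(host):
            yield url
```

In a spider, the URLs extracted from a page would be passed through `crawlable` before any follow-up request is scheduled.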
