Using Python robotparser

In [28]: rp.set_url("http://anilattech.wordpress.com/robots.txt")

In [29]: rp.parse("""# If you are regularly crawling WordPress.com sites please use our firehose to receive real-time push updates instead.
# Please see http://en.wordpress.com/firehose/ for more details.
Sitemap: http://anilattech.wordpress.com/sitemap.xml
User-agent: IRLbot
Crawl-delay: 3600
User-agent: *
Disallow: /next/
# har har
User-agent: *
Disallow: /activate/
User-agent: *
Disallow: /signup/
User-agent: *
Disallow: /related-tags.php
# MT refugees
User-agent: *
Disallow: /cgi-bin/
User-agent: *
Disallow:""")

In [48]: rp.can_fetch("*","http://anilattech.wordpress.com/signup/")
Out[48]: True

2 Answers

User

Answer 1 · edited 2024-06-26 11:21:43

There are two problems here. First, the rp.parse method takes a list of strings, so you should add .split("\n") to that line.

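The second problem is that the rules for the * user agent are stored in rp.default_entry rather than in rp.entries. If you inspect it, you will see that it contains an Entry object.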

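I'm not sure who is at fault, but the Python implementation of the parser only honors the first User-agent: * section, so in the example you gave only /next/ is disallowed. The other Disallow lines are ignored. I haven't read the specification, so I can't say whether this is a malformed robots.txt file or whether the Python code is wrong. I would assume the former, though.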
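The points above can be checked with a short script. This is a minimal sketch, assuming Python 3, where the module lives at urllib.robotparser (the question appears to use the older Python 2 robotparser module). The robots.txt text is a shortened version of the one in the question, and the variable names are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Shortened version of the robots.txt from the question (assumption:
# two "User-agent: *" groups are enough to show the behavior).
ROBOTS_TXT = """\
Sitemap: http://anilattech.wordpress.com/sitemap.xml
User-agent: IRLbot
Crawl-delay: 3600

User-agent: *
Disallow: /next/

User-agent: *
Disallow: /signup/
"""

rp = RobotFileParser()
rp.set_url("http://anilattech.wordpress.com/robots.txt")

# parse() expects an iterable of lines, not one big string.
rp.parse(ROBOTS_TXT.splitlines())

# Rules for "*" end up in default_entry, not in rp.entries.
print(rp.default_entry)

# Only the first "User-agent: *" group is kept by the parser,
# so /next/ is blocked while /signup/ is still reported as fetchable.
next_allowed = rp.can_fetch("*", "http://anilattech.wordpress.com/next/")
signup_allowed = rp.can_fetch("*", "http://anilattech.wordpress.com/signup/")
print(next_allowed, signup_allowed)
```

Passing the raw string instead of a list of lines makes the parser iterate over single characters, which is why no rules were picked up in the original session.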

User

Answer 2 · edited 2024-06-26 11:21:43

I just found the answers.

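1. The issue is that this robots.txt [from wordpress.com] contains multiple User-agent declarations. The robotparser module does not support this. I worked around it with a small hack that removes the extra User-agent: * lines.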

2. As Andrew pointed out, the argument to parse is a list.
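The answer does not show its workaround, so the following is only one possible way to do it, not the author's code. It is a sketch assuming Python 3's urllib.robotparser; merge_wildcard_groups is a hypothetical helper that collects the rules from every User-agent: * group into a single group before parsing.

```python
from urllib.robotparser import RobotFileParser


def merge_wildcard_groups(text):
    """Hypothetical helper: fold all 'User-agent: *' groups into one.

    Returns a list of lines suitable for RobotFileParser.parse().
    Limitation: a group that names '*' together with other agents
    is not handled specially.
    """
    other_lines = []     # lines that belong to non-wildcard groups
    wildcard_rules = []  # rule lines gathered from every '*' group
    in_wildcard = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        field, _, value = line.partition(":")
        if field.strip().lower() == "user-agent":
            in_wildcard = value.strip() == "*"
            if not in_wildcard:
                other_lines.append(line)
        elif in_wildcard:
            wildcard_rules.append(line)
        else:
            other_lines.append(line)
    # One blank line closes the previous group, then a single '*' group.
    return other_lines + ["", "User-agent: *"] + wildcard_rules


ROBOTS_TXT = """\
Sitemap: http://anilattech.wordpress.com/sitemap.xml
User-agent: IRLbot
Crawl-delay: 3600
User-agent: *
Disallow: /next/
# har har
User-agent: *
Disallow: /signup/
User-agent: *
Disallow:
"""

merged = merge_wildcard_groups(ROBOTS_TXT)
rp = RobotFileParser()
rp.parse(merged)

signup_allowed = rp.can_fetch("*", "http://anilattech.wordpress.com/signup/")
home_allowed = rp.can_fetch("*", "http://anilattech.wordpress.com/")
print(signup_allowed, home_allowed)
```

Because the parser applies the first matching rule, the explicit Disallow lines are kept ahead of the trailing empty Disallow, so /signup/ is now reported as blocked while other paths stay fetchable.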
