urllib RobotFileParser: seemingly conflicting rules in robots.txt

Published 2024-04-20 09:56:05

Here is the relevant part of amazon.co.jp/robots.txt:

User-agent: *
Disallow: /-/
Disallow: /gp/aw/shoppingAids/
Allow: /-/en/

The URL I want to check: "https://www.amazon.co.jp/-/en/035719/dp/B000H4W9WG/ref=sr_1_61?dchild=1&keywords=dot%20matrix%20printer&qid=1617229306&s=computers&sr=1-61"

Now, it matches both the disallow rule Disallow: /-/ and the allow rule Allow: /-/en/.

urllib's RobotFileParser marks the URL as can_fetch=False. I checked the source code, and it seems to process the rules in the order they appear. Since the Disallow rule comes first, it sets the allowance to False and stops there.
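
Roughly, that behavior amounts to first-match semantics. Here is a simplified sketch of the idea as I understand it (an illustration only, not the actual urllib source):

rules = [
    ("Disallow", "/-/"),
    ("Disallow", "/gp/aw/shoppingAids/"),
    ("Allow", "/-/en/"),
]

def first_match_allowed(path):
    # Walk the rules in file order; the first rule whose path is a prefix
    # of the requested path decides the outcome
    for kind, prefix in rules:
        if path.startswith(prefix):
            return kind == "Allow"
    return True  # no rule matched: allowed by default

print(first_match_allowed("/-/en/035719/dp/B000H4W9WG/"))  # False: Disallow: /-/ matches first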

Considering the robots.txt standard, do you know whether this is the correct approach? It seems counterintuitive to me, and I believe the URL should be allowed.

Relevant code:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.amazon.co.jp/robots.txt")
rp.read()
# Returns False, even though Allow: /-/en/ also matches this URL
can_fetch = rp.can_fetch("*", "https://www.amazon.co.jp/-/en/035719/dp/B000H4W9WG/ref=sr_1_61?dchild=1&keywords=dot%20matrix%20printer&qid=1617229306&s=computers&sr=1-61")
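
The same result can be reproduced offline by feeding the quoted rules to parse() instead of downloading the live robots.txt (a minimal sketch using only the snippet above):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Parse the quoted rules directly instead of fetching them over the network
rp.parse("""
User-agent: *
Disallow: /-/
Disallow: /gp/aw/shoppingAids/
Allow: /-/en/
""".splitlines())

url = ("https://www.amazon.co.jp/-/en/035719/dp/B000H4W9WG/"
       "ref=sr_1_61?dchild=1&keywords=dot%20matrix%20printer&qid=1617229306&s=computers&sr=1-61")
print(rp.can_fetch("*", url))  # prints False: the earlier Disallow: /-/ wins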

EDIT: According to Google's standard, it should work the way I expected and the URL should be allowed: "the most specific rule based on the length of the [path] entry trumps the less specific (shorter) rule"

https://developers.google.com/search/docs/advanced/robots/robots_txt#order-of-precedence-for-group-member-lines

EDIT 2: Did some more digging and found this quote:

For Google and Bing, the rule is that the directive with the most characters wins. Here, that’s the disallow directive.

  • Disallow: /blog/ (6 characters)
  • Allow: /blog (5 characters)

If the allow and disallow directives are equal in length, then the least restrictive directive wins. In this case, that would be the allow directive.

Crucially, this is only the case for Google and Bing. Other search engines listen to the first matching directive. In this case, that’s disallow.

By this logic (first matching directive wins, as most other search engines do), RobotFileParser is indeed behaving correctly.
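
For reference, a Google/Bing-style check can be approximated by choosing the longest matching path rather than the first one. This is a minimal sketch under that assumption; it ignores wildcards, $ anchors, and percent-encoding, which real parsers also handle:

def longest_match_allowed(path, rules):
    # Google/Bing precedence: the longest matching path wins;
    # on a tie the Allow rule (least restrictive) wins
    matches = [(len(prefix), kind == "Allow")
               for kind, prefix in rules if path.startswith(prefix)]
    if not matches:
        return True  # no rule matched: allowed by default
    _, is_allow = max(matches)
    return is_allow

rules = [("Disallow", "/-/"), ("Disallow", "/gp/aw/shoppingAids/"), ("Allow", "/-/en/")]
print(longest_match_allowed("/-/en/035719/dp/B000H4W9WG/", rules))
# True: Allow: /-/en/ (6 characters) beats Disallow: /-/ (3 characters)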