Parsing robots.txt and checking HTTP status

Posted 2024-05-19 08:59:46

Male | A programmer who enjoys writing Python code.

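I'm having some trouble parsing robots.txt in Python. I want to put each line of robots.txt into an array, and that part is working now.

After that, I want to combine a base URL with each value in the array (each combination will be a unique URL) and check the status code returned when the page is requested. For example, if the array contains the value "/abc" and the URL is "https://stackoverflow.com", I want to check the HTTP status code of "https://stackoverflow.com/abc" and print it.

The code I have so far is: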

import urllib.request
import urllib.error

# Command to use user input as URL:
# url = input("Input Url" + '\n')

url = 'https://stackoverflow.com/robots.txt'
raw_robots = urllib.request.urlopen(url)
robots = raw_robots.read().decode('utf-8')
result_data_set = {"Disallowed": [], "Allowed": []}

for line in robots.splitlines():
    # Drop trailing comments and surrounding whitespace
    rule = line.split('#', 1)[0].strip()
    if rule.startswith('Allow:'):    # this is for allowed url
        path = rule.split(':', 1)[1].strip()
        if path:
            result_data_set["Allowed"].append(path)
    elif rule.startswith('Disallow:'):    # this is for disallowed url
        path = rule.split(':', 1)[1].strip()
        if path:
            result_data_set["Disallowed"].append(path)

print(result_data_set)

url2 = 'https://stackoverflow.com'

# Loop over the collected paths; iterating the dict itself
# would only yield its keys ("Disallowed", "Allowed").
for paths in result_data_set.values():
    for x in paths:
        try:
            conn = urllib.request.urlopen(url2 + x)
        except urllib.error.HTTPError as e:
            # Return code error (e.g. 404, 501, ...)
            print('HTTPError: {}'.format(e.code))
        except urllib.error.URLError as e:
            # Not an HTTP-specific error (e.g. connection refused)
            print('URLError: {}'.format(e.reason))
        else:
            # Request succeeded; print the status of the final response
            print('{} {}'.format(x, conn.status))

Any help would be greatly appreciated.
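
To make the goal concrete, here is a minimal sketch of the two steps described above (collecting the paths, then building one full URL per path) that can be checked without any network access. The helper names `parse_rules` and `build_urls` and the `sample` text are made up for illustration and are not part of any library; only `urllib.parse.urljoin` comes from the standard library.

```python
from urllib.parse import urljoin


def parse_rules(robots_text):
    """Collect Allow/Disallow paths from robots.txt text (hypothetical helper)."""
    rules = {"Allowed": [], "Disallowed": []}
    for line in robots_text.splitlines():
        # Drop trailing comments and surrounding whitespace
        rule = line.split('#', 1)[0].strip()
        if ':' not in rule:
            continue
        field, value = rule.split(':', 1)
        field, value = field.strip().lower(), value.strip()
        if not value:
            continue  # skip rules with an empty path
        if field == 'allow':
            rules["Allowed"].append(value)
        elif field == 'disallow':
            rules["Disallowed"].append(value)
    return rules


def build_urls(base_url, rules):
    """Join every collected path onto base_url (hypothetical helper)."""
    return [urljoin(base_url, path)
            for paths in rules.values()
            for path in paths]


# Made-up robots.txt content, used only for this example
sample = """User-agent: *
Disallow: /abc
Disallow: /private/  # keep crawlers out
Allow: /public
Disallow:
"""

rules = parse_rules(sample)
urls = build_urls('https://stackoverflow.com', rules)
print(rules)
print(urls)
```

Each entry in `urls` could then be passed to `urllib.request.urlopen` inside a `try`/`except` block to read the status code. Note that real robots.txt files often contain wildcard patterns such as `*`, which are not literal paths, so requesting them will probably not return a meaningful page.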

Tags: https import com url for data line error

0 answers

No answers yet.
