TypeError: cannot use a string pattern on a bytes-like object in Python

import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class EmailSpider(CrawlSpider):
    name = 'EmailScraper'
    emailHistory = {}
    custom_settings = {
        'ROBOTSTXT_OBEY': False
        # ,'DEPTH_LIMIT' : 6
    }

    emailRegex = re.compile((r"([a-zA-Z0-9_{|}~-]+(?:\.[a-zA-Z0-9_"
                             r"{|}~-]+)*(@)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9]){2,}?(\."
                             r"))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"))

    def __init__(self, url=None, *args, **kwargs):
        super(EmailSpider, self).__init__(*args, **kwargs)
        self.start_urls = [url]
        self.allowed_domains = [url.replace(
            "http://", "").replace("www.", "").replace("/", "")]

    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    def parse_item(self, response):
        emails = re.findall(EmailSpider.emailRegex, response._body)
        for email in emails:
            if email[0] in EmailSpider.emailHistory:
                continue
            else:
                EmailSpider.emailHistory[email[0]] = True
                yield {
                    'site': response.url,
                    'email': email[0]
                }

1 answer

User

#1 · Posted 2024-05-08 11:43:20

response._body is not a str, so a regex compiled from a str pattern cannot be applied to it. If you check its type, you will find that it is a bytes object.

>>> type(response._body)
<class 'bytes'>
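The error can be reproduced without Scrapy. The following is a minimal sketch, assuming a made-up byte string `body` that stands in for `response._body` and a simplified pattern that is not the spider's `emailRegex`:

```python
import re

# Hypothetical stand-in for response._body; not taken from a real response.
body = b"Contact: alice@example.com"

# A pattern compiled from a str can only be applied to str input.
pattern = re.compile(r"[\w.-]+@[\w.-]+")

try:
    pattern.findall(body)  # str pattern applied to bytes
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

The message that gets printed is the same TypeError text quoted in the question title.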

Decoding it to a str, for example with UTF-8, should solve the problem.

>>> type(response._body.decode('utf-8'))
<class 'str'>

The final re call then looks like this:

emails = re.findall(EmailSpider.emailRegex, response._body.decode('utf-8'))
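As a rough sketch of the two ways to make the types agree, the example below either decodes the bytes before searching or compiles the pattern from a bytes literal. The helper `find_emails`, the sample `body`, and the simplified patterns are all hypothetical and are not part of the original spider. Passing `errors="replace"` is an assumption made here so that a page which is not valid UTF-8 does not raise `UnicodeDecodeError`.

```python
import re

# Simplified illustrative patterns, not the spider's emailRegex.
EMAIL_STR = re.compile(r"[\w.-]+@[\w.-]+\.[a-z]{2,}")
EMAIL_BYTES = re.compile(rb"[\w.-]+@[\w.-]+\.[a-z]{2,}")


def find_emails(body, encoding="utf-8"):
    """Decode a bytes body to str, then search it with the str pattern."""
    text = body.decode(encoding, errors="replace")
    return EMAIL_STR.findall(text)


# Hypothetical stand-in for response._body.
body = b"<p>alice@example.com, bob@example.org</p>"

decoded_matches = find_emails(body)        # list of str
bytes_matches = EMAIL_BYTES.findall(body)  # list of bytes
```

Decoding first is usually more convenient because the yielded items are then ordinary strings. For text responses, Scrapy also exposes `response.text`, which is the body already decoded with the response's detected encoding, so it can be passed to `re.findall` directly in place of `response._body.decode('utf-8')`.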
