I want to scrape this website:
www.united-church.ca/search/locator/all?keyw=&mission_units_ucc_ministry_type_advanced=10&locll=
I did manage to scrape it, but I can't get the email addresses. Can you help me scrape them? I am using Scrapy.
# -*- coding: utf-8 -*-
import scrapy

from ..items import ChurchItem


class ChurchSpiderSpider(scrapy.Spider):
    name = 'church_spider'
    page_number = 1
    start_urls = ['https://www.united-church.ca/search/locator/all?keyw=&mission_units_ucc_ministry_type_advanced=10&locll=']

    def parse(self, response):
        # Each ".icon-ministry" block holds one church listing.
        container = response.css(".icon-ministry")
        for t in container:
            # Create a fresh item per listing so yielded items are independent.
            items = ChurchItem()
            church_name = t.css(".field-name-locator-ministry-title a::text").extract()
            church_phone = t.css(".field-name-field-phone::text").extract()
            church_address = t.css(".thoroughfare::text").extract()
            # The email is split across several spans, so this returns a list of fragments.
            church_email = t.css(".field-name-field-mu-email span::text").extract()
            items["church_name"] = church_name
            items["church_phone"] = church_phone
            items["church_address"] = church_address
            items["church_email"] = church_email
            yield items

        # next_page = 'https://www.united-church.ca/search/locator/all?keyw=&mission_units_ucc_ministry_type_advanced=10&locll=&page=' + str(ChurchSpiderSpider.page_number)
        # if ChurchSpiderSpider.page_number <= 110:
        #     ChurchSpiderSpider.page_number += 1
        #     yield response.follow(next_page, callback=self.parse)
I found a partial solution, but it is still incomplete. The output is now:
{'church_address': ['7763 Highway 21'],
'church_email': ['herbklaehn', ' [at] ', 'gmail.com'],
'church_name': ['Allenford United Church'],
'church_phone': ['519-35-6232']}
Can you help me replace [at] with @ and combine the parts into a single string?
Here is the complete code for the asker:
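A minimal sketch of one way to do what the question asks, not necessarily the code originally posted in this answer. It assumes the email always arrives as a list of text fragments like the one in the output above; the helper name `join_email` is made up for illustration.

```python
def join_email(parts):
    """Join fragments such as ['name', ' [at] ', 'host.com'] into one address."""
    # Concatenate the fragments, swap the "[at]" placeholder for "@",
    # then remove the whitespace that surrounded the placeholder.
    joined = "".join(parts).replace("[at]", "@")
    return "".join(joined.split())


# Inside the spider's loop this would replace the plain .extract() assignment:
#   parts = t.css(".field-name-field-mu-email span::text").extract()
#   items["church_email"] = join_email(parts)

print(join_email(['herbklaehn', ' [at] ', 'gmail.com']))  # herbklaehn@gmail.com
```

Keeping the cleanup in a small function means the same logic can be reused from an item pipeline if the selectors change later.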
You can try using Selenium for web scraping. I tried this code and it gave perfect results.
Result:
Using Beautiful Soup
A simple way to get the email is to find the div with class='field-name-field-mu-email', then replace the odd display text with a proper email format.
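The approach just described could look roughly like the sketch below. The HTML snippet is a hypothetical stand-in modelled on the scraped output shown in the question, so the real page markup may differ, and `extract_emails` is an illustrative name rather than part of any library.

```python
from bs4 import BeautifulSoup

# Hypothetical markup based on the question's output; the live page may differ.
SAMPLE_HTML = """
<div class="field-name-field-mu-email">
  <span>herbklaehn</span><span> [at] </span><span>gmail.com</span>
</div>
"""


def extract_emails(html):
    """Return one cleaned email string per matching div."""
    soup = BeautifulSoup(html, "html.parser")
    emails = []
    for div in soup.find_all("div", class_="field-name-field-mu-email"):
        # get_text() concatenates the text of all nested spans.
        text = div.get_text()
        # Swap the placeholder for "@" and strip all remaining whitespace.
        emails.append("".join(text.replace("[at]", "@").split()))
    return emails


print(extract_emails(SAMPLE_HTML))  # ['herbklaehn@gmail.com']
```

For the live site, the `html` argument would be the page source fetched by whichever client is in use, for example the body of a `requests` response or Selenium's `page_source`.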