Using Python Scrapy to scrape data that is not in the page source

Posted 2024-04-28 15:06:35


I want to scrape the email addresses from this page:

https://threebestrated.ca/children-dentists-in-airdrie-ab

But the output is null, because the addresses do not appear in the page source ("View page source").

Here is my code:

import scrapy
class BooksSpider(scrapy.Spider):
    name = "3bestrated"
    allowed_domains = ['threebestrated.ca']
    start_urls = ["https://threebestrated.ca/children-dentists-in-airdrie-ab"]

    def parse(self, response):
        emails = response.xpath("//a[contains(@href, 'mailto:')]/text()").getall()
        yield {
            "a": emails,
        }

1 answer
User
#1 · Posted 2024-04-28 15:06:35

The email addresses are encoded in some way to prevent naive scraping. Here is one such encoded email address:

<p>
    <a href="/cdn-cgi/l/email-protection#3851565e57784b515d4a4a595c5d564c5954165b59074b4d5a525d5b4c056a5d494d5d4b4c1d0a084c504a574d5f501d0a086c504a5d5d7a5d4b4c6a594c5d5c165b59">
        <i class="fa fa-envelope-o"></i>
        <span class="__cf_email__" data-cfemail="70191e161f3003191502021114151e04111c5e1311">[email&#160;protected]</span> 
   </a>
</p>

It is then decoded in the browser by a JavaScript script.

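So your options are:

  • Reverse engineer the decoding script
  • Use some sort of JavaScript runtime to execute the decoding script
  • If you are going to use a JavaScript runtime anyway, you may as well use Selenium from the start (there seems to be a Scrapy Selenium middleware you can use if you want to stick with Scrapy)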

Edit: for fun, I reverse engineered it:

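def deobfuscate(string, start_index):
    # The first hex byte at start_index is the XOR key; every following
    # hex byte is one character of the address XORed with that key.

    def extract_hex(string, index):
        # Read two hex digits starting at index and return them as an int.
        substring = string[index: index+2]
        return int(substring, 16)

    key = extract_hex(string, start_index)
    for index in range(start_index+2, len(string), 2):
        yield chr(extract_hex(string, index) ^ key)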


def process_tag(tag):
    url_fragment = "/cdn-cgi/l/email-protection#"
    href = tag["href"]
    start_index = href.find(url_fragment)
    if start_index > -1:
        return "".join(deobfuscate(href, start_index + len(url_fragment)))
    return None

def main():

    import requests
    from bs4 import BeautifulSoup as Soup
    from urllib.parse import unquote

    url = "https://threebestrated.ca/children-dentists-in-airdrie-ab"

    response = requests.get(url)
    response.raise_for_status()

    soup = Soup(response.content, "html.parser")

    print("E-Mail Addresses from <a> tags:")
    for email in map(unquote, filter(None, map(process_tag, soup.find_all("a", href=True)))):
        print(email)

    cf_elem_attr = "data-cfemail"

    print("\nE-Mail Addresses from tags where \"{}\" attribute is present:".format(cf_elem_attr))
    for tag in soup.find_all(attrs={cf_elem_attr:True}):
        email = unquote("".join(deobfuscate(tag[cf_elem_attr], 0)))
        print(email)
        

if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

E-Mail Addresses from <a> tags:
info@sierradental.ca?subject=Request through ThreeBestRated.ca
reviews@threebestrated.ca?subject=My Review for Dr. Amin Salmasi in Airdrie
info@mainstreetdentalairdrie.ca?subject=Request through ThreeBestRated.ca
reviews@threebestrated.ca?subject=My Review for Dr. James Yue in Airdrie
friends@toothpals.ca?subject=Request through ThreeBestRated.ca
reviews@threebestrated.ca?subject=My Review for Dr. Christine Bell in Airdrie
support@threebestrated.ca

E-Mail Addresses from tags where "data-cfemail" attribute is present:
info@sierradental.ca
friends@toothpals.ca
support@threebestrated.ca
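As a quick sanity check of the scheme described above, here is a minimal standard-library sketch that decodes the `data-cfemail` value from the sample markup and strips the `?subject=...` suffix seen in the `<a>` tag output. The function names `decode_cfemail` and `address_only` are made up for this illustration, and the sketch assumes the encoding really is what the reverse-engineered code suggests: the first hex byte is an XOR key for all following bytes.

```python
from urllib.parse import unquote


def decode_cfemail(hex_string):
    """Decode a hex string whose first byte is an XOR key for the rest.

    Assumption: this mirrors the reverse-engineered deobfuscate() above,
    not any documented specification.
    """
    raw = bytes.fromhex(hex_string)
    key, payload = raw[0], raw[1:]
    # latin-1 maps each byte to the same code point, like chr() above.
    return bytes(b ^ key for b in payload).decode("latin-1")


def address_only(decoded):
    """Drop a trailing '?subject=...' part and undo percent-encoding."""
    return unquote(decoded.split("?", 1)[0])


# data-cfemail value copied from the sample <span> in the answer
sample = "70191e161f3003191502021114151e04111c5e1311"
print(decode_cfemail(sample))
print(address_only("info@sierradental.ca?subject=Request%20through%20ThreeBestRated.ca"))
```

If you want to stay inside Scrapy, one possible approach is to collect the hex values in `parse` with something like `response.css("[data-cfemail]::attr(data-cfemail)").getall()` and pass each one to `decode_cfemail`. I have not run that against the live site, so treat it as a starting point.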
