Scraping data that is not in the page source with Python Scrapy

1 answer

User

#1 · Posted 2024-04-28 15:06:35

The email addresses are encoded in some way to defeat naive scraping. Here is one such encoded email address:

<p>
    <a href="/cdn-cgi/l/email-protection#3851565e57784b515d4a4a595c5d564c5954165b59074b4d5a525d5b4c056a5d494d5d4b4c1d0a084c504a574d5f501d0a086c504a5d5d7a5d4b4c6a594c5d5c165b59">
        <i class="fa fa-envelope-o"></i>
        <span class="__cf_email__" data-cfemail="70191e161f3003191502021114151e04111c5e1311">[email&#160;protected]</span> 
   </a>
</p>

The page then decodes it with a JavaScript script.

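So your options are:

- Reverse-engineer the decoding script
- Use some kind of JavaScript runtime to execute the decoding script

If you are going to use a JavaScript runtime, you might as well use Selenium from the start (there seems to be a Scrapy Selenium middleware that you could use if you want to stick with Scrapy).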

Edit: for fun, I reverse-engineered it:

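def deobfuscate(string, start_index):
    # The first hex byte at start_index is the XOR key; every following
    # hex byte is one character of the address XORed with that key.

    def extract_hex(string, index):
        # Parse the two hex digits starting at index as an integer.
        substring = string[index: index+2]
        return int(substring, 16)

    key = extract_hex(string, start_index)
    for index in range(start_index+2, len(string), 2):
        yield chr(extract_hex(string, index) ^ key)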


def process_tag(tag):
    url_fragment = "/cdn-cgi/l/email-protection#"
    href = tag["href"]
    start_index = href.find(url_fragment)
    if start_index > -1:
        return "".join(deobfuscate(href, start_index + len(url_fragment)))
    return None

def main():

    import requests
    from bs4 import BeautifulSoup as Soup
    from urllib.parse import unquote

    url = "https://threebestrated.ca/children-dentists-in-airdrie-ab"

    response = requests.get(url)
    response.raise_for_status()

    soup = Soup(response.content, "html.parser")

    print("E-Mail Addresses from <a> tags:")
    for email in map(unquote, filter(None, map(process_tag, soup.find_all("a", href=True)))):
        print(email)

    cf_elem_attr = "data-cfemail"

    print("\nE-Mail Addresses from tags where \"{}\" attribute is present:".format(cf_elem_attr))
    for tag in soup.find_all(attrs={cf_elem_attr:True}):
        email = unquote("".join(deobfuscate(tag[cf_elem_attr], 0)))
        print(email)
        

if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

E-Mail Addresses from <a> tags:
info@sierradental.ca?subject=Request through ThreeBestRated.ca
reviews@threebestrated.ca?subject=My Review for Dr. Amin Salmasi in Airdrie
info@mainstreetdentalairdrie.ca?subject=Request through ThreeBestRated.ca
reviews@threebestrated.ca?subject=My Review for Dr. James Yue in Airdrie
friends@toothpals.ca?subject=Request through ThreeBestRated.ca
reviews@threebestrated.ca?subject=My Review for Dr. Christine Bell in Airdrie
support@threebestrated.ca

E-Mail Addresses from tags where "data-cfemail" attribute is present:
info@sierradental.ca
friends@toothpals.ca
support@threebestrated.ca
>>>
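
The scheme can also be checked offline, without fetching the page. The following is a minimal sketch, assuming the encoding is exactly what the script above implies (first byte is the key, the rest is XORed with it) and that the decoded bytes are UTF-8. The names decode_cfemail, CFEMAIL_RE and emails_from_html are hypothetical helpers introduced for illustration, and the regex is a shortcut that stands in for a real HTML parser. The sample value is the data-cfemail attribute from the snippet at the top of this answer.

```python
import re
from urllib.parse import unquote


def decode_cfemail(encoded):
    """Decode a hex string whose first byte is the XOR key for the rest."""
    raw = bytes.fromhex(encoded)
    key, payload = raw[0], raw[1:]
    # Assumption: the decoded payload is UTF-8 text.
    return bytes(b ^ key for b in payload).decode("utf-8")


# Hypothetical shortcut: pull data-cfemail values out of raw HTML with a regex.
CFEMAIL_RE = re.compile(r'data-cfemail="([0-9a-fA-F]+)"')


def emails_from_html(html):
    """Return the decoded address for every data-cfemail attribute found."""
    return [unquote(decode_cfemail(value)) for value in CFEMAIL_RE.findall(html)]


sample = (
    '<span class="__cf_email__" '
    'data-cfemail="70191e161f3003191502021114151e04111c5e1311">'
    "[email&#160;protected]</span>"
)
print(emails_from_html(sample))  # ['info@sierradental.ca']
```

In a Scrapy callback, the attribute values could presumably be collected with something like response.css("[data-cfemail]::attr(data-cfemail)").getall() and passed to decode_cfemail, which would avoid the need for requests and BeautifulSoup; that combination is not tested here.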
