How can I scrape the names of these bars from Google?

Posted 2024-10-01 17:22:55


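I have been trying to scrape the names of bars in Central, Hong Kong from this link: link

However, I cannot scrape the data using the class='dbg0pd' attribute.

Code: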

import requests
from bs4 import BeautifulSoup
bars = 'https://www.google.com/search?q=bar%20hong%20kong%20central&biw=1246&bih=714&sz=16&tbm=lcl&sxsrf=ALeKk02B3dHjl422M1wOkUldNgdUeC6RVA%3A1621869556252&ei=9MOrYMzsDobZ-QbhyK6YDA&oq=bar+hong+kong+central&gs_l=psy-ab.12...0.0.0.2313.0.0.0.0.0.0.0.0..0.0....0...1c..64.psy-ab..0.0.0....0.vxIZeVhM24g&tbs=lrf:!1m4!1u3!2m2!3m1!1e1!1m4!1u2!2m2!2m1!1e1!1m4!1u16!2m2!16m1!1e1!1m4!1u16!2m2!16m1!1e2!2m1!1e2!2m1!1e16!2m1!1e3!3sIAE,lf:1,lf_ui:9&rlst=f#rlfi=hd:;si:;mv:[[22.287261599999997,114.1668826],[22.2769662,114.1507584]];tbs:lrf:!1m4!1u3!2m2!3m1!1e1!1m4!1u2!2m2!2m1!1e1!1m4!1u16!2m2!16m1!1e1!1m4!1u16!2m2!16m1!1e2!2m1!1e2!2m1!1e16!2m1!1e3!3sIAE,lf:1,lf_ui:9'
info = requests.get(bars)
soup = BeautifulSoup(info.text, "lxml")
soup.select('.dbg0pd')

The code returns an empty list [], and I have tried a few other classes as well.


3 answers

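Google is a very sophisticated search engine and cannot be scraped with a single plain GET request. It also has anti-bot protection to stop people from scraping the site (because Google wants developers to pay for and use its API). Here is a Google search module I wrote in Python as part of a project I am working on.

Requests sent through this code should be accepted by Google's servers because they imitate the behavior of a real web browser: the GET request carries a User-Agent header, and the necessary cookies are generated.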

import datetime

import requests


class google_search():
    def __init__(self):
        self.user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Firefox/38.0"
        self.url = "https://www.google.co.uk"
        self.domain = ".google.co.uk"
        self.output_filename = "output.html"
        self.write_2_file = False
        self.return_content = True

    def search(self, query):
        # Build the request headers; the User-Agent imitates a desktop Firefox browser
        self.head = {
            "User-Agent": self.user_agent,
        }

        # Build the cookies
        self.current_date = datetime.datetime.now()
        self.todays_date = self.current_date.strftime("%Y-%m-%d-%S")
        # Expire about one month from now. timedelta avoids the invalid dates that
        # month+1 / day-1 arithmetic produces in December or on the 1st of a month.
        self.date_in_month = (
            self.current_date + datetime.timedelta(days=30)
        ).strftime("%a, %d-%b-%Y %H:%M:%S")

        self.expiry_date = f"expires={self.date_in_month} GMT"
        # This must be an f-string, otherwise the date is never substituted
        self.consent_cookie_fname = f"YES+cb.{self.current_date.strftime('%Y%m%d-%m-p0')}.en+FX+949"
        self.cookie = {
            "1P_JAR" : f"={self.todays_date}; {self.expiry_date}; path=/; domain={self.domain}; Secure; SameSite=none",
            "CONSENT" : f"{self.consent_cookie_fname}; Domain={self.domain}; {self.expiry_date}; Path=/; Secure; SameSite=none"
        }

        # Send the request through the session (the original created it but never used it)
        self.s = requests.Session()
        self.res = self.s.get(f"{self.url}{query}", headers=self.head, cookies=self.cookie)
        html = self.res.content

        # Either write the response body to a file or return it to the caller
        if self.write_2_file:
            with open(self.output_filename, "wb") as file:
                file.write(html)

        elif self.return_content:
            return html

url = "https://www.google.com/search?q=bar%20hong%20kong%20central&biw=1246&bih=714&sz=16&tbm=lcl&sxsrf=ALeKk02B3dHjl422M1wOkUldNgdUeC6RVA%3A1621869556252&ei=9MOrYMzsDobZ-QbhyK6YDA&oq=bar+hong+kong+central&gs_l=psy-ab.12...0.0.0.2313.0.0.0.0.0.0.0.0..0.0....0...1c..64.psy-ab..0.0.0....0.vxIZeVhM24g&tbs=lrf:!1m4!1u3!2m2!3m1!1e1!1m4!1u2!2m2!2m1!1e1!1m4!1u16!2m2!16m1!1e1!1m4!1u16!2m2!16m1!1e2!2m1!1e2!2m1!1e16!2m1!1e3!3sIAE,lf:1,lf_ui:9&rlst=f#rlfi=hd:;si:;mv:[[22.287261599999997,114.1668826],[22.2769662,114.1507584]];tbs:lrf:!1m4!1u3!2m2!3m1!1e1!1m4!1u2!2m2!2m1!1e1!1m4!1u16!2m2!16m1!1e1!1m4!1u16!2m2!16m1!1e2!2m1!1e2!2m1!1e16!2m1!1e3!3sIAE,lf:1,lf_ui:9"

req = google_search()
req.url = url
html = req.search("")

You can check out the full code in my GitHub repository here.
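As a side note, the long URL above does not have to be pasted by hand. The following is a minimal sketch, not part of the original module, that builds a shorter search URL from a query string. It keeps only the `q` and `tbm=lcl` parameters seen in the question's URL; the helper name `build_local_search_url` is hypothetical, and whether Google serves the same results without the remaining parameters is an assumption.

```python
from urllib.parse import urlencode


def build_local_search_url(query, base="https://www.google.com/search"):
    # "q" carries the search text and "tbm=lcl" is copied from the question's URL;
    # every other parameter from that URL is deliberately left out (assumption).
    return f"{base}?{urlencode({'q': query, 'tbm': 'lcl'})}"


print(build_local_search_url("bar hong kong central"))
# https://www.google.com/search?q=bar+hong+kong+central&tbm=lcl
```

The result can be assigned to `req.url` in the same way as the hard-coded URL.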

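What you need to keep in mind is that Google is very restrictive when its services are requested automatically rather than through the APIs it provides. Try running your example and printing the title of the HTML you get back. It will probably be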

Before proceeding to Google Search

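That is why you get an empty list: the page your script receives is not the page you see in your browser (which is probably full of Google cookies and already familiar to Google's systems).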
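The title check described above can be sketched as follows. This is a minimal illustration using only the standard library. The two HTML strings are made-up stand-ins for `info.text` from the question's script, and the helper names `page_title` and `is_consent_page` are hypothetical.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_title(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def is_consent_page(html):
    # Assumption: the interstitial is recognised by the title quoted above.
    return "Before proceeding" in page_title(html)


# Made-up stand-ins for the response body returned by requests.get(...).text
consent_html = "<html><head><title>Before proceeding to Google Search</title></head><body></body></html>"
results_html = "<html><head><title>bar hong kong central - Google Search</title></head><body></body></html>"

print(page_title(consent_html))
print(is_consent_page(consent_html), is_consent_page(results_html))
```

If the check returns True for the real response, no selector will match, because the bar listings are simply not in that page.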

When you work with Google this way, you need to imitate real human behavior and spoof a realistic browser configuration.

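Google Maps is a JavaScript-driven website. To make it work with BS4, you would need to use regular expressions to parse the window.APP_INITIALIZATION_STATE variable block (see the page source) and find what you are looking for.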
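A rough sketch of that idea is shown below, under loud assumptions: the sample page source is invented, the real payload is undocumented and much larger, and it may not be strict JSON. A regular expression only locates the start of the assignment; `json.JSONDecoder.raw_decode` then reads one complete value from that position. The helper names `extract_app_state` and `iter_strings` are hypothetical.

```python
import json
import re

# Matches the start of the assignment; the value itself is decoded by json below.
APP_STATE_START = re.compile(r"window\.APP_INITIALIZATION_STATE\s*=\s*")


def extract_app_state(page_source):
    """Return the decoded value assigned to the variable, or None if not found."""
    match = APP_STATE_START.search(page_source)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index
        state, _end = json.JSONDecoder().raw_decode(page_source, match.end())
    except json.JSONDecodeError:
        return None  # the payload was not valid JSON (possible on the real site)
    return state


def iter_strings(node):
    """Yield every string found anywhere in a nested list structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, list):
        for child in node:
            yield from iter_strings(child)


# Invented sample; the real structure has to be inspected in the page source.
sample_source = (
    "<script>window.APP_INITIALIZATION_STATE="
    '[[["Quinary"],["The Old Man"]],null,[1,2]];'
    "window.OTHER=1;</script>"
)
state = extract_app_state(sample_source)
print(list(iter_strings(state)))
```

On a real page the strings would include far more than bar names, so the interesting positions in the nested lists still have to be found by inspection.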

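BeautifulSoup cannot scrape dynamic websites on its own. That is why you get an empty list: the response you receive does not contain the class you are looking for.

To make this work, you can use the selenium library, which automates a real browser: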

import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

options = Options()
options.page_load_strategy = 'normal'
driver = webdriver.Chrome(options=options)

# The find_element_by_* helpers were removed in Selenium 4.3; find_element(By...) replaces them.
# Open the search results page
driver.get('https://www.google.com/search?q=central+hong+kong+bar')
# Click the "Maps" tab in Google Search, which opens the results in Google Maps
driver.find_element(By.XPATH, '//*[@id="hdtb-msb"]/div[1]/div/div[2]/a').click()

# This part is awkward but very simple. A while loop would be a better solution.
# Locate the first bar in the results pane
element_container = driver.find_element(By.XPATH, '//*[@id="pane"]/div/div[1]/div/div/div[4]/div[1]/div[1]/div/a')
# Scroll down to the "end" from the first bar
element_container.send_keys(Keys.END)
# Sleep for 3 seconds while more bars are loaded
time.sleep(3)
# Scroll down to the "end" again
element_container.send_keys(Keys.END)
time.sleep(3)
# Scroll down to the "end" again
element_container.send_keys(Keys.END)

# Find every bar name by its CSS selector and print it
for name in driver.find_elements(By.CSS_SELECTOR, '.qBF1Pd-haAclf'):
    print(name.text)
driver.quit()

Output:

Quinary
The Old Man
The Envoy
COA
001
ROOM 309
HONI HONI Tiki Cocktail Lounge
ORIGIN gin bar
The Iron Fairies Hong Kong
Stockton
Tell Camellia Cocktail Bar
The Pontiac
Frank's Library
Dr. Fern's Gin Parlour
Wahtiki Island Lounge
Draft Land HK
The Wise King
The Diplomat Hong Kong
Karma Lounge
Geronimo Shot Bar HK
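The Selenium script mentions that a while loop would be a better way to scroll. One possible shape for such a loop is sketched below; it is an illustration, not the answer author's code. It keeps scrolling until the number of loaded results stops growing. The function `scroll_until_stable` and the `FakeResultsPane` class are hypothetical, and the fake pane stands in for the browser so that the sketch runs without Selenium.

```python
import time


def scroll_until_stable(count_items, scroll_once, pause=3.0, max_rounds=20):
    """Scroll until the number of loaded items stops growing.

    count_items: callable returning how many results are currently loaded.
    scroll_once: callable that triggers one scroll (for example, sends Keys.END).
    """
    previous = count_items()
    for _ in range(max_rounds):
        scroll_once()
        time.sleep(pause)  # give the page time to load more results
        current = count_items()
        if current <= previous:
            break  # nothing new appeared, so assume the list is complete
        previous = current
    return previous


# Stand-in for a browser: each scroll "loads" the next batch of results.
class FakeResultsPane:
    def __init__(self, batches):
        self._batches = list(batches)
        self.loaded = self._batches.pop(0)

    def scroll(self):
        if self._batches:
            self.loaded = self._batches.pop(0)


pane = FakeResultsPane([7, 14, 20])
total = scroll_until_stable(lambda: pane.loaded, pane.scroll, pause=0)
print(total)
```

With Selenium, `count_items` could be `lambda: len(driver.find_elements(By.CSS_SELECTOR, '.qBF1Pd-haAclf'))` and `scroll_once` could be `lambda: element_container.send_keys(Keys.END)`. The `max_rounds` cap guards against a page that never stops loading.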

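Alternatively, you can use the Google Maps API from SerpApi. It is a paid API with a free trial of 5,000 searches.

The main difference is that you do not have to figure out how to scrape a complex JavaScript-driven website, how to solve a CAPTCHA (if one appears), or how to find proxies (if they are needed). Check out the Playground.

Code to integrate: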

from serpapi import GoogleSearch

params = {
  "api_key": "YOUR_API_KEY",
  "engine": "google_maps",
  "type": "search",
  "google_domain": "google.com",
  "q": "central hong kong bar",
  "hl": "en",
  "ll": "@22.2822068,114.1511132,16z"
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['local_results']:
    bar_name = result['title']
    print(bar_name)

Output:

Quinary
COA
001
The Old Man
ROOM 309
ORIGIN gin bar
The Envoy
Wahtiki Island Lounge
The China Bar, Lan Kwai Fong
The Iron Fairies Hong Kong
Captain's Bar
Le Boudoir
Bar De Luxe
Please Don't Tell
Owl Lounge HK
Tell Camellia Cocktail Bar
HONI HONI Tiki Cocktail Lounge
J.Boroski
Frank's Library
Geronimo Shot Bar HK

Disclaimer, I work for SerpApi.
