BeautifulSoup does not reach a child element

Posted on 2024-06-18 17:17:59


I wrote the following code to try to scrape a Google Scholar page:

import requests as req
from bs4 import BeautifulSoup as soup

url = r'https://scholar.google.com/scholar?hl=en&q=Sustainability and the measurement of wealth: further reflections'

session = req.Session()
content = session.get(url)
html2bs = soup(content.content, 'lxml')
gs_cit = html2bs.select('#gs_cit')
gs_citd = html2bs.find('div', {'id': 'gs_citd'})
gs_cit1 = html2bs.find('div', {'id': 'gs_cit1'})

But gs_citd only gives me this single line, <div aria-live="assertive" id="gs_citd"></div>, and never reaches any of the levels nested below it, while gs_cit1 returns None.

Just like what appears in this screenshot.

I want to reach the highlighted class so that I can grab the BibTeX citation.

Can anyone help?


Tags: import, div, gs, id, url, session, as, content
1 Answer

#1 · Posted on 2024-06-18 17:17:59

OK, I figured it out. I used the selenium module for Python, which drives a real browser and lets you do things like click links and capture the resulting HTML. While solving this I ran into another problem: the citation popup is filled in by JavaScript after the "Cite" link is clicked (which is also why plain requests only ever sees the empty div), so the page has to be given time to load, otherwise the popup div just contains "Loading...". I used the Python time module's time.sleep(2) to wait 2 seconds so the content could load. Then I parsed the resulting HTML output with BeautifulSoup to find the anchor tag with the class "gs_citi", pulled the href out of that anchor, and requested it with the "requests" module. Finally, I wrote the decoded response to a local file, scholar.bib.

I installed chromedriver and selenium on my Mac using the instructions here: https://gist.github.com/guylaor/3eb9e7ff2ac91b7559625262b8a6dd5f

Then I code-signed the Python binary so the macOS firewall would stop complaining, following the instructions here: Add Python to OS X Firewall Options?

Here is the code I used to generate the output file scholar.bib:

import os
import time
from selenium import webdriver
from bs4 import BeautifulSoup as soup
import requests as req

# Setup Selenium Chrome Web Driver
chromedriver = "/usr/local/bin/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)

# Navigate in Chrome to specified page.
driver.get("https://scholar.google.com/scholar?hl=en&q=Sustainability and the measurement of wealth: further reflections")

# Find "Cite" link by looking for anchors that contain "Cite" - second link selected "[1]"
link = driver.find_elements_by_xpath('//a[contains(text(), "' + "Cite" + '")]')[1]
# Click the link
link.click()

print("Waiting for page to load...")
time.sleep(2) # Sleep for 2 seconds
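# Note: a fixed 2-second sleep is fragile on slow connections. An explicit
# wait is a more robust alternative (a sketch, not part of the original
# solution), blocking until the popup's export links actually appear:
#   from selenium.webdriver.support.ui import WebDriverWait
#   from selenium.webdriver.support import expected_conditions as EC
#   from selenium.webdriver.common.by import By
#   WebDriverWait(driver, 10).until(
#       EC.presence_of_element_located((By.CSS_SELECTOR, 'a.gs_citi')))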

# Grab the source of the current page in Chrome after the wait
source = driver.page_source

# We are done with the driver so quit.
driver.quit()

# Use BeautifulSoup to parse the html source and use "html.parser" as the Parser
soupify = soup(source, 'html.parser')

# Find the first anchor with the class "gs_citi"
gs_citt = soupify.find('a', {'class': 'gs_citi'})

# Get the href attribute of the first anchor found
href = gs_citt['href']

print("Fetching: ", href)

# Instantiate a new requests session
session = req.Session()

# Get the response object of href
content = session.get(href)
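# Note: the original answer reports this plain request working, but Scholar
# may reject clients that lack browser cookies (an assumption on my part).
# If so, the Selenium session's cookies can be copied into the requests
# session before driver.quit() is called:
#   for c in driver.get_cookies():
#       session.cookies.set(c['name'], c['value'])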

# Get the content and then decode() it.
bibtex_html = content.content.decode()

# Write the decoded data to a file named scholar.bib
with open("scholar.bib", "w") as file:
    file.write(bibtex_html)
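Note that Selenium 4 removed the find_elements_by_* helpers and now passes the chromedriver path through a Service object. A minimal sketch of the equivalent setup on Selenium 4+, assuming Scholar still renders the "Cite" link and the gs_citi class the same way:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 wraps the chromedriver path in a Service object.
driver = webdriver.Chrome(service=Service("/usr/local/bin/chromedriver"))
driver.get("https://scholar.google.com/scholar?hl=en&q=Sustainability and the measurement of wealth: further reflections")

# The find_elements_by_xpath helper is gone; By locators replace it.
link = driver.find_elements(By.XPATH, '//a[contains(text(), "Cite")]')[1]
link.click()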

Hope this helps anyone looking for a solution to this problem.

The resulting scholar.bib file contains the fetched BibTeX entry.
