When trying to web scrape NCBI with Selenium, the data doesn't load, and it isn't contained in an element with an ID I can wait for

Posted 2024-09-30 22:26:37


I'm trying to scrape gene data from web pages like https://www.ncbi.nlm.nih.gov//nuccore/KC208619.1?report=fasta

I'm using Beautiful Soup and Selenium.

The data sits inside an element with the id viewercontent1. When I print it out with this code:

from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
import re

secondDriver = webdriver.Chrome(executable_path='/Users/me/Documents/chloroPlastGenScrape/chromedriver')

newLink = "https://www.ncbi.nlm.nih.gov//nuccore/KC208619.1?report=fasta"
secondDriver.implicitly_wait(10)
# note: this wait runs before .get(), so it only checks the blank start page
WebDriverWait(secondDriver, 10).until(lambda driver: driver.execute_script('return document.readyState') == 'complete')
secondDriver.get(newLink)
html2 = secondDriver.page_source
subSoup = BeautifulSoup(html2, 'html.parser')
viewercontent1 = subSoup.findAll("div", {"id" : "viewercontent1"})[0]
print(viewercontent1)

It prints:

<div class="seq gbff" id="viewercontent1" sequencesize="450826" style="display: block;" val="426261815" virtualsequence=""><div class="loading">Loading ... <img alt="record loading animation" src="/core/extjs/ext-2.1/resources/images/default/grid/loading.gif"/></div></div>

The content doesn't seem to have loaded yet. I tried implicitly waiting and checking whether the content had loaded (both before and after calling .get()), but that didn't seem to do anything. I can't wait for the content to load by the ID of the element it ends up in, because the data is contained directly in a <pre></pre> element with no ID.

Any help would be greatly appreciated.


2 Answers

The sequence is loaded by JavaScript from a separate endpoint (the page embeds the record id in a meta tag), so you can fetch it directly with requests and skip Selenium entirely. To get the content of the <div>, you can use the following script:

import requests
from bs4 import BeautifulSoup


url = 'https://www.ncbi.nlm.nih.gov//nuccore/KC208619.1?report=fasta'
fasta_url = 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id={id}&report=fasta'

# the page embeds the numeric sequence id in a <meta name="ncbi_uidlist"> tag
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
id_ = soup.select_one('meta[name="ncbi_uidlist"]')['content']

# request the FASTA text directly from the sviewer endpoint
fasta_txt = requests.get(fasta_url.format(id=id_)).text

print(fasta_txt)

This prints:

>KC208619.1 Butomus umbellatus mitochondrion, complete genome
CCGCCTCTCCCCCCCCCCCCCCGCTCCGTTGTTGAAGCGGGCCCCCCCCATACTCATGAATCTGCATTCC
CAACCAAGGAGTTGTCTCATATAGACAGAGTTGGGCCCCCGTGTTCTGAGATCTTTTTCAACTTGATTAA
TAAAGAGGATTTCTCGGCCGTCTTTTTCGGCTAGGCTCCATTCGGGGTGGGTGTCCAGCTCGTCCCGCTT
CTCGTTAAAGAAATCGATAAAGGCTTCTTCGGGGGTGTAGGCGGCATTTTCCCCCAAGTGGGGATGTCGA
GAAAGCACTTCTTGAAAACGAGAATAAGCTGCGTGCTTACGTTCCCGGATTTGGAGATCCCGGTTTTCGA

...and so on.
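Once you have the FASTA text, splitting it into the description line and the raw sequence needs nothing beyond the standard library. A minimal sketch (parse_fasta is an illustrative helper name, not part of any library; the record below is truncated from the output above):

```python
def parse_fasta(fasta_txt):
    """Split a single FASTA record into (header, sequence)."""
    lines = fasta_txt.strip().splitlines()
    header = lines[0].lstrip(">")     # description line, ">" removed
    sequence = "".join(lines[1:])     # sequence lines joined into one string
    return header, sequence

record = """>KC208619.1 Butomus umbellatus mitochondrion, complete genome
CCGCCTCTCCCCCCCCCCCCCCGCTCCGTTGTTGAAGCGGGCCCCCCCCATACTCATGAATCTGCATTCC
CAACCAAGGAGTTGTCTCATATAGACAGAGTTGGGCCCCCGTGTTCTGAGATCTTTTTCAACTTGATTAA
"""

header, seq = parse_fasta(record)
print(header)   # KC208619.1 Butomus umbellatus mitochondrion, complete genome
print(len(seq))
```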

@Andrej's solution seems much simpler, but if you still want to go the waiting route:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re

driver = webdriver.Chrome()

newLink = "https://www.ncbi.nlm.nih.gov//nuccore/KC208619.1?report=fasta"
driver.get(newLink)
WebDriverWait(driver, 10).until(lambda driver: driver.execute_script('return document.readyState') == 'complete')

# explicitly wait until the sequence <pre> has been injected into the viewer div
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#viewercontent1 pre"))
)

html2 = driver.page_source
subSoup = BeautifulSoup(html2, 'html.parser')
viewercontent1 = subSoup.findAll("div", {"id" : "viewercontent1"})[0]
print(viewercontent1)
