Scraping data from similarly structured pages with Python

Posted 2024-05-19 10:09:47


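This is my first post here, although I have spent quite some time looking for an answer. It is a fairly simple question, and I hope someone can help.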

I want to get some data from the following URLs:
https://tracker.icon.foundation/addresses/1?count=100
https://tracker.icon.foundation/block/1

The site is updated dynamically, so I had to use Selenium together with BeautifulSoup.

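I tried to use the same code to get the information (addresses in both cases), because the two URLs look structurally identical to me. I can scrape the first one, but the second returns an empty list. I have two questions:
a) Any idea why?
b) Performance is quite slow. Is there a better way?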

Thanks

The code I use for the first URL is shown below:

from bs4 import BeautifulSoup
import requests
from selenium import webdriver

driver = webdriver.Chrome(executable_path="MYPATHFORWEBDRIVER")
driver.get('https://tracker.icon.foundation/addresses/1?count=100')
res = driver.execute_script('return document.documentElement.outerHTML')
driver.quit()

soup = BeautifulSoup(res, 'lxml')

box = soup.find('div', {'class': 'table-box'})

all_addresses = box.find_all('span', {'class': 'ellipsis'})
AddressList = []

for address in all_addresses:
    a_type = address.find('a', {'class': 'on'}).text
    AddressList.append(a_type)

print(AddressList)

2 Answers

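The data is loaded dynamically, so fetching the page with requests alone will not return it. However, we can get the data by sending a GET request to the following endpoint: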

https://tracker.icon.foundation/v3/address/list?page=<PAGE NUM>&count=100

To get the data from pages 1 to 100, we can use the range() function:

import requests

url = "https://tracker.icon.foundation/v3/address/list?page={}&count=100"

for page in range(1, 101):
    print("URL: ", url.format(page))
    response = requests.get(url.format(page)).json()
    print([data["address"] for data in response["data"]], "\n")

Output (partial):

URL:  https://tracker.icon.foundation/v3/address/list?page=1&count=100
['hxcd6f04b2a5184715ca89e523b6c823ceef2f9c3d', .... 'hx8d45deb8de633ca9d5de5fc5a64c51bccd8e9960'] 
...
...
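
The loop above always requests pages 1 through 100. A minimal sketch of a variant that stops as soon as a page comes back empty might look like the following. It assumes the response layout used above (a JSON object whose "data" key holds a list of records with an "address" field). The helper names fetch_page and collect_addresses are hypothetical, and urllib from the standard library is swapped in for requests only to keep the sketch free of third-party dependencies.

```python
import json
from urllib.request import urlopen

# Endpoint taken from the answer above; {} is replaced by the page number.
URL = "https://tracker.icon.foundation/v3/address/list?page={}&count=100"


def fetch_page(page):
    """Download one page of the address list and return the decoded JSON."""
    with urlopen(URL.format(page), timeout=10) as response:
        return json.load(response)


def collect_addresses(first_page=1, last_page=100, fetch=fetch_page):
    """Gather addresses page by page, stopping early on an empty page.

    `fetch` is injectable so the logic can be tried without network access.
    """
    addresses = []
    for page in range(first_page, last_page + 1):
        payload = fetch(page)
        # Assumption: a missing or empty "data" list means no more pages.
        records = payload.get("data") or []
        if not records:
            break
        addresses.extend(record["address"] for record in records)
    return addresses
```

Calling collect_addresses(1, 3) would fetch at most the first three pages. Passing your own fetch function (for example one built on requests) leaves the paging logic unchanged.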

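As you mentioned, the page is loaded dynamically. The problem is that you grab the HTML before the elements you are looking for have been added to the DOM.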

driver = webdriver.Chrome()
driver.get('https://tracker.icon.foundation/addresses/1?count=100')

# Wait until the element is loaded (found)
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.table-box')))

#Parse the page_source
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()
box = soup.find('div', {'class': 'table-box'})


all_addresses = box.find_all('span', {'class': 'ellipsis'})
AddressList = []

for address in all_addresses:
    a_type = address.find('a', {'class': 'on'}).text
    AddressList.append(a_type)
print(AddressList)

Output:

['hxcd6f04b2a5184715ca89e523b6c823ceef2f9c3d', 'hx9f0c84a113881f0617172df6fc61a8278eb540f5', 'hx1729b35b690d51e9944b2e94075acff986ea0675', 'hx562dc1e2c7897432c298115bc7fbcc3b9d5df294', 'hx68646780e14ee9097085f7280ab137c3633b4b5f', 'hx0cc3a3d55ed55df7c8eee926a4fafb5412d0cca4', 'hx9d9ad1bc19319bd5cdb5516773c0e376db83b644', 'hxa9c54005bfa47bb8c3ff0d8adb5ddaac141556a3', 'hxc1481b2459afdbbde302ab528665b8603f7014dc', 'hx3f945d146a87552487ad70a050eebfa2564e8e5c', 'hx6b38701ddc411e6f4e84a04f6abade7661a207e2', 'hx7062a97bed64624846f3134fdab3fb856dce7075', 'hx8913f49afe7f01ff0d7318b98f7b4ae9d3cd0d61', 'hx980ab0c7473013f656339795a1c63bf44898ce95', 'hxbc2f530a7cb6170daae5876fd24d5d81170b93fe', 'hxc17ff524858dd51722367c5b04770936a78818de', 'hxfc7888bf63d45df125cf567fd8753c05facb3d12', 'hxd3b53e10d8c4c755879be09ff9ba975069664b7a', 'hxdf6bd350edae21f84e0a12392c17eac7e04817e7', 'hx9e19d60c9d6a0ecc2bcace688eff9053622c0c4c', 'hxa527f96d8b988a31083167f368105fc0f2ab1143', 'hx1000000000000000000000000000000000000000', 'hx294c5d0699615fc8d92abfe464a2601612d11bf7', 'hxd42f6e3abfb7f5b14dbdafa34f03ffecf2a53a92', 'hxe322ab9b11b63c89b85b9bc7b23350b1d6604595', 'hx87b6da94535754c2baee9d69010eb1b91eaa4c37', 'hx58b2592941f61f97c7a8bed9f84c543f12099239', 'hx8d6aa6dce658688c76341b7f70a56dce5361e7ef', 'hx930bb66751f476babc2d49901cf77429c5cf05c1', 'hx39f2636582cee00b72586a2f74dc6028c0f0213f', 'hxd9fb974459fe46eb9d5a7c438f17ae6e75c0f2d1', 'hx49c5c7eead084999342dd6b0656bc98fa103b185', 'hx76dcc464a27d74ca7798dd789d2e1da8193219b4', 'hxc05ec08b6446a2a16b64eb19b96ea02225b840ab', 'hxe295a8dc5d6c29109bc402e59394d94cf600562e', 'hxf6f5f2583a0821f281fe7d35b013b9389daf2aaa', 'hxb6a65d0e7d5c1c0150310287e97c612a8ac825eb', 'hx6d14b2b77a9e73c5d5804d43c7e3c3416648ae3d', 'hxeb00139ddd1fa4507d3158e46e22c4ab7e8b9202', 'hx538de7e0fc0d312aa82549aa9e4daecc7fabcce9', 'hx6eb81220f547087b82e5a3de175a5dc0d854a3cd', 'hxc4193cda4a75526bf50896ec242d6713bb6b02a3', 'hx266c053380ad84224ea64ab4fa05541dccc56f5f', 
'hx476455c56c64fea4f425bd62f0d2d3ab8cdcace0', 'hx85532472e789802a943bd34a8aeb86668bc23265', 'hx56ef2fa4ebd736c5565967197194da14d3af88ca', 'hx5d0409cabaacd0f1ef22d32f41a30649ee990103', 'hxbf90314546bbc3ed980454c9e2a9766160389302', 'hxe07878b53679ba1278d0aab1dac7646d8898d344', 'hx96505aac67c4f9033e4bac47397d760f121bcc44', 'hx18d14ad97ab2903dfc246ccc4e5631a3a1e13141', 'hx39f24d1d23b710b9d6ef6f56acfb3022deed8f4d', 'hxc574629fa3d1cc846611f1ab91d504ad7fc35413', 'hxd7a34c15c2345d9f0891545350181c7b162d9e08', 'hx314348ecbaf01ff6c65c2877a6c593a5facecb35', 'hx23cb1d823ef96ac22ae30c986a78bdbf3da976df', 'hx307c01535bfd1fb86b3b309925ae6970680eb30d', 'hx206846faa5ba46a3717e7349bff9c20d6fe1bba3', 'hx161665fb51075d37cee2c98c2eacbc94d207a58b', 'hxaf3a561e3888a2b497941e464f82fd4456db3ebf', 'hxa55446e81997c03ee856a58ee18432325a4ef924', 'hx6607ef84572ac9bdb1ebbcf49bd0cddaf8903b8e', 'hx37f86d0a4e1e2fada3ca724a401037c83e0a670e', 'hxc4bb0a9c4b3e9e5953262aa7cb940dbecb568ff6', 'hxc39a4c8438abbcb6b49de4691f07ee9b24968a1b', 'hxc35567f68c43bd98b020e5f0fab69ca21edb2726', 'hxc0fc3fca32bddba77f372c69c5998c1da81d531d', 'hx25c5dace83bceae42c11360a07c9e42a3b5c6122', 'hx387f3016ee2e5fb95f2feb5ba36b0578d5a4b8cf', 'hx94a7cd360a40cbf39e92ac91195c2ee3c81940a6', 'hx748717b9f846120033ba44e986722d82ef710afb', 'hxebfc6198ce53846b2e5d4ef31d6d13d0fa951c01', 'hx4602589eb91cf99b27296e5bd712387a23dd8ce5', 'hxa67e30ec59e73b9e15c7f2c4ddc42a13b44b2097', 'hx558b4cd8cd7c25fa25e3109414bb385e3c369660', 'hxad2bc6446ee3ae23228889d21f1871ed182ca2ca', 'hx777ee46b7b9d90a26388fbcdeafc742f5f217af7', 'hx10ed7a7065d920e146c86d3915491f5a67248647', 'hx33fc29d457e7815fe6c7cec4304500d5214fdac1', 'hx1494e29d38aea1a8e39fb40663c37f714bffb9df', 'hx9db3998119addefc2b34eaf408f27ab8103edaef', 'hxa91a8cd8141d192f78540d092d912456fb81d281', 'hx0b047c751658f7ce1b2595da34d57a0e7dad357d', 'hxc0b37fc42b52bf467720cf362c87c650ae3a7915', 'hx1216aa95cf2aea0a387b7c243412022f3d7cf29f', 'hx4c7c152410d4defd66da1d6c399c01de0bc295a5', 
'hx4d83813703f81cdb85f952a1d1ee736faf732655', 'hx6d2240f9c5fd0db5df6977ee586c3d74e1b1e4aa', 'hxaafc8af9559d5d320745345ec006b0b2170194aa', 'hxabdde23cda5b425e71907515940a8f23e29a3134', 'hx6c22cdba886614d3173e3d2499dc1597bdb57f2c', 'hxd8ba6317da2eec0d9d7d1feed4c9c1f3cf358ae1', 'hxdd4bc4937923dc140adba57916e3559d039f4203', 'hx27664cffd284b8cf488eefb7880f55ce82f42297', 'hx8243caf740598c2b8d57c177adc999da68ea46bf', 'hx230fdd2eb239d5f590c7ff2d94a32b8899fc0628', 'hxd3f062437b70ab6d6a5f21b208ede64973f70567', 'hxc41d0b098fe9b86b241dfbfa9741f00975db2847', 'hxed97cf0deba62ffe843262f7f4596e155e8ea0b9', 'hx8d45deb8de633ca9d5de5fc5a64c51bccd8e9960']

Note: you will need the following additional imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
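
The explicit wait is what makes the difference here: WebDriverWait(...).until(...) evaluates the condition repeatedly until it returns a truthy value or the timeout expires, in which case Selenium raises TimeoutException. As a rough illustration of that polling idea only, and not Selenium's actual implementation, a plain-Python sketch might look like this. The name wait_until and its parameters are hypothetical.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        # Sleep briefly so the loop does not spin at full speed.
        time.sleep(poll_interval)
```

In the scraping code above, the condition corresponds to "an element matching .table-box is present", which is what EC.presence_of_element_located checks.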

As for your other question about performance, you could use a headless browser, which should speed things up a little:

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
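
How much headless mode helps depends on the machine and the page, so it may be worth measuring. The following is a small timing sketch using only the standard library. The helper time_call is hypothetical, and the callables passed to it (for example one function per scraping approach) are placeholders for your own code.

```python
import time


def time_call(func, repeats=3):
    """Run `func` `repeats` times and return the average duration in seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)
```

Comparing the average for a function that drives the browser with one that drives it headless, or with one that calls the JSON endpoint directly, gives a rough idea of which approach is fastest in your environment.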
