How can I make the arrays of data scraped from multiple pages the same length in Python?

Posted 2024-10-01 17:22:28


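I am trying to apply a single data point to multiple rows of a DataFrame (for example, cst_n and vv1 below), so that each page-level value is repeated on every candidate row in the Excel output.

[Image: desired Excel output]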

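My code is meant to scrape election results from multiple pages of a public government database. Each page it visits has a different amount of data (for example, page 1 has 5 candidates and page 2 has 9). I tried to handle this by multiplying the cst_n and vv1 values by the length of pty_n inside the for loop. I am not sure why I keep getting "ValueError: arrays must all be same length" when I run this code: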

import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep
from random import randint

constituencies = []
candidates = []
partynames = []
votes = []
partyvoteshare = []
totalvotes = []

for page in range(100,326):

    page = requests.get("https://results.aec.gov.au/24310/Website/HouseDivisionPage-24310-" + str(page) + ".htm", verify=False)
    page.encoding = page.apparent_encoding
    if not page:
        pass
    else:
        soup = BeautifulSoup(page.text, 'html.parser')
        aust_tbody = soup.find_all('tbody')
        sleep(randint(2,10))

        for container in aust_tbody:

            #### CANDIDATES ####
            can = container.find_all('td', {'headers':'fpCan'})
            for data in can:
                can2 = str(data.get_text())
                candidates.append(can2)

            #### PARTY NAMES ####
            partyn = container.find_all('td', {'headers':'fpPty'})
            for data in partyn:
                partyn2 = str(data.get_text())
                partynames.append(partyn2)

            #### VOTES ####
            votec = container.find_all('td', {'headers':'fpVot'}, class_='row-right')
            for data in votec:
                votec2 = str(data.get_text())
                votes.append(votec2)

            #### PARTY VOTE SHARE ####
            ptysh = container.find_all('td', {'headers':'fpPct'}, class_='row-right')
            for data in ptysh:
                ptysh2 = str(data.get_text())
                partyvoteshare.append(ptysh2)

            #### TOTAL VOTES ####
            for location in container.find_all('tr',class_='total'):
                finvotes = location.find('td', {'headers':'fpVot'}, class_='row-right')
                for data in finvotes:
                    fvot = str(data.get_text())
                    fvot2 = [fvot]
                    fvot3 = fvot2 * len(partyn)
                    votes.append(fvot3)

            #### CONSTITUENCY NAMES ####
            constit = soup.find('h1',id_='StandardHeading')
            if constit is not None:
                constit = constit.get_text()
            else:
                constit = "N/A"
            constit_list = [constit]
            constit_list2 = constit_list * len(partyn)
            constituencies.append(constit_list2)

aust19 = pd.DataFrame({
    'cst_n': constituencies,
    'can': candidates,
    'pty_n': partynames,
    'pv1': votes,
    'pvs1': partyvoteshare,
    'vv1': totalvotes
})
print(aust19)
aust19.to_csv('aust19.csv')

Can anyone help with the parts of my code marked with #### comments? Many thanks.


1 answer
User
#1 · posted 2024-10-01 17:22:28

The lists you are collecting end up with different lengths. Build the data row by row instead. Try this:

from simplified_scrapy import SimplifiedDoc, utils, req

html = req.get('https://results.aec.gov.au/24310/Website/HouseDivisionPage-24310-101.htm')
doc = SimplifiedDoc(html)
table = doc.select('table#fp').getTable()
print(table)

utils.save2csv('data_2.csv', table)


# Or
table = doc.select('table#fp').trs.children.text
table = doc.select('table#fp').selects('tr').children.text
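
For readers who want to keep the pandas approach from the question, here is a minimal sketch of the same row-by-row idea using only the standard library. Two things in the question's code make the lists uneven: list.append adds a whole list as one element (list.extend would add its items), and the repeated totals are appended to votes, so totalvotes stays empty. The pages data and the build_rows helper below are hypothetical stand-ins for the values the scraper would extract, not part of the original code.

```python
# append vs. extend: why the repeated values did not line up.
nested = []
nested.append(["Division A"] * 2)   # adds ONE element (a nested list)
flat = []
flat.extend(["Division A"] * 2)     # adds two separate elements

# Hypothetical parsed results for two pages with different candidate counts.
# In the real scraper these values would come from the BeautifulSoup lookups.
pages = [
    {"constituency": "Division A", "total_votes": "1,000",
     "candidates": [("Alice", "Party X", "600", "60.00"),
                    ("Bob", "Party Y", "400", "40.00")]},
    {"constituency": "Division B", "total_votes": "900",
     "candidates": [("Carol", "Party X", "300", "33.33"),
                    ("Dan", "Party Y", "300", "33.33"),
                    ("Eve", "Party Z", "300", "33.33")]},
]


def build_rows(pages):
    """Return one dict per candidate, repeating the page-level values."""
    rows = []
    for page in pages:
        for can, pty_n, pv1, pvs1 in page["candidates"]:
            rows.append({
                "cst_n": page["constituency"],   # repeated on every row
                "can": can,
                "pty_n": pty_n,
                "pv1": pv1,
                "pvs1": pvs1,
                "vv1": page["total_votes"],      # repeated on every row
            })
    return rows


rows = build_rows(pages)
print(len(nested), len(flat), len(rows))
```

Because every row is a complete dict, the columns cannot drift out of step however many candidates a page has. A list of dicts like rows can be passed straight to pd.DataFrame(rows) and then written out with to_csv.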
