How to scrape data from an election website that uses an unusual table

Posted 2024-09-27 00:21:11


I'm trying to scrape some data from an election website, but I can't figure out how to extract it with BeautifulSoup.

Texas election results: https://results.texas-election.com/contestdetails?officeID=1001&officeName=PRESIDENT%2FVICE-PRESIDENT&officeType=FEDERAL%20OFFICES&from=race

The code I tried:

import pandas as pd
from bs4 import BeautifulSoup

tx_url = 'https://results.texas-election.com/contestdetails?officeID=1001&officeName=PRESIDENT%2FVICE-PRESIDENT&officeType=FEDERAL%20OFFICES&from=race'


import urllib.request
local_filename, headers = urllib.request.urlretrieve(tx_url)

urllib.error.HTTPError: HTTP Error 403: Forbidden

soup = BeautifulSoup(tx_url)

/home/server/pi/homes/woodilla/.conda/envs/baseDS_env/lib/python3.7/site-packages/bs4/__init__.py:357: UserWarning: "https://results.texas-election.com/contestdetails?officeID=1001&officeName=PRESIDENT%2FVICE-PRESIDENT&officeType=FEDERAL%20OFFICES&from=race" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
  ' that document to Beautiful Soup.' % decoded_markup

This is what the table looks like:

[Image: screenshot of the results table]


1 Answer

Answer #1, posted 2024-09-27 00:21:11

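First, the warning you're getting means you're using BeautifulSoup incorrectly: you passed it the URL string itself rather than the document behind that URL.

You need to fetch the page with an HTTP client and pass the response body to BeautifulSoup, like this: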

import requests
from bs4 import BeautifulSoup

url = "https://results.texas-election.com/races"

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36",
}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
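If you'd rather stay with urllib, as in the question, the same idea can be sketched with the standard library alone. This is only a sketch: the helper names `build_request` and `fetch_text` are made up for illustration, UTF-8 is an assumed encoding, and whether the server stops answering 403 once a browser-like user-agent is sent is an assumption I haven't verified.

```python
import urllib.request

# Browser-like user-agent string, copied from the requests example above.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"
)


def build_request(url):
    """Return a urllib Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_text(url, timeout=30):
    """Download `url` and decode the body as UTF-8 (assumed encoding)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

The string returned by `fetch_text(url)` is what you would hand to `BeautifulSoup(..., "html.parser")`, instead of the URL itself.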

Second, you don't need BeautifulSoup to scrape that page at all. All the data comes back as JSON. For example:

import requests

url = "https://results.texas-election.com/static/data/election/44146/246/Federal.json"

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36",
}
response = requests.get(url, headers=headers).json()

for race in response["Races"]:
    print(f"Results for {race['N']}")
    for candidate in race["Candidates"]:
        print(f"{candidate['N']} - {candidate['P']}: Votes {candidate['V']} - {candidate['PE']}%")
    print(f"Total votes: {race['T']}")
    print("-" * 80)

Output:

PRESIDENT/VICE-PRESIDENT
ROQUE "ROCKY" DE LA FUENTE GUERRA - REP: Votes 7563 - 0.37%
BOB ELY - REP: Votes 3582 - 0.18%
ZOLTAN G. ISTVAN - REP: Votes 1447 - 0.07%
MATTHEW JOHN MATERN - REP: Votes 3512 - 0.17%
DONALD J. TRUMP (I) - REP: Votes 1898664 - 94.13%
JOE WALSH - REP: Votes 14772 - 0.73%
BILL WELD - REP: Votes 15824 - 0.78%
UNCOMMITTED - REP: Votes 71803 - 3.56%
Total votes: 2017167
                                        
U. S.  SENATOR
VIRGIL BIERSCHWALE - REP: Votes 20494 - 1.06%
JOHN ANTHONY CASTRO - REP: Votes 86916 - 4.49%
JOHN CORNYN (I) - REP: Votes 1470669 - 76.04%
DWAYNE STOVALL - REP: Votes 231104 - 11.95%
MARK YANCEY - REP: Votes 124864 - 6.46%
Total votes: 1934047
                                        
U. S. REPRESENTATIVE DISTRICT 1
JOHNATHAN KYLE DAVIDSON - REP: Votes 9659 - 10.33%
LOUIE GOHMERT (I) - REP: Votes 83887 - 89.67%
Total votes: 93546
                                        
and so on ...
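If you want the results as a table rather than printed text, the nested JSON can be flattened into one row per candidate. The sketch below assumes the layout used by the loop above (a "Races" list whose items have "N" and "Candidates", with "N", "P", "V" and "PE" on each candidate). The function names and the small `sample` dict are made up for illustration, with values copied from the output above; the real value types may differ.

```python
import csv
import io

FIELDS = ["race", "candidate", "party", "votes", "percent"]


def flatten_races(data):
    """Turn the nested races JSON into a flat list of one dict per candidate."""
    rows = []
    for race in data["Races"]:
        for candidate in race["Candidates"]:
            rows.append({
                "race": race["N"],
                "candidate": candidate["N"],
                "party": candidate["P"],
                "votes": candidate["V"],
                "percent": candidate["PE"],
            })
    return rows


def rows_to_csv(rows):
    """Serialize the flat rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical stand-in for the downloaded JSON, built from the output above.
sample = {
    "Races": [
        {
            "N": "U. S. REPRESENTATIVE DISTRICT 1",
            "T": 93546,
            "Candidates": [
                {"N": "JOHNATHAN KYLE DAVIDSON", "P": "REP", "V": 9659, "PE": 10.33},
                {"N": "LOUIE GOHMERT (I)", "P": "REP", "V": 83887, "PE": 89.67},
            ],
        }
    ]
}

rows = flatten_races(sample)
csv_text = rows_to_csv(rows)
```

Since the question imports pandas, the same list of dicts can also be passed straight to `pd.DataFrame(rows)`.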

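Edit:

To get the data for the specific URL you mentioned, just use this:

Note: the output shown below is only a small part of the data, because the JSON is huge. I've added code that dumps the entire JSON to a file so you can parse it however you like.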

import json

import requests

url = "https://results.texas-election.com/static/data/election/44144/108/County.json"

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36",
}
response = requests.get(url, headers=headers).json()


with open("county_results.json", "w") as output:
    json.dump(response, output, indent=4, sort_keys=True)

for v in response.values():
    for id_, race_data in v["Races"].items():
        print(race_data["C"])

Sample output:

{'4250': {'id': 4250, 'N': 'KEN WISE (I)', 'P': 'REP', 'V': 0, 'PE': 0.0, 'C': '#E30202', 'O': 1, 'EV': 0}, '6015': {'id': 6015, 'N': 'TAMIKA "TAMI" CRAFT', 'P': 'DEM', 'V': 0, 'PE': 0.0, 'C': '#007BBD', 'O': 2, 'EV': 0}}
{'2966': {'id': 2966, 'N': 'BRENDA MULLINIX (I)', 'P': 'REP', 'V': 0, 'PE': 0.0, 'C': '#E30202', 'O': 1, 'EV': 0}, '6224': {'id': 6224, 'N': 'JANET BUENING HEPPARD', 'P': 'DEM', 'V': 0, 'PE': 0.0, 'C': '#007BBD', 'O': 2, 'EV': 0}}
{'2967': {'id': 2967, 'N': 'MAGGIE JARAMILLO (I)', 'P': 'REP', 'V': 0, 'PE': 0.0, 'C': '#E30202', 'O': 1, 'EV': 0}, '3708': {'id': 3708, 'N': 'TAMEIKA CARTER', 'P': 'DEM', 'V': 0, 'PE': 0.0, 'C': '#007BBD', 'O': 2, 'EV': 0}}
and much, much more...
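To get from those nested dicts to something easier to work with, you could walk the structure and yield one record per candidate. This sketch assumes the layout implied by the loop and sample output above (top-level dict, then "Races", then "C" keyed by candidate id). The function name `iter_candidates` and the keys "county-1" and "race-1" in `sample` are hypothetical placeholders, since the real top-level and race keys aren't shown here.

```python
def iter_candidates(data):
    """Yield one flat record per candidate from the county-level JSON."""
    for county_key, county in data.items():
        for race_id, race in county["Races"].items():
            for candidate in race["C"].values():
                yield {
                    "county_key": county_key,
                    "race_id": race_id,
                    "candidate": candidate["N"],
                    "party": candidate["P"],
                    "votes": candidate["V"],
                }


# Hypothetical stand-in for the downloaded JSON; the candidate entries are
# copied from the sample output, the outer keys are invented placeholders.
sample = {
    "county-1": {
        "Races": {
            "race-1": {
                "C": {
                    "4250": {"id": 4250, "N": "KEN WISE (I)", "P": "REP",
                             "V": 0, "PE": 0.0, "C": "#E30202", "O": 1, "EV": 0},
                    "6015": {"id": 6015, "N": 'TAMIKA "TAMI" CRAFT', "P": "DEM",
                             "V": 0, "PE": 0.0, "C": "#007BBD", "O": 2, "EV": 0},
                }
            }
        }
    }
}

records = list(iter_candidates(sample))
```

With the real data you would call `iter_candidates(response)` on the parsed JSON from the request above.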

How I found the JSON:

I inspected the Network tab of my browser's developer tools. :)
