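<p>You didn't mention what information you actually want to scrape, so the alternative solution I suggest below can only help so much. If you elaborate and let me know what information you're trying to get, I can tailor my solution.</p>
<p>Logging one's network traffic (while viewing the page in a browser) reveals that multiple XHR (XMLHttpRequest) HTTP GET requests are made to various REST API endpoints. Their responses are JSON and contain all the information you're likely to want to scrape.</p>
<p>My suggestion is to simply imitate those HTTP GET requests to the necessary REST API endpoints. No Selenium required:</p>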
<pre><code>import sys

import requests


def get_country_id(country_name):
    url = "https://www.transfermarkt.com/quickselect/countries"
    headers = {
        "user-agent": "Mozilla/5.0"
    }

    response = requests.get(url, headers=headers)
    response.raise_for_status()

    # Return the ID of the first country whose name matches, or None
    return next((country["id"] for country in response.json() if country["name"] == country_name), None)


def get_competitions(country_id):
    url = "https://www.transfermarkt.com/quickselect/competitions/{}".format(country_id)
    headers = {
        "user-agent": "Mozilla/5.0"
    }

    response = requests.get(url, headers=headers)
    response.raise_for_status()

    return response.json()


def main():
    country_name = "Iceland"
    country_id = get_country_id(country_name)
    assert country_id is not None

    print("Competitions in {}:".format(country_name))
    for competition in get_competitions(country_id):
        print(competition["name"])

    return 0


if __name__ == "__main__":
    sys.exit(main())
</code></pre>
<p>Output:</p>
<pre><code>Competitions in Iceland:
Pepsi Max deild
Lengjudeild
Mjólkurbikarinn
Lengjubikarinn
</code></pre>
<hr/>
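<p>EDIT - Unfortunately, the table data you're trying to scrape does not come from an API. It is baked directly into the page's HTML. You still don't need Selenium for this, though; BeautifulSoup is good enough:</p>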
<pre><code>import sys
from csv import DictWriter
from operator import attrgetter

import requests
from bs4 import BeautifulSoup as Soup


def get_entries():
    url = "https://www.transfermarkt.com/premierleague/startseite/wettbewerb/GB1/plus/"
    params = {
        "saison_id": "2019"
    }
    headers = {
        "user-agent": "Mozilla/5.0"
    }

    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()

    soup = Soup(response.content, "html.parser")
    table = soup.find("table", {"class": "items"})
    assert table is not None

    # Get text from header cells whose class does not contain the substring "hide"
    fieldnames = list(map(attrgetter("text"), table.select("thead > tr > th:not([class*=\"hide\"])")))
    # The first item yielded is the list of field names; every later item is one row
    yield fieldnames

    for row in table.select("tbody > tr"):
        # Assuming the first column will always be an img
        columns = list(map(attrgetter("text"), row.select("td:not([class*=\"hide\"])")[1:]))
        yield dict(zip(fieldnames, columns))


def main():
    entries = get_entries()
    fieldnames = next(entries)

    with open("output.csv", "w", newline="") as file:
        writer = DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        for entry in entries:
            writer.writerow(entry)

    return 0


if __name__ == "__main__":
    sys.exit(main())
</code></pre>
<p>CSV output:</p>
<pre><code>club,Squad,Total MV,ø MV
Man City,34,€1.27bn,€37.46m
Liverpool,56,€1.09bn,€19.53m
Spurs,36,€1.04bn,€28.94m
Chelsea,36,€797.00m,€22.14m
Man Utd,43,€775.20m,€18.03m
Arsenal,38,€680.55m,€17.91m
Everton,35,€525.50m,€15.01m
Leicester,32,€384.75m,€12.02m
West Ham,38,€371.75m,€9.78m
Wolves,44,€315.40m,€7.17m
Newcastle,41,€312.58m,€7.62m
Bournemouth,39,€311.20m,€7.98m
Watford,43,€270.65m,€6.29m
Southampton,36,€259.80m,€7.22m
Crystal Palace,33,€248.65m,€7.53m
Brighton,45,€225.83m,€5.02m
Burnley,35,€205.58m,€5.87m
Aston Villa,38,€184.60m,€4.86m
Norwich,38,€110.85m,€2.92m
Sheff Utd,34,€110.80m,€3.26m
</code></pre>
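<p>A real solution would probably combine requests to the REST API with scraping the table data via BeautifulSoup: you would iterate over every country, every competition in that country, and every year for that competition. The updated code I posted assumes that we are only interested in the competition with ID <code>GB1</code> (in Great Britain), and only for the year 2019.</p>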
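<p>As a rough sketch of that outer loop (not run against the live site): the generator below walks every country, every competition in it, and every season, and yields one job per combination. The names <code>iter_jobs</code> and <code>competitions_for</code>, and the <code>"id"</code> key on competition entries, are my assumptions; the code above only reads <code>"name"</code> from a competition. The lookup function is passed in as an argument, so <code>get_competitions</code> from the first snippet could be plugged in, or a stub as shown here.</p>

```python
def iter_jobs(countries, competitions_for, seasons):
    """Yield (country_name, competition_id, season) for every combination.

    countries        -- iterable of dicts with "id" and "name" keys
    competitions_for -- callable mapping a country ID to a list of dicts;
                        each dict is ASSUMED to carry an "id" key
    seasons          -- iterable of season identifiers, e.g. years
    """
    seasons = list(seasons)  # allow the seasons to be reused for each competition
    for country in countries:
        for competition in competitions_for(country["id"]):
            for season in seasons:
                yield country["name"], competition["id"], season


if __name__ == "__main__":
    # Placeholder data standing in for the JSON returned by the endpoints
    countries = [{"id": "1", "name": "Exampleland"}]

    def stub_competitions(country_id):
        return [{"id": "EX1", "name": "Example League"}]

    for job in iter_jobs(countries, stub_competitions, [2018, 2019]):
        print(job)
```

<p>Each yielded job would then be handed to a scraping function like <code>get_entries</code>, with the competition and season filled in instead of being hard-coded.</p>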
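<p>EDIT - You'll have to tweak my solution a bit. I only keep the columns whose class does not contain the substring "hide", but it turns out that some of the hidden columns are important (the <code>age</code> column, for example).</p>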
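<p>One possible way to approach that tweak, as a sketch only: extract every header and cell without the <code>:not([class*="hide"])</code> filter, then keep columns by header name rather than by CSS class. The helper <code>pick_columns</code> below is hypothetical, works on plain lists so it can be tried without any network access, and assumes that header names and row cells line up by position, which you would need to verify against the actual table.</p>

```python
def pick_columns(fieldnames, rows, wanted):
    """Yield one dict per row, keeping only the columns named in `wanted`.

    fieldnames -- header texts, in table order
    rows       -- iterable of cell-text lists, aligned with `fieldnames`
    wanted     -- header names to keep, in the desired output order
    """
    index = {name: position for position, name in enumerate(fieldnames)}

    # Fail early if a requested column is not present in the header
    missing = [name for name in wanted if name not in index]
    if missing:
        raise KeyError("columns not found: {}".format(", ".join(missing)))

    for row in rows:
        yield {name: row[index[name]] for name in wanted}


if __name__ == "__main__":
    # Placeholder values, not real data from the site
    fieldnames = ["club", "Squad", "age", "Total MV"]
    rows = [["Club A", "30", "25.0", "100m"]]

    for entry in pick_columns(fieldnames, rows, ["club", "age"]):
        print(entry)
```

<p>The resulting dicts can be written with <code>DictWriter</code> exactly as before, passing the same list of wanted names as <code>fieldnames</code>.</p>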