I'm trying to pull a small set of information from one site: fetch the data and store it in a CSV dataset. The project: a list of contact data for the community services and official help desks of towns and villages, roughly 1600 records.
Base site: https://www.service-bw.de/web/guest/trefferliste/-/trefferliste/q-rathaus
Detail page: Rathaus [Gemeinde Grünkraut] https://www.service-bw.de/web/guest/organisationseinheit/-/sbw-oe/Rathaus-6000566-organisationseinheit-0
Note: there are about 1600 pages, so one of the main questions is how to feed them all to the program, i.e. how to loop over every page that contains data.
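The pagination scheme I'm working from: page 1 is the bare search URL, and later pages append a `page/<n>/` suffix. A minimal sketch of building the URL list under that assumption (the last-page number is a placeholder here; in practice it would have to be read from the result list itself):

```python
# Build the list of search-result URLs to crawl.
# Assumption: the first page has no suffix, later pages use "page/<n>/".
BASE = "https://www.service-bw.de/web/guest/trefferliste/-/trefferliste/q-rathaus{}"


def page_urls(last_page):
    """Return one URL per result page, the unsuffixed first page included."""
    suffixes = [""] + [f"page/{n}/" for n in range(2, last_page + 1)]
    return [BASE.format(s) for s in suffixes]


urls = page_urls(3)
print(urls[1])  # the second result page, ending in "page/2/"
```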
```html
<div class="sp-m-organisationseinheitDetails-basisInfos-content sp-l-grid-container">
  <div class="sp-l-grid-row">
    <div class="sp-l-grid-col-md-6 sp-l-grid-col-sm-6 sp-l-grid-xs-col-12">
      <div>
        <div itemprop="address" itemscope="itemscope" itemtype="http://schema.org/PostalAddress" class="sp-m-organisationseinheitDetails-basisInfos-addressBlock">
          <h4 class="sp-m-organisationseinheitDetails-basisInfos-detailsTitle mdash">Hausanschrift</h4>
          <div itemprop="streetAddress"> <span>Scherzachstr.</span> <span>2</span><br>
```
Desired output:
Hausanschrift:
- name
- street & housenumber
- postal code & town
Kontaktmöglichkeiten:
- telefon
- fax
- e-mail
- internet
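Since the address block is marked up with schema.org `itemprop` attributes (see the HTML above), the fields can be located by attribute rather than by the long CSS class names. A stdlib-only sketch of that idea (the real script uses BeautifulSoup; this collector is deliberately naive and never resets on closing tags):

```python
from html.parser import HTMLParser


class AddressExtractor(HTMLParser):
    """Collect text under elements that carry an itemprop attribute."""

    def __init__(self):
        super().__init__()
        self._current = None  # itemprop we are currently inside (naive: never cleared)
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        itemprop = dict(attrs).get("itemprop")
        if itemprop and itemprop != "address":  # "address" is just the container
            self._current = itemprop
            self.fields.setdefault(itemprop, [])

    def handle_data(self, data):
        if self._current and data.strip():
            self.fields[self._current].append(data.strip())


snippet = (
    '<div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">'
    '<h4>Hausanschrift</h4>'
    '<div itemprop="streetAddress"> <span>Scherzachstr.</span> <span>2</span><br></div>'
    '</div>'
)
p = AddressExtractor()
p.feed(snippet)
print(p.fields)  # {'streetAddress': ['Scherzachstr.', '2']}
```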
See the information block shown in the image; it appears in each one of the 1600+ records:
My approach:
```python
import requests
from bs4 import BeautifulSoup
from concurrent.futures.thread import ThreadPoolExecutor

url = "https://www.service-bw.de/web/guest/trefferliste/-/trefferliste/q-rathaus{}"


def main(url, num):
    with requests.Session() as req:
        print(f"Collecting Page# {num}")
        r = req.get(url.format(num))
        soup = BeautifulSoup(r.content, 'html.parser')
        link = [item.get("href")
                for item in soup.findAll("a", rel="bookmark")]
        return set(link)


with ThreadPoolExecutor(max_workers=20) as executor:
    futures = [executor.submit(main, url, num)
               for num in [""] + [f"page/{x}/" for x in range(2, 50)]]

allin = []
for future in futures:
    allin.extend(future.result())

soup = BeautifulSoup(r.content, 'html.parser')
    target = [item.get_text(strip=True, separator=" ") for item in soup.find(
        "h4", class_="sp-m-organisationseinheitDetails-basisInfos-content sp-l-grid-container").find_next("ul").findAll("dd itemprop")[:8]]
    head = [soup.find("h4", class_="plugin-title").text]
    new = [x for x in target if x.startswith(
        ("Telefon", "Fax", "E-Mail", "Internet"))]
    return head + new

with ThreadPoolExecutor(max_workers=50) as executor1:
    futures1 = [executor1.submit(parser, url) for url in allin]
    for future in futures1:
        print(future.result())
```
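The end goal stated at the top is a CSV dataset, so the final print loop would eventually be replaced by `csv.writer`. A stdlib sketch, assuming each parsed record arrives as a list of strings (the header names are my own assumption, not something the site provides):

```python
import csv


def write_records(path, records):
    """Write one parsed record per CSV row; each record is a list of strings."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Assumed column order -- adjust to match what the parser returns.
        writer.writerow(["name", "telefon", "fax", "e-mail", "internet"])
        writer.writerows(records)


write_records("rathaus.csv", [["Rathaus Grünkraut", "Telefon ...", "Fax ...", "", ""]])
```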
By the way, maybe later we can pull even more (extra content), but for now I'm trying to get a basic grasp of how to fetch the pages and parse them in general.
Where I'm stuck: I get this error:

```
File "C:\Users\Kasper\Documents_f_s_j_mk__dev_\bs\bw.py", line 28
    target = [item.get_text(strip=True, separator=" ") for item in soup.find(
    ^
IndentationError: unexpected indent
[Finished in 0.32s]
```
Apart from that, though, I suspect the code as a whole runs fine and grabs all the required items.
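My current guess: the lines after `allin` were meant to be the body of a `parser` function (the second pool calls `executor.submit(parser, url)`), but the `def parser(url):` line and the request that produces `r` are missing, which is exactly what the `IndentationError` complains about. A sketch of the wrapper I think is needed (the `parse_detail` helper is my own naming; I've also pointed the container lookup at the `div` that actually carries that long class in the HTML above, and replaced `findAll("dd itemprop")`, which is not a valid tag name, with a plain `dd` lookup):

```python
import requests
from bs4 import BeautifulSoup

# Class taken from the detail-page HTML snippet above; note it sits on a
# <div>, not an <h4>.
DETAIL_CLASS = ("sp-m-organisationseinheitDetails-basisInfos-content "
                "sp-l-grid-container")


def parse_detail(html):
    """Pull the record heading plus the contact lines out of one detail page."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_=DETAIL_CLASS)
    # "dd itemprop" is not a valid tag name, so look up plain <dd> elements.
    target = [dd.get_text(strip=True, separator=" ")
              for dd in container.find_all("dd")[:8]]
    head = [soup.find("h4", class_="plugin-title").get_text(strip=True)]
    contacts = [x for x in target
                if x.startswith(("Telefon", "Fax", "E-Mail", "Internet"))]
    return head + contacts


def parser(url):
    """The callable executor.submit(parser, url) expects: fetch, then parse."""
    with requests.Session() as req:
        return parse_detail(req.get(url).content)
```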
Maybe you can give me a hint and some guidance. Thanks in advance.