Below is a web scraper that loops through each doctor's profile on this website and scrapes their information. The code runs without errors, but I am trying to write a for loop that scrapes the first 5 pages of doctor profiles. With my current code below, the output only prints the information shown on page 5 of the site, and I can't figure out why it isn't scraping the first 4 pages. This is my first time looping over a process like this, so I suspect the problem is in how the code fetches each page and then runs the processing step. Does anyone know how to fix this? Thanks in advance.
from bs4 import BeautifulSoup
import requests
from collections import ChainMap
pages = []
for i in range(0, 5):
    url = 'https://sportmedbc.com/practitioners?field_profile_first_name_value=&field_profile_last_name_value=&field_pract_profession_tid=All&city=&taxonomy_vocabulary_5_tid=All&page=' + str(i)
    pages.append(url)

for item in pages:
    page = requests.get(item)
    soup = BeautifulSoup(page.text, 'lxml')

def get_data(soup):
    default_data = {'name': 'n/a', 'clinic': 'n/a', 'profession': 'n/a', 'region': 'n/a', 'city': 'n/a'}
    for doctor in soup.select('.view-practitioners .practitioner'):
        doctor_data = {}
        if doctor.select_one('.practitioner__name').text.strip():
            doctor_data['name'] = doctor.select_one('.practitioner__name').text
        if doctor.select_one('.practitioner__clinic').text.strip():
            doctor_data['clinic'] = doctor.select_one('.practitioner__clinic').text
        if doctor.select_one('.practitioner__profession').text.strip():
            doctor_data['profession'] = doctor.select_one('.practitioner__profession').text
        if doctor.select_one('.practitioner__region').text.strip():
            doctor_data['region'] = doctor.select_one('.practitioner__region').text
        if doctor.select_one('.practitioner__city').text.strip():
            doctor_data['city'] = doctor.select_one('.practitioner__city').text
        yield ChainMap(doctor_data, default_data)

for doctor in get_data(soup):
    print('name:\t\t', doctor['name'])
    print('clinic:\t\t', doctor['clinic'])
    print('profession:\t', doctor['profession'])
    print('city:\t\t', doctor['city'])
    print('region:\t\t', doctor['region'])
    print('-' * 80)
The code is basically fine; the problem is that the printing loop that calls get_data() sits outside your first loop. The for item in pages: loop fetches each URL in turn and overwrites soup every time, so by the time the printing loop runs, soup only holds the last page fetched (page 5), which is why you only get page 5's results. Solution: move the parsing and printing inside the loop over pages, so every page's soup is processed as soon as it is fetched.
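A minimal restructuring of the asker's code along those lines is sketched below. It assumes the same URL pattern and CSS class names as the original; the get_data() loop over the five fields is a compacted but equivalent version of the original if-chain, and the scrape_pages() wrapper and its n_pages parameter are additions for illustration.

```python
from collections import ChainMap

import requests
from bs4 import BeautifulSoup

# Same listing URL as the original question, with the page number templated in.
BASE_URL = ('https://sportmedbc.com/practitioners'
            '?field_profile_first_name_value=&field_profile_last_name_value='
            '&field_pract_profession_tid=All&city='
            '&taxonomy_vocabulary_5_tid=All&page={}')


def get_data(soup):
    """Yield one mapping per practitioner card, falling back to 'n/a'."""
    default_data = {'name': 'n/a', 'clinic': 'n/a', 'profession': 'n/a',
                    'region': 'n/a', 'city': 'n/a'}
    for doctor in soup.select('.view-practitioners .practitioner'):
        doctor_data = {}
        # Each field's CSS class follows the pattern .practitioner__<field>.
        for field in default_data:
            cell = doctor.select_one('.practitioner__' + field)
            if cell is not None and cell.text.strip():
                doctor_data[field] = cell.text.strip()
        yield ChainMap(doctor_data, default_data)


def scrape_pages(n_pages=5):
    # Fetch, parse, AND print inside the same loop, so every page is
    # processed before soup is overwritten by the next request.
    for i in range(n_pages):
        page = requests.get(BASE_URL.format(i))
        soup = BeautifulSoup(page.text, 'lxml')
        for doctor in get_data(soup):
            print('name:\t\t', doctor['name'])
            print('clinic:\t\t', doctor['clinic'])
            print('profession:\t', doctor['profession'])
            print('city:\t\t', doctor['city'])
            print('region:\t\t', doctor['region'])
            print('-' * 80)
```

Calling scrape_pages(5) fetches and prints pages 0 through 4 in order, which is the behavior the question asks for.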