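I want to extract the Arbeitsatmosphäre rating and the Stadt (city) from the reviews on all pages of the website below. The expected output should look like this example: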
Arbeitsatmosphare | Stadt
1. 4.00 | Berlin
2. 5.00 | Frankfurt
3. 3.00 | Munich
4. 5.00 | Berlin
5. 4.00 | Berlin
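The code below extracts the "Pro" text from all pages and works well. I tried to update it to build two lists, one for the Arbeitsatmosphäre rating and one for Stadt, and to break out of the loop when the Arbeitsatmosphäre rating is missing, but my code does not work. Can you help?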
    import requests
    from bs4 import BeautifulSoup

    pro = []
    with requests.Session() as session:
        session.headers = {
            'x-requested-with': 'XMLHttpRequest'
        }
        page = 1
        while True:
            print(f"Processing page {page}..")
            url = f'https://www.kununu.com/de/volkswagen/kommentare/{page}'
            response = session.get(url)
            soup = BeautifulSoup(response.text, 'html.parser')
            # Take the paragraph that follows each "Pro" heading.
            new_comments = [
                heading.find_next_sibling('p').get_text()
                for heading in soup.find_all('h2', text='Pro')
            ]
            # Stop at the first page without any "Pro" comments.
            if not new_comments:
                print(f"No more comments. Page: {page}")
                break
            pro += new_comments
            print(pro)
            #print(len(pro))
            page += 1
    print(pro)
UPD: adding the code that does not work, although I think there should be a simpler solution.
    Arbeit = []
    Stadt = []
    with requests.Session() as session:
        session.headers = {
            'x-requested-with': 'XMLHttpRequest'
        }
        page = 1
        while True:
            print(f"Processing page {page}..")
            url = f'https://www.kununu.com/de/volkswagen/kommentare/{page}'
            response = session.get(url)
            soup = BeautifulSoup(response.text, 'html.parser')
            new_comments1 = [
                Arbeit.find_next_sibling('span').get_text()
                for Arbeit in soup.find_all('span', text='Arbeitsatmosphäre')
            ]
            new_comments2 = [
                Stadt.find_next_sibling('div').get_text()
                for Stadt in soup.find_all('div', text='Stadt')
            ]
            if not new_comments1:
                print(f"No more comments. Page: {page}")
                break
            Arbeit += new_comments1
            Stadt += new_comments2
            print(Arbeit)
            print(Stadt)
            #print(len(pro))
            page += 1
You can try:
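One possible approach, as a minimal sketch rather than a tested solution for the live site: keep each rating and its city together in one row per review instead of two separate lists, and stop paging as soon as a page yields no ratings. The real kununu markup is not shown in the question, so the `SAMPLE_PAGE` HTML below is an assumption, and `LabelValueParser`, `parse_page`, `collect` and `fake_fetch` are hypothetical names. To run without network access or extra packages, the sketch uses the standard library's `html.parser` instead of BeautifulSoup; the same "label, then the next value" idea should carry over to `find_next_sibling`.

```python
from html.parser import HTMLParser

# Assumed markup: the real page structure is not shown in the question.
SAMPLE_PAGE = """
<article>
  <span>Arbeitsatmosphäre</span><span>4.00</span>
  <div>Stadt</div><div>Berlin</div>
</article>
<article>
  <span>Arbeitsatmosphäre</span><span>5.00</span>
  <div>Stadt</div><div>Frankfurt</div>
</article>
"""

# Label text on the page mapped to the key used in each result row.
LABELS = {"Arbeitsatmosphäre": "rating", "Stadt": "city"}


class LabelValueParser(HTMLParser):
    """Collect the first non-empty text that follows each known label."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one dict per review
        self._pending = None  # key whose value is expected next

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._pending is not None:
            key, self._pending = self._pending, None
            if key == "rating":
                # Assumption: the rating comes first and starts a new review row.
                self.rows.append({"rating": text, "city": None})
            elif self.rows:
                self.rows[-1]["city"] = text
        elif text in LABELS:
            self._pending = LABELS[text]


def parse_page(html):
    """Return a list of {'rating': ..., 'city': ...} dicts for one page."""
    parser = LabelValueParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def collect(fetch_page):
    """Walk pages 1, 2, ... and stop at the first page without ratings."""
    rows, page = [], 1
    while True:
        new_rows = parse_page(fetch_page(page))
        if not new_rows:
            break
        rows += new_rows
        page += 1
    return rows


def fake_fetch(page):
    """Stand-in for session.get(url).text: only page 1 has reviews."""
    return SAMPLE_PAGE if page == 1 else "<html></html>"


for i, row in enumerate(collect(fake_fetch), start=1):
    print(f"{i}. {row['rating']} | {row['city']}")
```

To try this against the real site, `fake_fetch` would be replaced by a function that returns `session.get(url).text` for the page URL from the question, and the labels or parsing would likely need adjusting to the actual HTML. A review with a missing city keeps `None` in its row, so ratings and cities cannot drift out of step the way two independent lists can.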