Dataframe from two classes


I want to extract the Arbeitsatmosphäre rating and the Stadt information from the review data on all pages of the website below. The desired output should look like this example:

         Arbeitsatmosphäre | Stadt
   1.      4.00            | Berlin
   2.      5.00            | Frankfurt
   3.      3.00            | Munich
   4.      5.00            | Berlin
   5.      4.00            | Berlin

The code below extracts the Pro data from all pages and works well. I tried to update it by adding two lists, Arbeitsatmosphäre and Stadt, and breaking the loop when the Arbeitsatmosphäre rating is missing, but my code does not work. Can you help?

import requests
from bs4 import BeautifulSoup

pro = []

with requests.Session() as session:
    # mark requests as XHR so the endpoint serves the review fragment
    session.headers = {
        'x-requested-with': 'XMLHttpRequest'
    }
    page = 1
    while True:
        print(f"Processing page {page}..")
        url = f'https://www.kununu.com/de/volkswagen/kommentare/{page}'
        response = session.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        # text of the <p> that follows each <h2>Pro</h2> heading
        new_comments = [
            pro.find_next_sibling('p').get_text()
            for pro in soup.find_all('h2', text='Pro')
        ]
        if not new_comments:
            print(f"No more comments. Page: {page}")
            break
        pro += new_comments
        print(pro)
        #print(len(pro))
        page += 1
print(pro)
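
For reference, the sibling-walking pattern above works because find_all('h2', text='Pro') only matches <h2> tags whose entire text is 'Pro', and find_next_sibling('p') then steps to the next <p> on the same level. A minimal, self-contained sketch (the HTML is hand-written for illustration, not the real kununu markup):

from bs4 import BeautifulSoup

# illustrative markup only -- not the actual kununu page structure
html = """
<h2>Pro</h2>
<p>Nice colleagues</p>
<h2>Contra</h2>
<p>Long hours</p>
"""
soup = BeautifulSoup(html, 'html.parser')

pros = [h.find_next_sibling('p').get_text()
        for h in soup.find_all('h2', text='Pro')]
print(pros)  # ['Nice colleagues']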

UPD: added the code that does not work below; I think there should be a simpler solution, though.

import requests
from bs4 import BeautifulSoup

Arbeit = []
Stadt = []

with requests.Session() as session:
    session.headers = {
        'x-requested-with': 'XMLHttpRequest'
    }
    page = 1
    while True:
        print(f"Processing page {page}..")
        url = f'https://www.kununu.com/de/volkswagen/kommentare/{page}'
        response = session.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        new_comments1 = [
            Arbeit.find_next_sibling('span').get_text()
            for Arbeit in soup.find_all('span', text='Arbeitsatmosphäre')
        ]
        new_comments2 = [
            Stadt.find_next_sibling('div').get_text()
            for Stadt in soup.find_all('div', text='Stadt')
        ]
        if not new_comments1:
            print(f"No more comments. Page: {page}")
            break
        Arbeit += new_comments1
        Stadt += new_comments2
        print(Arbeit)
        print(Stadt)
        #print(len(pro))
        page += 1
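
A likely reason the block above finds nothing is that find_all(..., text='Arbeitsatmosphäre') only matches tags whose entire string content is exactly that text; surrounding whitespace or nested markup makes the exact match fail. A more tolerant option is matching with a regular expression, shown here on illustrative HTML (not the real kununu markup):

import re
from bs4 import BeautifulSoup

# illustrative markup: the label carries extra whitespace, so an exact
# text='Arbeitsatmosphäre' match finds nothing
html = '<span> Arbeitsatmosphäre </span><span>4,00</span>'
soup = BeautifulSoup(html, 'html.parser')

exact = soup.find_all('span', text='Arbeitsatmosphäre')
fuzzy = soup.find_all('span', text=re.compile('Arbeitsatmosphäre'))
print(len(exact), len(fuzzy))  # 0 1

print(fuzzy[0].find_next_sibling('span').get_text())  # 4,00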

1 Answer

You can try:

import requests
from bs4 import BeautifulSoup
import pandas as pd

arbeit = []
firma = []
stadt = []
with requests.Session() as session:
    session.headers = {
        'x-requested-with': 'XMLHttpRequest'
    }
    page = 1
    while True:
        print(f"Processing page {page}..")
        url = f'https://www.kununu.com/de/volkswagen/kommentare/{page}'
        response = session.get(url)

        soup = BeautifulSoup(response.text, 'html.parser')
        articles = soup.find_all('article')
        print("Number of articles: " + str(len(articles)))
        for article in articles:
            # the first rating badge on each review card is the
            # Arbeitsatmosphäre rating
            rating_tags = article.find_all('span', {'class': 'rating-badge'})
            arbeit.append(rating_tags[0].text.strip())
            # the review-details list holds the Firma entry first, then Stadt
            detail_div = article.find_all('div', {'class': 'review-details'})[0]
            nodes = detail_div.find_all('li')
            firma_node = nodes[0]
            stadt_node = nodes[1]
            firma_node_div = firma_node.find_all('div')
            firma_name = firma_node_div[1].text.strip()
            firma.append(firma_name)

            stadt_node_div = stadt_node.find_all('div')
            stadt_name = stadt_node_div[1].text.strip()
            stadt.append(stadt_name)
        page += 1

        # stop once the response no longer contains pagination controls
        pagination = soup.find_all('div', {'class': 'paginationControl'})
        if not pagination:
            break

df = pd.DataFrame({'Arbeitsatmosphäre' : arbeit, 'Stadt' : stadt})
print(df)
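
If you then want to persist the table or compute a summary, a small follow-up sketch (the file name is only an example, and the decimal-comma handling assumes German-locale ratings like '4,00'):

# file name is only an example
df.to_csv('kununu_volkswagen.csv', index=False)

# ratings arrive as strings, possibly with a decimal comma, so convert
# before averaging
df['Arbeitsatmosphäre'] = pd.to_numeric(
    df['Arbeitsatmosphäre'].str.replace(',', '.', regex=False),
    errors='coerce')
print(df['Arbeitsatmosphäre'].mean())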
