Pagination with Python

Published 2024-06-28 11:32:39


I'm trying to scrape the following data from a website.

I've managed to get the data I need, but I'm struggling to paginate through the site. I want the titles of all reviews, not just those on the first page.

The page links follow the format http://www.airlinequality.com/airline-reviews/airasia-x/page/3/, where 3 is the page number.

I tried to loop over these URLs with the snippet below, but the scraping of the pagination doesn't work.

# follow pagination links
for href in response.css('#main > section.layout-section.layout-2.closer-top > div.col-content > div > article > ul li a'):
    yield response.follow(href, self.parse)
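If the number of pages is known (or capped), one alternative to following the on-page links is to generate the page URLs directly from the /page/N/ pattern described above. A minimal sketch, assuming page 1 lives at the bare review URL and later pages use the /page/N/ suffix:

```python
def review_page_urls(base_url, last_page):
    """Build the paginated review URLs for one airline.

    Assumes page 1 is the bare review URL and pages 2..last_page
    use the /page/N/ suffix, as in the URLs shown above.
    """
    urls = [base_url + "/"]
    for n in range(2, last_page + 1):
        urls.append(f"{base_url}/page/{n}/")
    return urls

urls = review_page_urls(
    "http://www.airlinequality.com/airline-reviews/airasia-x", 3)
# In a scrapy spider, each of these could be used as start_urls
# or passed to response.follow.
```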

Can you help me? Thanks in advance.


To iterate over the airlines, I solved that part with the following code:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re

req = Request("http://www.airlinequality.com/review-pages/a-z-airline-reviews/", headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req)
soupAirlines = BeautifulSoup(html_page, "lxml")

URL_LIST = []
for link in soupAirlines.findAll('a', attrs={'href': re.compile("^/airline-reviews/")}):
    URL_LIST.append("http://www.airlinequality.com" + link.get('href'))

1 Answer

Assuming scrapy isn't a hard requirement, the BeautifulSoup code below will get you all of the reviews, parse out the metadata, and finally output a pandas DataFrame. The specific attributes extracted from each review are:

  • Review title
  • Rating (when available)
  • Rating scale (i.e. out of 10)
  • Full review text
  • Review date stamp
  • Whether the review is verified

There is a dedicated function to handle the pagination. It is recursive: if there is a next page, the function calls itself with the new URL; otherwise the recursion ends.
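Stripped of the HTML parsing, the recursive crawl pattern can be sketched like this, where `fetch` and `find_next` are hypothetical stand-ins for the request and next-page-lookup steps:

```python
def crawl(url, fetch, find_next, results):
    """Collect results page by page; recurse while a next page exists."""
    page = fetch(url)                 # download and parse one page
    results.extend(page["reviews"])   # accumulate this page's reviews
    next_url = find_next(page)        # None when pagination ends
    if next_url is not None:
        crawl(next_url, fetch, find_next, results)

# Tiny in-memory demo of the same control flow:
pages = {
    "/page/1": {"reviews": ["a", "b"], "next": "/page/2"},
    "/page/2": {"reviews": ["c"], "next": None},
}
out = []
crawl("/page/1", pages.get, lambda p: p["next"], out)
# out is now ["a", "b", "c"]
```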

from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

# define global parameters
URL = 'http://www.airlinequality.com/airline-reviews/airasia-x'
BASE_URL = 'http://www.airlinequality.com'
MASTER_LIST = []

def parse_review(review):
    """
    Parse important review meta data such as ratings, time of review, title, 
    etc.

    Parameters
    ----------
    review : bs4.element.Tag
        a single review tag

    Returns
    -------
    outdf : pd.DataFrame
        DataFrame representation of the parsed review
    """

    # get review header
    header = review.find('h2').text

    # get the numerical rating
    base_review = review.find('div', {'itemprop': 'reviewRating'})
    if base_review is None:
        rating = None
        rating_out_of = None
    else:
        rating = base_review.find('span', {'itemprop': 'ratingValue'}).text
        rating_out_of = base_review.find('span', {'itemprop': 'bestRating'}).text

    # get time of review
    time_of_review = review.find('h3').find('time')['datetime']

    # get whether review is verified
    if review.find('em'):
        verified = review.find('em').text
    else:
        verified = None

    # get actual text of review
    review_text = review.find('div', {'class': 'text_content'}).text

    outdf = pd.DataFrame({'header': header,
                         'rating': rating,
                         'rating_out_of': rating_out_of,
                         'time_of_review': time_of_review,
                         'verified': verified,
                         'review_text': review_text}, index=[0])

    return outdf

def return_next_page(soup):
    """
    return next_url if pagination continues else return None

    Parameters
    ----------
    soup : BeautifulSoup
        parsed page

    Returns
    -------
    next_url : str or None
        URL of the next page, or None if pagination ends
    """
    next_url = None
    cur_page = soup.find('a', {'class': 'active'}, href=re.compile('airline-reviews/airasia'))
    cur_href = cur_page['href']
    # check if next page exists
    search_next = cur_page.findNext('li').get('class')
    if not search_next:
        next_page_href = cur_page.findNext('li').find('a')['href']
        next_url = BASE_URL + next_page_href
    return next_url

def create_soup_reviews(url):
    """
    iterate over each review, extract out content, and handle next page logic 
    through recursion

    Parameters
    ----------
    url : str
        input url
    """
    # use global MASTER_LIST to extend list of all reviews 
    global MASTER_LIST
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    reviews = soup.findAll('article', {'itemprop': 'review'})
    review_list = [parse_review(review) for review in reviews]
    MASTER_LIST.extend(review_list)
    next_url = return_next_page(soup)
    if next_url is not None:
        create_soup_reviews(next_url)


create_soup_reviews(URL)


finaldf = pd.concat(MASTER_LIST)
finaldf.shape # (339, 6)

finaldf.head(2)
# header    rating  rating_out_of   review_text time_of_review  verified
#"if approved I will get my money back" 1   10  ✅ Trip Verified | Kuala Lumpur to Melbourne. ...    2018-08-07  Trip Verified
#   "a few minutes error"   3   10  ✅ Trip Verified | I've flied with AirAsia man...    2018-08-06  Trip Verified

If I were doing the whole site, I'd use the approach above and iterate over each airline here. I'd modify the code to include a column called 'airline' so you know which airline each review corresponds to.
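One way to fill that 'airline' column, under the same URL scheme, is to parse the slug out of each review URL. A sketch; `airline_from_url` is a hypothetical helper, not part of the code above:

```python
import re

def airline_from_url(url):
    """Extract the airline slug from an airline-reviews URL.

    Assumes URLs follow the /airline-reviews/<slug>/ pattern used
    by the URL_LIST code above.
    """
    m = re.search(r"/airline-reviews/([^/]+)", url)
    return m.group(1) if m else None

# Inside create_soup_reviews, each parsed DataFrame could then get:
#     df['airline'] = airline_from_url(url)

print(airline_from_url(
    "http://www.airlinequality.com/airline-reviews/airasia-x/page/3/"))
# airasia-x
```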
