Pagination with Python

Published 2024-06-28 11:32:39


I'm trying to scrape the following data from a website.

I've managed to get the data I need, but I'm struggling to paginate through the site. I want the titles of all reviews, not just those on the first page.

The page links follow the format http://www.airlinequality.com/airline-reviews/airasia-x/page/3/, where 3 is the page number.

I tried to loop over these URLs with the snippet below, but the scraping of the pagination doesn't work.

# follow pagination links
for href in response.css('#main > section.layout-section.layout-2.closer-top > div.col-content > div > article > ul li a'):
    yield response.follow(href, self.parse)
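If the number of pages is known (or capped), one alternative to following the on-page links is to generate the page URLs directly from the /page/N/ pattern described above. A minimal sketch, assuming page 1 lives at the bare review URL and later pages use the /page/N/ suffix:

```python
def review_page_urls(base_url, last_page):
    """Build the paginated review URLs for one airline.

    Assumes page 1 is the bare review URL and pages 2..last_page
    use the /page/N/ suffix, as in the URLs shown above.
    """
    urls = [base_url + "/"]
    for n in range(2, last_page + 1):
        urls.append(f"{base_url}/page/{n}/")
    return urls

urls = review_page_urls(
    "http://www.airlinequality.com/airline-reviews/airasia-x", 3)
# In a scrapy spider, each of these could be used as start_urls
# or passed to response.follow.
```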

Can you help me? Thanks in advance.


To iterate over the airlines, I solved that part with the following code:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re

req = Request("http://www.airlinequality.com/review-pages/a-z-airline-reviews/", headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req)
soupAirlines = BeautifulSoup(html_page, "lxml")

URL_LIST = []
for link in soupAirlines.findAll('a', attrs={'href': re.compile("^/airline-reviews/")}):
    URL_LIST.append("http://www.airlinequality.com" + link.get('href'))

1 Answer

Assuming scrapy isn't a hard requirement, the BeautifulSoup code below will get you all of the reviews, parse out the metadata, and finally output a pandas DataFrame. The specific attributes extracted from each review are:

  • Review title
  • Rating (when available)
  • Rating scale (i.e. out of 10)
  • Full review text
  • Review date stamp
  • Whether the review is verified

There is a dedicated function to handle the pagination. It is recursive: if there is a next page, the function calls itself with the new URL; otherwise the recursion ends.
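Stripped of the HTML parsing, the recursive crawl pattern can be sketched like this, where `fetch` and `find_next` are hypothetical stand-ins for the request and next-page-lookup steps:

```python
def crawl(url, fetch, find_next, results):
    """Collect results page by page; recurse while a next page exists."""
    page = fetch(url)                 # download and parse one page
    results.extend(page["reviews"])   # accumulate this page's reviews
    next_url = find_next(page)        # None when pagination ends
    if next_url is not None:
        crawl(next_url, fetch, find_next, results)

# Tiny in-memory demo of the same control flow:
pages = {
    "/page/1": {"reviews": ["a", "b"], "next": "/page/2"},
    "/page/2": {"reviews": ["c"], "next": None},
}
out = []
crawl("/page/1", pages.get, lambda p: p["next"], out)
# out is now ["a", "b", "c"]
```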

from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

# define global parameters
URL = 'http://www.airlinequality.com/airline-reviews/airasia-x'
BASE_URL = 'http://www.airlinequality.com'
MASTER_LIST = []

def parse_review(review):
    """
    Parse important review meta data such as ratings, time of review, title, 
    etc.

    Parameters
    ----------
    review : bs4.element.Tag
        a single review tag

    Returns
    -------
    outdf : pd.DataFrame
        DataFrame representation of the parsed review
    """

    # get review header
    header = review.find('h2').text

    # get the numerical rating
    base_review = review.find('div', {'itemprop': 'reviewRating'})
    if base_review is None:
        rating = None
        rating_out_of = None
    else:
        rating = base_review.find('span', {'itemprop': 'ratingValue'}).text
        rating_out_of = base_review.find('span', {'itemprop': 'bestRating'}).text

    # get time of review
    time_of_review = review.find('h3').find('time')['datetime']

    # get whether review is verified
    if review.find('em'):
        verified = review.find('em').text
    else:
        verified = None

    # get actual text of review
    review_text = review.find('div', {'class': 'text_content'}).text

    outdf = pd.DataFrame({'header': header,
                         'rating': rating,
                         'rating_out_of': rating_out_of,
                         'time_of_review': time_of_review,
                         'verified': verified,
                         'review_text': review_text}, index=[0])

    return outdf

def return_next_page(soup):
    """
    return next_url if pagination continues else return None

    Parameters
    ----------
    soup : BeautifulSoup
        parsed page

    Returns
    -------
    next_url : str or None
        URL of the next page, or None if pagination ends
    """
    next_url = None
    cur_page = soup.find('a', {'class': 'active'}, href=re.compile('airline-reviews/airasia'))
    cur_href = cur_page['href']
    # check if next page exists
    search_next = cur_page.findNext('li').get('class')
    if not search_next:
        next_page_href = cur_page.findNext('li').find('a')['href']
        next_url = BASE_URL + next_page_href
    return next_url

def create_soup_reviews(url):
    """
    iterate over each review, extract out content, and handle next page logic 
    through recursion

    Parameters
    ----------
    url : str
        input url
    """
    # use global MASTER_LIST to extend list of all reviews 
    global MASTER_LIST
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    reviews = soup.findAll('article', {'itemprop': 'review'})
    review_list = [parse_review(review) for review in reviews]
    MASTER_LIST.extend(review_list)
    next_url = return_next_page(soup)
    if next_url is not None:
        create_soup_reviews(next_url)


create_soup_reviews(URL)


finaldf = pd.concat(MASTER_LIST)
finaldf.shape # (339, 6)

finaldf.head(2)
# header    rating  rating_out_of   review_text time_of_review  verified
#"if approved I will get my money back" 1   10  ✅ Trip Verified | Kuala Lumpur to Melbourne. ...    2018-08-07  Trip Verified
#   "a few minutes error"   3   10  ✅ Trip Verified | I've flied with AirAsia man...    2018-08-06  Trip Verified

If I were doing the whole site, I'd use the approach above and iterate over each airline here. I'd modify the code to include a column called 'airline' so you know which airline each review corresponds to.
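One way to fill that 'airline' column, under the same URL scheme, is to parse the slug out of each review URL. A sketch; `airline_from_url` is a hypothetical helper, not part of the code above:

```python
import re

def airline_from_url(url):
    """Extract the airline slug from an airline-reviews URL.

    Assumes URLs follow the /airline-reviews/<slug>/ pattern used
    by the URL_LIST code above.
    """
    m = re.search(r"/airline-reviews/([^/]+)", url)
    return m.group(1) if m else None

# Inside create_soup_reviews, each parsed DataFrame could then get:
#     df['airline'] = airline_from_url(url)

print(airline_from_url(
    "http://www.airlinequality.com/airline-reviews/airasia-x/page/3/"))
# airasia-x
```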
