Can't scrape the full content of all the urls used within my scraper

Posted 2024-09-30 01:31:11


I've written a scraper in Python that uses the BeautifulSoup library to parse all the names across the different pages of a website. I could have managed it if all the urls were paginated the same way, but they aren't: some urls have pagination and some don't, because their content is small.

My question is: how can I handle them all in a single function, whether or not they are paginated?

My initial attempt (it can only parse the content of the first page of each url):

import requests 
from bs4 import BeautifulSoup

urls = {
    'https://www.mobilehome.net/mobile-home-park-directory/maine/all',
    'https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all',
    'https://www.mobilehome.net/mobile-home-park-directory/new-hampshire/all',
    'https://www.mobilehome.net/mobile-home-park-directory/vermont/all'
}

def get_names(link):
    r = requests.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    for items in soup.select("td[class='table-row-price']"):
        name = items.select_one("h2 a").text
        print(name)

if __name__ == '__main__':
    for url in urls:
        get_names(url)

If every url had pagination I could have run through the whole process, but not all of them do. So how can I scrape them all, with or without pagination?


Tags: name, https, url, park, content, home, get, net
2 Answers

It seems I've found a very robust solution to this problem. The approach is iterative: it first checks whether the current page contains a next-page url. If it finds one, it follows that url and repeats the process. If a link has no pagination, the scraper stops and moves on to the next link.

Here it is:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

urls = [
    'https://www.mobilehome.net/mobile-home-park-directory/alaska/all',
    'https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all',
    'https://www.mobilehome.net/mobile-home-park-directory/maine/all',
    'https://www.mobilehome.net/mobile-home-park-directory/vermont/all'
]

def get_names(link):
    while True:
        r = requests.get(link)
        soup = BeautifulSoup(r.text, "lxml")
        for items in soup.select("td[class='table-row-price']"):
            name = items.select_one("h2 a").text
            print(name)

        nextpage = soup.select_one(".pagination a.next_page")
        if not nextpage:
            break  # no pagination url on this page: stop and try another link

        link = urljoin(link, nextpage.get("href"))

if __name__ == '__main__':
    for url in urls:
        get_names(url)
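The loop above mixes fetching and parsing. As a minimal sketch (using the same selectors as the answer; the inline HTML sample is invented for illustration), the parsing step can be factored into a pure function, which makes the next-page detection testable without hitting the network:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def parse_page(html, base_url):
    """Return (names, next_url) for one page; next_url is None when unpaginated."""
    soup = BeautifulSoup(html, "html.parser")
    names = [td.select_one("h2 a").get_text(strip=True)
             for td in soup.select("td.table-row-price")]
    nxt = soup.select_one(".pagination a.next_page")
    next_url = urljoin(base_url, nxt["href"]) if nxt else None
    return names, next_url

# Invented sample page mimicking the site's markup:
sample = """
<table><tr><td class="table-row-price"><h2><a>Forest Park MHP</a></h2></td></tr></table>
<div class="pagination"><a class="next_page" href="/page/2">Next</a></div>
"""
names, next_url = parse_page(sample, "https://www.mobilehome.net/x/all")
# names == ['Forest Park MHP']
# next_url == 'https://www.mobilehome.net/page/2'
```

The fetch loop then reduces to: call `parse_page`, print the names, and follow `next_url` until it comes back as `None`.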

This solution tries to find the pagination `a` tags. If any pagination is found, all pages are scraped when the user iterates over an instance of the class PageScraper. Otherwise, only the first result (a single page) is crawled:

import contextlib

import requests
from bs4 import BeautifulSoup as soup

def has_pagination(f):
    # Guard: raise if the scraped page exposes no pagination links.
    def wrapper(cls):
        if not cls._pages:
            raise ValueError('No pagination found')
        return f(cls)
    return wrapper

class PageScraper:
    def __init__(self, url: str):
        self.url = url
        self._home_page = requests.get(self.url).text
        pagination = soup(self._home_page, 'html.parser').find('div', {'class': 'pagination'})
        # Drop the trailing "Next" link; an unpaginated page yields an empty list.
        self._pages = [a.text for a in pagination.find_all('a')][:-1] if pagination else []

    @property
    def first_page(self):
        return [i.find('h2', {'class': 'table-row-heading'}).text
                for i in soup(self._home_page, 'html.parser').find_all('td', {'class': 'table-row-price'})]

    @has_pagination
    def __iter__(self):
        for p in self._pages:
            page = requests.get(f'{self.url}/page/{p}').text
            yield [i.find('h2', {'class': 'table-row-heading'}).text
                   for i in soup(page, 'html.parser').find_all('td', {'class': 'table-row-price'})]

    @classmethod
    @contextlib.contextmanager
    def feed_link(cls, link):
        # A @contextmanager generator must yield exactly once, so collect
        # every page first and hand back a single list of per-page results.
        results = cls(link)
        pages = [results.first_page]
        try:
            pages.extend(results)
        except ValueError:
            pass  # no pagination: keep only the first page
        yield pages

The class constructor finds any pagination, and the __iter__ method scrapes all the pages only if pagination links were found. For example, https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all has no pagination. Therefore:

r = PageScraper('https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all')
d = [i for i in r]

ValueError: No pagination found

However, the content of the first page can still be retrieved:

print(r.first_page)
['Forest Park MHP', 'Gansett Mobile Home Park', 'Meadowlark Park', 'Indian Cedar Mobile Homes Inc', 'Sherwood Valley Adult Mobile', 'Tripp Mobile Home Park', 'Ramblewood Estates', 'Countryside Trailer Park', 'Village At Wordens Pond', 'Greenwich West Inc', 'Dadson Mobile Home Estates', "Oliveira's Garage", 'Tuckertown Village Clubhouse', 'Westwood Estates']

For fully paginated pages, though, all the generated pages can be scraped:

r = PageScraper('https://www.mobilehome.net/mobile-home-park-directory/maine/all')
d = [i for i in r]

PageScraper.feed_link carries out this check automatically: it outputs the first page followed by all subsequent results if pagination is found, or only the first page if no pagination exists:

urls = {'https://www.mobilehome.net/mobile-home-park-directory/maine/all',
        'https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all',
        'https://www.mobilehome.net/mobile-home-park-directory/vermont/all',
        'https://www.mobilehome.net/mobile-home-park-directory/new-hampshire/all'}
for url in urls:
   with PageScraper.feed_link(url) as r:
      print(r)
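Since the results arrive as per-page lists of names, a small helper (hypothetical, not part of the answer above; it assumes each page is a plain list of name strings, as in the example output) can merge them into one ordered, de-duplicated list:

```python
def flatten_names(pages):
    """Merge per-page name lists into one list, preserving order and dropping duplicates."""
    seen = set()
    merged = []
    for page in pages:
        for name in page:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    return merged

# Invented sample data shaped like the scraper's per-page output:
pages = [['Forest Park MHP', 'Meadowlark Park'],
         ['Meadowlark Park', 'Westwood Estates']]
flatten_names(pages)
# → ['Forest Park MHP', 'Meadowlark Park', 'Westwood Estates']
```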
