Answers to: Can't exhaust the content of all the identical URLs used within my scraper



1 Answer

Anonymous, 1 day ago


This solution looks for pagination <code>a</code> tags. If any pagination is found, all pages are scraped when the user iterates over an instance of the <code>PageScraper</code> class. Otherwise, only the first result (a single page) is scraped:

<pre><code>import contextlib

import requests
from bs4 import BeautifulSoup as soup


def has_pagination(f):
    # Guard: refuse to iterate when no pagination links were found.
    def wrapper(self):
        if not self._pages:
            raise ValueError('No pagination found')
        return f(self)
    return wrapper


class PageScraper:
    def __init__(self, url: str):
        self.url = url
        self._home_page = requests.get(self.url).text
        # Collect the pagination link texts, dropping the last link.
        # The list stays empty when the pagination block is missing.
        pagination = soup(self._home_page, 'html.parser').find('div', {'class': 'pagination'})
        self._pages = [a.text for a in pagination.find_all('a')][:-1] if pagination else []

    @staticmethod
    def _names(html):
        # Names are in h2.table-row-heading inside each td.table-row-price.
        return [td.find('h2', {'class': 'table-row-heading'}).text
                for td in soup(html, 'html.parser').find_all('td', {'class': 'table-row-price'})]

    @property
    def first_page(self):
        return self._names(self._home_page)

    @has_pagination
    def __iter__(self):
        for p in self._pages:
            yield self._names(requests.get(f'{self.url}/page/{p}').text)

    @classmethod
    @contextlib.contextmanager
    def feed_link(cls, link):
        results = cls(link)
        pages = [results.first_page]
        if results._pages:
            pages.extend(results)
        # A context manager must yield exactly once, so all pages are returned together.
        yield pages
</code></pre>

The class constructor looks for any pagination, and the <code>__iter__</code> method yields the paginated pages only if pagination links were found. For instance, <a href="https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all" rel="nofollow noreferrer">https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all</a> has no pagination. Iterating over a <code>PageScraper</code> instance <code>r</code> created for that URL therefore raises:

<blockquote> ValueError: No pagination found </blockquote>

However, the contents of the first page can still be retrieved:

<pre><code>print(r.first_page)
['Forest Park MHP', 'Gansett Mobile Home Park', 'Meadowlark Park', 'Indian Cedar Mobile Homes Inc', 'Sherwood Valley Adult Mobile', 'Tripp Mobile Home Park', 'Ramblewood Estates', 'Countryside Trailer Park', 'Village At Wordens Pond', 'Greenwich West Inc', 'Dadson Mobile Home Estates', "Oliveira's Garage", 'Tuckertown Village Clubhouse', 'Westwood Estates']
</code></pre>

For pages that do have pagination, all of the resulting pages can be scraped:

<pre><code>r = PageScraper('https://www.mobilehome.net/mobile-home-park-directory/maine/all')
d = [i for i in r]
</code></pre>

<code>PageScraper.feed_link</code> performs this check automatically. It yields a list that holds the first page's results, followed by the results of every subsequent page if pagination is found. If there is no pagination, the list holds only the first page:

<pre><code>urls = {'https://www.mobilehome.net/mobile-home-park-directory/maine/all',
        'https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all',
        'https://www.mobilehome.net/mobile-home-park-directory/vermont/all',
        'https://www.mobilehome.net/mobile-home-park-directory/new-hampshire/all'}
for url in urls:
    with PageScraper.feed_link(url) as r:
        print(r)
</code></pre>
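
If the pages should be consumed lazily, one at a time, the same "first page, then every paginated page" logic can also be sketched as a plain generator. This is only an illustration under stated assumptions: <code>iter_all_pages</code>, <code>fake_fetch</code>, and <code>FAKE_SITE</code> are hypothetical names that do not appear in the answer, and the HTTP request is replaced by a dictionary lookup so the snippet runs offline.

```python
from typing import Callable, Iterable, Iterator, List


def iter_all_pages(
    first_page: List[str],
    page_ids: Iterable[str],
    fetch_page: Callable[[str], List[str]],
) -> Iterator[List[str]]:
    """Yield the first page, then one result list per pagination entry."""
    yield first_page
    # With no pagination entries the loop body never runs,
    # so only the first page is produced.
    for page_id in page_ids:
        yield fetch_page(page_id)


# Hypothetical stand-in for the "request a page and parse it" step.
FAKE_SITE = {'2': ['Park C', 'Park D'], '3': ['Park E']}


def fake_fetch(page_id: str) -> List[str]:
    return FAKE_SITE[page_id]


if __name__ == '__main__':
    for names in iter_all_pages(['Park A', 'Park B'], ['2', '3'], fake_fetch):
        print(names)
```

With the <code>PageScraper</code> class, the arguments would presumably be <code>first_page</code>, the collected pagination entries, and a function that requests <code>f'{url}/page/{p}'</code> and parses the response. Unlike a context manager, a generator may yield any number of times, so no results need to be held in memory at once.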
