Why doesn't my web crawler follow the next link that contains a keyword?


I've written a simple web crawler that will eventually follow only news links and scrape the article text into a database. I'm having trouble following the links from the source url. Here is the code so far:

import urlparse
from mechanize import Browser

url ="https://news.google.co.uk"

def spider(root, steps):
    urls = [root]
    visited =[root]
    counter = 0
    while counter < steps:
        step_url = scrape(urls)
        urls = []
        for u in step_url:
            if u not in visited:
                urls.append(u)
                visited.append(u)
        counter+=1
    return visited

def scrape(root):
    result_urls = []
    br = Browser()
    br.set_handle_robots(False)
    br.addheaders = [('User-agent', 'Chrome')]
    for url in root:
        try:
            br.open(url)
            keyWords = ['news','article','business', 'world']
            for link in br.links():
                newurl = urlparse.urljoin(link.base_url,link.url)
                result_urls.append(newurl)
                [newslinks for newslinks in result_urls if newslinks in keyWords]
                print newslinks
        except:
            print "scrape error"
    return result_urls

print spider(url, 2)

Edit: NLTK is used in

^{pr2}$

and the result is then added to the database.
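Purely as a hypothetical illustration of that step, tokenizing the scraped text with NLTK and writing it to a sqlite database could look something like the following; the table layout and the save_article name are assumptions, not the original code:

import sqlite3
from nltk import word_tokenize

def save_article(db_path, url, text):
    # hypothetical helper: tokenize the article text with NLTK, then store it
    tokens = word_tokenize(text)
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS articles (url TEXT, text TEXT, n_tokens INTEGER)")
    conn.execute("INSERT INTO articles VALUES (?, ?, ?)", (url, text, len(tokens)))
    conn.commit()
    conn.close()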


Tags: in, br, url, for, link, counter, root
1 Answer

Mechanize is not the best tool for what you want. The following fetches all the links and uses BeautifulSoup to pull the main text from the linked pages; we can use a dict to map each site name to the correct css selector, using a regex to extract the site key from the link and pass the right css selector to select:

url ="https://news.google.co.uk"


import requests
import re
from bs4 import BeautifulSoup

def get_links(start):
    cont = requests.get(start).content
    soup = BeautifulSoup(cont, "lxml")
    keys = ['news','article','business', 'world']
    # links are all in the  a tag inside the esc-layout-table table
    # where the a tag class is article
    return [a["url"] for a in soup.select(".esc-layout-table a.article") if any(k in a["url"] for k in keys)]



def parse_links_text(links, css_d):
    # use a regex to find out which site the link points to
    # so we can pull the appropriate selector from the dict
    r = re.compile("telegraph\.|bbc\.|dailymail\.|independent\.")
    for link in links:
        print(link)
        cont = requests.get(link).content
        soup = BeautifulSoup(cont, "lxml")
        css = r.search(link).group()
        p = [p.text for p in soup.select(css_d[css])]
        yield p

# map each page to its correct css selector to pull the main text
d = {"dailymail.": "p.mol-para-with-font","telegraph.":"#mainBodyArea",
     "bbc.": "div.story-body p","independent.":"div.text-wrapper p"}

for text in (parse_links_text(get_links(url), d)):
    print(text)

That pulls the body text for all the articles linked from the telegraph, dailymail, bbc and independent pages. You will have to add more potential selectors for other sites where a single tag does not get you all the data you want, and adjust them if the html changes.
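For example, supporting a fifth site would just mean adding another entry to the mapping and extending the regex; the guardian selector below is only an assumption and would need checking against the real page html:

# hypothetical extra mapping entry, the selector is an assumption
d["theguardian."] = "div.content__article-body p"

# the regex inside parse_links_text would also need the new key:
r = re.compile("telegraph\.|bbc\.|dailymail\.|independent\.|theguardian\.")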

A snippet of the output:

^{pr2}$

Of course you could select all the text from the paragraphs with p = [p.text for p in soup.select("p")], but that would include a lot of data you don't want. If you are only interested in certain pages, you can also filter on whether a match is found in the css_d dict, something like:

for link in links:
    cont = requests.get(link).content
    soup = BeautifulSoup(cont, "lxml")
    css = r.search(link)
    if not css: 
       continue
    css = css.group()
    yield [p.text for p in soup.select(css)]
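For completeness, a minimal sketch of that filter wrapped back into the parse_links_text generator, using the same names as above:

def parse_links_text(links, css_d):
    # only yield text for links whose domain matches one of the css_d keys
    r = re.compile("telegraph\.|bbc\.|dailymail\.|independent\.")
    for link in links:
        cont = requests.get(link).content
        soup = BeautifulSoup(cont, "lxml")
        css = r.search(link)
        if not css:
            continue
        yield [p.text for p in soup.select(css_d[css.group()])]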

As discussed in the comments, lxml is a good tool when you need flexibility; to get the sections we can use the following code:

from urlparse import urljoin

import requests
from lxml.etree import fromstring, HTMLParser

url = "https://news.google.co.uk"



def get_sections(start, sections):
    '''Pull the links for each of the sections we pass in, i.e. World, Business etc.'''
    cont = requests.get(start).content
    xml = fromstring(cont, HTMLParser())
    # section names are in span tags with the class section-name,
    # the section link itself is on the parent a tag
    secs = xml.xpath("//span[@class='section-name']")
    for sec in secs:
        _sec = sec.text.rsplit(None, 1)[0].lower().rstrip(".")
        if _sec in sections:
            yield _sec, urljoin(url, sec.xpath(".//parent::a/@href")[0])


def get_section_links(sec_url):
    '''Get all links from an individual section.'''
    cont = requests.get(sec_url).content
    xml = fromstring(cont, HTMLParser())
    seen = set()
    for url in xml.xpath("//div[@class='section-stream-content']//a/@url"):
        if url not in seen:
            yield url
        seen.add(url)

# set of sections we want
s = {'business', 'world', "sports", "u.k"}

for sec, link in get_sections(url, s):
    for sec_link in (get_section_links(link)):
        print(sec, sec_link)

So if we run the code above we get all the links for each section; below is a very small snippet from each section, it actually returns a lot of links:

(u'world', 'http://www.theguardian.com/commentisfree/2016/mar/21/new-york-millionaires-who-want-taxes-raised')
(u'world', 'http://www.abc.net.au/news/2016-03-22/berg-turnbull%27s-only-real-option-was-bluff-and-bravado/7264350')
(u'world', 'http://www.swissinfo.ch/eng/reuters/australian-pm-takes-bold-gamble sets-in-motion-july-2-poll/42037074')
(u'world', 'https://www.washingtonpost.com/news/checkpoint/wp/2016/03/21/these-are-the-new-u-s-military-bases-near-the-south-china-sea-china-isnt-impressed/')
(u'world', 'http://www.reuters.com/article/southchinasea-china-usa-idUSL3N16T3BH')
(u'world', 'http://atimes.com/2016/03/philippine-election-question-marks-sow-panic-in-south-china-sea/')
(u'world', 'http://www.manilatimes.net/what-if-china-attacks-bases-used-by-america/251946/')
(u'world', 'http://www.arabnews.com/world/news/898816')
(u'world', 'http://macaudailytimes.com.mo/koreas-seoul-north-korea-fires-five-short-range-projectiles.html')
(u'world', 'http://gulftoday.ae/portal/cb0e2530-0769-411d-9622-2e991191656b.aspx')
(u'world', 'http://38north.org/2016/03/aabrahamian032116/')
(u'u.k', 'http://www.irishnews.com/news/2016/03/22/news/judge-tells-madonna-and-richie-to-settle-rocco-dispute-458929/')
(u'u.k', 'http://www.marilynstowe.co.uk/2016/03/21/judge-urges-amicable-resolution-in-madonna-dispute-over-son/')
(u'u.k', 'http://www.mercurynews.com/celebrities/ci_29666212/judge-tells-madonna-and-guy-ritchie-get-it')
(u'u.k', 'http://www.telegraph.co.uk/news/celebritynews/madonna/12199922/Madonnas-UK-court-fight-with-Guy-Ritchie-over-son-Rocco-can-end-judge-rules.html')
(u'u.k', 'http://www.pbo.co.uk/news/boaty-mcboatface-leading-public-vote-to-name-200m-polar-research-ship-28429')
(u'u.k', 'http://www.theguardian.com/environment/shortcuts/2016/mar/21/from-bell-end-boaty-mcboatface-trouble-letting-public-name-things')
(u'u.k', 'http://www.independent.co.uk/news/uk/boaty-mcboatface-debacle-shows-the-perils-of-crowdsourcing-opinion-from-hooty-mcowlface-to-mr-a6944801.html')
(u'u.k', 'http://www.sacbee.com/news/nation-world/world/article67322252.html')
(u'u.k', 'http://www.westerndailypress.co.uk/Jury-discharged-manslaughter-case-Thomas-Orchard/story-28964162-detail/story.html')
(u'u.k', 'http://www.exeterexpressandecho.co.uk/Breaking-Thomas-Orchard-manslaughter-trial-jury/story-28963859-detail/story.html')
(u'u.k', 'http://www.theguardian.com/uk-news/2016/mar/21/thomas-orchard-trial-jury-discharged-judge-halts-proceedings')
(u'u.k', 'http://www.ft.com/cms/s/0/0bf3e966-ef57-11e5-9f20-c3a047354386.html')
(u'u.k', 'http://www.theweek.co.uk/london-mayor-election-2016/62681/london-mayor-election-2016-whos-in-the-running-as-starting-gun')
(u'business', 'https://uk.finance.yahoo.com/news/companies-may-soon-stop-reporting-162707837.html')
(u'business', 'http://www.theweek.co.uk/70785/why-youre-about-to-stop-getting-quarterly-reports-on-your-investments')
(u'business', 'http://uk.reuters.com/article/uk-starwood-hotels-m-a-marriott-idUKKCN0WN142')
(u'business', 'http://www.reuters.com/article/us-global-oil-idUSKCN0WN00I')
(u'business', 'http://www.digitallook.com/news/commodities/commodities-oil-futures-recoup-previous-sessions-losses 1087119.html')
(u'business', 'http://news.sky.com/story/1664056/new-top-dog-at-pets-at-home-as-ceo-retires')
(u'business', 'http://money.aol.co.uk/2016/03/21/sky-tv-price-hike-shock/')
(u'business', 'http://www.nzherald.co.nz/world/news/article.cfm?c_id=2&objectid=11609694')
(u'business', 'http://www.dailymail.co.uk/sciencetech/article-3502838/The-Flying-Bum-ready-lift-World-s-largest-aircraft-Airlander-10-fitted-fins-engines-ahead-flight.html')
(u'business', 'http://www.business-standard.com/article/pti-stories/world-s-longest-aircraft-revealed-in-new-pictures-116032000569_1.html')
(u'sports', 'http://www.telegraph.co.uk/football/2016/03/21/gary-neville-consulted-roy-hodgson-on-england-delay/')
(u'sports', 'http://www.dailymail.co.uk/sport/football/article-3502767/Gary-Neville-leaving-Valencia-join-England-gritted-teeth-feels-like-La-Liga-club-giving-fans-chant-manager-now.html')
(u'sports', 'http://www.irishexaminer.com/sport/soccer/gary-neville-in-firing-line-as-valencia-lose-again-388634.html')
(u'sports', 'http://timesofindia.indiatimes.com/sports/tennis/top-stories/Male-tennis-players-should-earn-more-than-females-Djokovic/articleshow/51499959.cms')
(u'sports', 'http://www.sport24.co.za/soccer/livescoring?mid=23948674&st=football')
(u'sports', 'http://www.dispatch.com/content/stories/sports/2016/03/21/0321-serena-williams-rips-indian-wells-ceo.html')
(u'sports', 'http://www.bbc.co.uk/sport/football/35864765')
(u'sports', 'http://indianexpress.com/article/sports/football/joachim-loew-throws-max-kruse-out-of-germany-squad/')
(u'sports', 'http://www.si.com/planet-futbol/2016/03/21/max-kruse-germany-kicked-jogi-low')
(u'sports', 'http://www.dw.com/en/coach-joachim-l%C3%B6w-drops-max-kruse-from-german-national-team/a-19132035')
(u'sports', 'http://www.bbc.co.uk/sport/football/35865092')
(u'sports', 'http://news.sky.com/story/1664218')
(u'sports', 'http://www.theguardian.com/business/2016/mar/21/sports-direct-founder-mike-ashley-snubs-call-mps-parliamentary-select-committee')
(u'sports', 'http://www.mirror.co.uk/news/business/sports-direct-boss-mike-ashley-7604067')
(u'sports', 'http://www.independent.ie/sport/soccer/mike-ashley-says-he-is-wedded-to-newcastle-even-if-they-go-down-34558617.html')
(u'sports', 'http://www.heraldscotland.com/sport/14373924.Michael_Carrick_praises_performance_after_United_win_Manchester_derby/')
(u'sports', 'http://www.dorsetecho.co.uk/sport/national/14373773.Michael_Carrick_hails_vital_Manchester_derby_victory/')

If we instead return a set from get_section_links, we can pass it straight to the function that parses the text:

def get_section_links(sec_url):
    cont = requests.get(sec_url).content
    xml = fromstring(cont, HTMLParser())
    return set(xml.xpath("//div[@class='section-stream-content']//a/@url"))
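A quick usage sketch tying the pieces together, assuming the filtered, css based parse_links_text and the d mapping from earlier in the answer:

sections = {'business', 'world', "sports", "u.k"}

for sec, sec_page in get_sections(url, sections):
    links = get_section_links(sec_page)          # set of article urls for one section
    for paragraphs in parse_links_text(links, d):
        # paragraphs is the list of <p> texts for one article
        print(sec, " ".join(paragraphs)[:100])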

So, parsing with xpaths via lxml, for the handful of sites we already parse we can add a bit more logic to catch variations:

# map each page to its correct css selector to pull the main text
d = {"dailymail.": "//div[@itemprop='articleBody']//p",
     "telegraph.": "//div[@id='mainBodyArea']//p",
     "bbc.": "//div[@class='story-body']//p",
     "independent.": "//div[@class='text-wrapper']//p",
     "www.mirror.": "//*[@class='live-now-entry' or @class='lead-entry' or @itemprop='articleBody']//p"}


import logging

logger = logging.getLogger(__file__)
logging.basicConfig()
logger.setLevel(logging.DEBUG)


def parse_links_text(links, xpath_d):
    # use a regex to find out which site the link points to
    # so we can pull the appropriate xpath from the dict
    r = re.compile("telegraph\.|bbc\.|dailymail\.|independent\.|www\.mirror\.")
    for link in links:
        try:
            cont = requests.get(link).content
        except requests.exceptions.RequestException as e:
            logger.error(e)
            continue
        xml = fromstring(cont, HTMLParser())
        xpath = r.search(link)
        if xpath:
            p = "".join(filter(None, ("".join(p.xpath("normalize-space(.//text())"))
                                      for p in xml.xpath(xpath_d[xpath.group()]))))
            if p:
                yield p
        else:
            logger.debug("No match for {}".format(link))

Again, you will have to decide which sites you want to hit and find the correct xpaths to pull the main article text, but this should get you well on your way. When I have more time I will add some logic to run the requests asynchronously.
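One simple way to do that, sketched here only as an illustration, is a thread pool around requests.get (multiprocessing.dummy gives a thread based Pool in both Python 2 and 3); the fetch and fetch_all names are just for the example, and the parsing functions above would then work on the already downloaded content instead of calling requests themselves:

from multiprocessing.dummy import Pool  # thread pool, works with blocking requests calls

def fetch(link):
    # download one page, returning (link, content) or (link, None) on error
    try:
        return link, requests.get(link).content
    except requests.exceptions.RequestException as e:
        logger.error(e)
        return link, None

def fetch_all(links, workers=8):
    # download the pages concurrently and drop the failures
    pool = Pool(workers)
    try:
        return [(link, cont) for link, cont in pool.map(fetch, links) if cont is not None]
    finally:
        pool.close()
        pool.join()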
