Why doesn't my web crawler follow the next link that contains a keyword?


I've written a simple web crawler that will eventually follow only news links and scrape the article text into a database. I'm having trouble following the links from the source url. Here is the code so far:

import urlparse
from mechanize import Browser

url ="https://news.google.co.uk"

def spider(root, steps):
    urls = [root]
    visited =[root]
    counter = 0
    while counter < steps:
        step_url = scrape(urls)
        urls = []
        for u in step_url:
            if u not in visited:
                urls.append(u)
                visited.append(u)
        counter+=1
    return visited

def scrape(root):
    result_urls = []
    br = Browser()
    br.set_handle_robots(False)
    br.addheaders = [('User-agent', 'Chrome')]
    for url in root:
        try:
            br.open(url)
            keyWords = ['news','article','business', 'world']
            for link in br.links():
                newurl = urlparse.urljoin(link.base_url,link.url)
                result_urls.append(newurl)
                [newslinks for newslinks in result_urls if newslinks in keyWords]
                print newslinks
        except:
            print "scrape error"
    return result_urls

print spider(url, 2)

Edit: NLTK is used in

^{pr2}$

and the result is then added to the database.
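Purely as a hypothetical illustration of that step, tokenizing the scraped text with NLTK and writing it to a sqlite database could look something like the following; the table layout and the save_article name are assumptions, not the original code:

import sqlite3
from nltk import word_tokenize

def save_article(db_path, url, text):
    # hypothetical helper: tokenize the article text with NLTK, then store it
    tokens = word_tokenize(text)
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS articles (url TEXT, text TEXT, n_tokens INTEGER)")
    conn.execute("INSERT INTO articles VALUES (?, ?, ?)", (url, text, len(tokens)))
    conn.commit()
    conn.close()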


Tags: in, br, url, for, link, counter, root
1 Answer

Mechanize is not the best tool for what you want. The following fetches all the links and uses BeautifulSoup to pull the main text from the linked pages; we can use a dict to map each site name to the correct css selector, using a regex to extract the site key from the link and pass the right css selector to select:

url ="https://news.google.co.uk"


import requests
import re
from bs4 import BeautifulSoup

def get_links(start):
    cont = requests.get(start).content
    soup = BeautifulSoup(cont, "lxml")
    keys = ['news','article','business', 'world']
    # links are all in the  a tag inside the esc-layout-table table
    # where the a tag class is article
    return [a["url"] for a in soup.select(".esc-layout-table a.article") if any(k in a["url"] for k in keys)]



def parse_links_text(links, css_d):
    # use a regex to find out which site the link points to
    # so we can pull the appropriate selector from the dict
    r = re.compile("telegraph\.|bbc\.|dailymail\.|independent\.")
    for link in links:
        print(link)
        cont = requests.get(link).content
        soup = BeautifulSoup(cont, "lxml")
        css = r.search(link).group()
        p = [p.text for p in soup.select(css_d[css])]
        yield p

# map each page to its correct css selector to pull the main text
d = {"dailymail.": "p.mol-para-with-font","telegraph.":"#mainBodyArea",
     "bbc.": "div.story-body p","independent.":"div.text-wrapper p"}

for text in (parse_links_text(get_links(url), d)):
    print(text)

That pulls the body text for all the articles linked from the telegraph, dailymail, bbc and independent pages. You will have to add more potential selectors for other sites where a single tag does not get you all the data you want, and adjust them if the html changes.
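For example, supporting a fifth site would just mean adding another entry to the mapping and extending the regex; the guardian selector below is only an assumption and would need checking against the real page html:

# hypothetical extra mapping entry, the selector is an assumption
d["theguardian."] = "div.content__article-body p"

# the regex inside parse_links_text would also need the new key:
r = re.compile("telegraph\.|bbc\.|dailymail\.|independent\.|theguardian\.")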

A snippet of the output:

^{pr2}$

Of course you could select all the text from the paragraphs with p = [p.text for p in soup.select("p")], but that would include a lot of data you don't want. If you are only interested in certain pages, you can also filter on whether a match is found in the css_d dict, something like:

for link in links:
    cont = requests.get(link).content
    soup = BeautifulSoup(cont, "lxml")
    css = r.search(link)
    if not css: 
       continue
    css = css.group()
    yield [p.text for p in soup.select(css)]
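For completeness, a minimal sketch of that filter wrapped back into the parse_links_text generator, using the same names as above:

def parse_links_text(links, css_d):
    # only yield text for links whose domain matches one of the css_d keys
    r = re.compile("telegraph\.|bbc\.|dailymail\.|independent\.")
    for link in links:
        cont = requests.get(link).content
        soup = BeautifulSoup(cont, "lxml")
        css = r.search(link)
        if not css:
            continue
        yield [p.text for p in soup.select(css_d[css.group()])]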

As discussed in the comments, lxml is a good tool when you need flexibility; to get the sections we can use the following code:

from urlparse import urljoin

import requests
from lxml.etree import fromstring, HTMLParser

url = "https://news.google.co.uk"



def get_sections(start, sections):
    '''Pull the links for each of the sections we pass in, i.e. World, Business etc.'''
    cont = requests.get(start).content
    xml = fromstring(cont, HTMLParser())
    # section names are in span tags with the class section-name,
    # the section link itself is on the parent a tag
    secs = xml.xpath("//span[@class='section-name']")
    for sec in secs:
        _sec = sec.text.rsplit(None, 1)[0].lower().rstrip(".")
        if _sec in sections:
            yield _sec, urljoin(url, sec.xpath(".//parent::a/@href")[0])


def get_section_links(sec_url):
    '''Get all links from an individual section.'''
    cont = requests.get(sec_url).content
    xml = fromstring(cont, HTMLParser())
    seen = set()
    for url in xml.xpath("//div[@class='section-stream-content']//a/@url"):
        if url not in seen:
            yield url
        seen.add(url)

# set of sections we want
s = {'business', 'world', "sports", "u.k"}

for sec, link in get_sections(url, s):
    for sec_link in (get_section_links(link)):
        print(sec, sec_link)

So if we run the code above we get all the links for each section; below is a very small snippet from each section, it actually returns a lot of links:

(u'world', 'http://www.theguardian.com/commentisfree/2016/mar/21/new-york-millionaires-who-want-taxes-raised')
(u'world', 'http://www.abc.net.au/news/2016-03-22/berg-turnbull%27s-only-real-option-was-bluff-and-bravado/7264350')
(u'world', 'http://www.swissinfo.ch/eng/reuters/australian-pm-takes-bold-gamble sets-in-motion-july-2-poll/42037074')
(u'world', 'https://www.washingtonpost.com/news/checkpoint/wp/2016/03/21/these-are-the-new-u-s-military-bases-near-the-south-china-sea-china-isnt-impressed/')
(u'world', 'http://www.reuters.com/article/southchinasea-china-usa-idUSL3N16T3BH')
(u'world', 'http://atimes.com/2016/03/philippine-election-question-marks-sow-panic-in-south-china-sea/')
(u'world', 'http://www.manilatimes.net/what-if-china-attacks-bases-used-by-america/251946/')
(u'world', 'http://www.arabnews.com/world/news/898816')
(u'world', 'http://macaudailytimes.com.mo/koreas-seoul-north-korea-fires-five-short-range-projectiles.html')
(u'world', 'http://gulftoday.ae/portal/cb0e2530-0769-411d-9622-2e991191656b.aspx')
(u'world', 'http://38north.org/2016/03/aabrahamian032116/')
(u'u.k', 'http://www.irishnews.com/news/2016/03/22/news/judge-tells-madonna-and-richie-to-settle-rocco-dispute-458929/')
(u'u.k', 'http://www.marilynstowe.co.uk/2016/03/21/judge-urges-amicable-resolution-in-madonna-dispute-over-son/')
(u'u.k', 'http://www.mercurynews.com/celebrities/ci_29666212/judge-tells-madonna-and-guy-ritchie-get-it')
(u'u.k', 'http://www.telegraph.co.uk/news/celebritynews/madonna/12199922/Madonnas-UK-court-fight-with-Guy-Ritchie-over-son-Rocco-can-end-judge-rules.html')
(u'u.k', 'http://www.pbo.co.uk/news/boaty-mcboatface-leading-public-vote-to-name-200m-polar-research-ship-28429')
(u'u.k', 'http://www.theguardian.com/environment/shortcuts/2016/mar/21/from-bell-end-boaty-mcboatface-trouble-letting-public-name-things')
(u'u.k', 'http://www.independent.co.uk/news/uk/boaty-mcboatface-debacle-shows-the-perils-of-crowdsourcing-opinion-from-hooty-mcowlface-to-mr-a6944801.html')
(u'u.k', 'http://www.sacbee.com/news/nation-world/world/article67322252.html')
(u'u.k', 'http://www.westerndailypress.co.uk/Jury-discharged-manslaughter-case-Thomas-Orchard/story-28964162-detail/story.html')
(u'u.k', 'http://www.exeterexpressandecho.co.uk/Breaking-Thomas-Orchard-manslaughter-trial-jury/story-28963859-detail/story.html')
(u'u.k', 'http://www.theguardian.com/uk-news/2016/mar/21/thomas-orchard-trial-jury-discharged-judge-halts-proceedings')
(u'u.k', 'http://www.ft.com/cms/s/0/0bf3e966-ef57-11e5-9f20-c3a047354386.html')
(u'u.k', 'http://www.theweek.co.uk/london-mayor-election-2016/62681/london-mayor-election-2016-whos-in-the-running-as-starting-gun')
(u'business', 'https://uk.finance.yahoo.com/news/companies-may-soon-stop-reporting-162707837.html')
(u'business', 'http://www.theweek.co.uk/70785/why-youre-about-to-stop-getting-quarterly-reports-on-your-investments')
(u'business', 'http://uk.reuters.com/article/uk-starwood-hotels-m-a-marriott-idUKKCN0WN142')
(u'business', 'http://www.reuters.com/article/us-global-oil-idUSKCN0WN00I')
(u'business', 'http://www.digitallook.com/news/commodities/commodities-oil-futures-recoup-previous-sessions-losses 1087119.html')
(u'business', 'http://news.sky.com/story/1664056/new-top-dog-at-pets-at-home-as-ceo-retires')
(u'business', 'http://money.aol.co.uk/2016/03/21/sky-tv-price-hike-shock/')
(u'business', 'http://www.nzherald.co.nz/world/news/article.cfm?c_id=2&objectid=11609694')
(u'business', 'http://www.dailymail.co.uk/sciencetech/article-3502838/The-Flying-Bum-ready-lift-World-s-largest-aircraft-Airlander-10-fitted-fins-engines-ahead-flight.html')
(u'business', 'http://www.business-standard.com/article/pti-stories/world-s-longest-aircraft-revealed-in-new-pictures-116032000569_1.html')
(u'sports', 'http://www.telegraph.co.uk/football/2016/03/21/gary-neville-consulted-roy-hodgson-on-england-delay/')
(u'sports', 'http://www.dailymail.co.uk/sport/football/article-3502767/Gary-Neville-leaving-Valencia-join-England-gritted-teeth-feels-like-La-Liga-club-giving-fans-chant-manager-now.html')
(u'sports', 'http://www.irishexaminer.com/sport/soccer/gary-neville-in-firing-line-as-valencia-lose-again-388634.html')
(u'sports', 'http://timesofindia.indiatimes.com/sports/tennis/top-stories/Male-tennis-players-should-earn-more-than-females-Djokovic/articleshow/51499959.cms')
(u'sports', 'http://www.sport24.co.za/soccer/livescoring?mid=23948674&st=football')
(u'sports', 'http://www.dispatch.com/content/stories/sports/2016/03/21/0321-serena-williams-rips-indian-wells-ceo.html')
(u'sports', 'http://www.bbc.co.uk/sport/football/35864765')
(u'sports', 'http://indianexpress.com/article/sports/football/joachim-loew-throws-max-kruse-out-of-germany-squad/')
(u'sports', 'http://www.si.com/planet-futbol/2016/03/21/max-kruse-germany-kicked-jogi-low')
(u'sports', 'http://www.dw.com/en/coach-joachim-l%C3%B6w-drops-max-kruse-from-german-national-team/a-19132035')
(u'sports', 'http://www.bbc.co.uk/sport/football/35865092')
(u'sports', 'http://news.sky.com/story/1664218')
(u'sports', 'http://www.theguardian.com/business/2016/mar/21/sports-direct-founder-mike-ashley-snubs-call-mps-parliamentary-select-committee')
(u'sports', 'http://www.mirror.co.uk/news/business/sports-direct-boss-mike-ashley-7604067')
(u'sports', 'http://www.independent.ie/sport/soccer/mike-ashley-says-he-is-wedded-to-newcastle-even-if-they-go-down-34558617.html')
(u'sports', 'http://www.heraldscotland.com/sport/14373924.Michael_Carrick_praises_performance_after_United_win_Manchester_derby/')
(u'sports', 'http://www.dorsetecho.co.uk/sport/national/14373773.Michael_Carrick_hails_vital_Manchester_derby_victory/')

If we instead return a set from get_section_links, we can pass it straight to the function that parses the text:

def get_section_links(sec_url):
    cont = requests.get(sec_url).content
    xml = fromstring(cont, HTMLParser())
    return set(xml.xpath("//div[@class='section-stream-content']//a/@url"))
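A quick usage sketch tying the pieces together, assuming the filtered, css based parse_links_text and the d mapping from earlier in the answer:

sections = {'business', 'world', "sports", "u.k"}

for sec, sec_page in get_sections(url, sections):
    links = get_section_links(sec_page)          # set of article urls for one section
    for paragraphs in parse_links_text(links, d):
        # paragraphs is the list of <p> texts for one article
        print(sec, " ".join(paragraphs)[:100])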

So, parsing with xpaths via lxml, for the handful of sites we already parse we can add a bit more logic to catch variations:

# map each page to its correct css selector to pull the main text
d = {"dailymail.": "//div[@itemprop='articleBody']//p",
     "telegraph.": "//div[@id='mainBodyArea']//p",
     "bbc.": "//div[@class='story-body']//p",
     "independent.": "//div[@class='text-wrapper']//p",
     "www.mirror.": "//*[@class='live-now-entry' or @class='lead-entry' or @itemprop='articleBody']//p"}


import logging

logger = logging.getLogger(__file__)
logging.basicConfig()
logger.setLevel(logging.DEBUG)


def parse_links_text(links, xpath_d):
    # use a regex to find out which site the link points to
    # so we can pull the appropriate xpath from the dict
    r = re.compile("telegraph\.|bbc\.|dailymail\.|independent\.|www\.mirror\.")
    for link in links:
        try:
            cont = requests.get(link).content
        except requests.exceptions.RequestException as e:
            logger.error(e)
            continue
        xml = fromstring(cont, HTMLParser())
        xpath = r.search(link)
        if xpath:
            p = "".join(filter(None, ("".join(p.xpath("normalize-space(.//text())"))
                                      for p in xml.xpath(xpath_d[xpath.group()]))))
            if p:
                yield p
        else:
            logger.debug("No match for {}".format(link))

Again, you will have to decide which sites you want to hit and find the correct xpaths to pull the main article text, but this should get you well on your way. When I have more time I will add some logic to run the requests asynchronously.
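One simple way to do that, sketched here only as an illustration, is a thread pool around requests.get (multiprocessing.dummy gives a thread based Pool in both Python 2 and 3); the fetch and fetch_all names are just for the example, and the parsing functions above would then work on the already downloaded content instead of calling requests themselves:

from multiprocessing.dummy import Pool  # thread pool, works with blocking requests calls

def fetch(link):
    # download one page, returning (link, content) or (link, None) on error
    try:
        return link, requests.get(link).content
    except requests.exceptions.RequestException as e:
        logger.error(e)
        return link, None

def fetch_all(links, workers=8):
    # download the pages concurrently and drop the failures
    pool = Pool(workers)
    try:
        return [(link, cont) for link, cont in pool.map(fetch, links) if cont is not None]
    finally:
        pool.close()
        pool.join()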
