Python web scraper: same link with different texts, counting

Published 2024-10-02 18:15:29


So I made a web scraper with Python and a few of its libraries. It goes to a given site and grabs all the links and their texts from that site. I've filtered the results so that only the external links on that site are printed.
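The external-link check can be sketched with just the standard library (a Python 3 sketch, unlike the Python 2 code below; this simplified version compares hosts only, whereas the code below uses PublicSuffixList to work with registered domains):

```python
from urllib.parse import urljoin, urlsplit

base = "http://www.ananda-pur.de/23.html"
base_host = urlsplit(base).netloc

def is_external(href):
    # Resolve relative hrefs against the base page, then compare hosts.
    return urlsplit(urljoin(base, href)).netloc != base_host

print(is_external("http://www.kriteachings.org/"))  # True: different host
print(is_external("/impressum.html"))               # False: same host as base
```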

Here is the code:

import urllib
import re
import mechanize
from bs4 import BeautifulSoup
import urlparse
import cookielib
from urlparse import urlsplit
from publicsuffix import PublicSuffixList

link = "http://www.ananda-pur.de/23.html"

newesturlDict = {}
baseAdrInsArray = []



br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.set_handle_redirect(True)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
page = br.open(link, timeout=10)


for linkins in br.links():

    newesturl = urlparse.urljoin(linkins.base_url, linkins.url)

    linkTxt = linkins.text
    baseAdrIns = linkins.base_url

    if baseAdrIns not in baseAdrInsArray:
        baseAdrInsArray.append(baseAdrIns)

    netLocation = urlsplit(baseAdrIns)
    psl = PublicSuffixList()
    publicAddress = psl.get_public_suffix(netLocation.netloc)

    if publicAddress not in newesturl:

        if newesturl not in newesturlDict:
            newesturlDict[newesturl,linkTxt] = 1
        if newesturl in newesturlDict:
            newesturlDict[newesturl,linkTxt] += 1

newesturlCount = sorted(newesturlDict.items(),key=lambda(k,v):(v,k),reverse=True)
for newesturlC in newesturlCount:
    print baseAdrInsArray[0]," - ",newesturlC[0],"- count: ", newesturlC[1]

The result looks like this:

(output omitted)

My problem is the identical links that have different texts. According to the printed example, the given site has 4 links to http://www.kriteachings.org/, but as you can see, each of those 4 links has a different text: the first is http://www.sat-nam-rasayan.de, the second is http://www.kriteachings.org, the third is http://www.gurudevsnr.com, and the fourth is http://www.3ho.de.

I'd like the printout to show how many times each link appears on the given page, but when the same link has different texts, those texts should simply be appended to one another under that one link. To make the example concrete, I'd like to print something like this:

http://www.ananda-pur.de/23.html  -  http://www.yogibhajan.com/ - http://www.yogibhajan.com - count:  1
http://www.ananda-pur.de/23.html  -  http://www.kundalini-yoga-zentrum-berlin.de - http://www.kundalini-yoga-zentrum-berlin.de - count:  1
http://www.ananda-pur.de/23.html  -  http://www.kriteachings.org/ - http://www.sat-nam-rasayan.de, http://www.kriteachings.org, http://www.gurudevsnr.com, http://www.3ho.de  - count:  4

Explanation:

(the first link is the given page, the second is the found link, the third item is the actual text of that found link, and the 4th item is how many times that link appears on the given site)
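The desired grouping can be sketched with a plain dict keyed by URL. This Python 3 sketch uses the (url, text) pairs from the example above as hard-coded sample data:

```python
# Sample (url, text) pairs, taken from the desired output in the question.
found_links = [
    ("http://www.yogibhajan.com/", "http://www.yogibhajan.com"),
    ("http://www.kundalini-yoga-zentrum-berlin.de", "http://www.kundalini-yoga-zentrum-berlin.de"),
    ("http://www.kriteachings.org/", "http://www.sat-nam-rasayan.de"),
    ("http://www.kriteachings.org/", "http://www.kriteachings.org"),
    ("http://www.kriteachings.org/", "http://www.gurudevsnr.com"),
    ("http://www.kriteachings.org/", "http://www.3ho.de"),
]

grouped = {}
for url, text in found_links:
    # setdefault creates the entry on first sight, reuses it afterwards,
    # so duplicate URLs accumulate into one entry.
    entry = grouped.setdefault(url, {"count": 0, "texts": []})
    entry["count"] += 1
    entry["texts"].append(text)

base = "http://www.ananda-pur.de/23.html"
for url, data in grouped.items():
    print("%s - %s - %s - count: %d"
          % (base, url, ", ".join(data["texts"]), data["count"]))
```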

My main problem is that I don't know how to compare, sort, or tell the program that this is the same link and that it should append the different texts.

Is something like this possible without too much code? I'm a Python newbie, so I'm a bit lost.

Any help or advice is welcome.


1 Answer

Collect the links into a dictionary, gathering the link texts and keeping the counts:

import cookielib

import mechanize


base_url = "http://www.ananda-pur.de/23.html"

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.set_handle_redirect(True)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent',
                  'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
page = br.open(base_url, timeout=10)

links = {}
for link in br.links():
    if link.url not in links:
        # first time we see this URL: start the count and the text list
        links[link.url] = {'count': 1, 'texts': [link.text]}
    else:
        # same URL again: bump the count and collect the extra text
        links[link.url]['count'] += 1
        links[link.url]['texts'].append(link.text)

# printing
for link, data in links.iteritems():
    print "%s - %s - %s - %d" % (base_url, link, ",".join(data['texts']), data['count'])

Prints:

(output omitted)
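Note that mechanize and cookielib are Python 2 era libraries. The same counting idea can be sketched in Python 3 with only the standard library; the HTML snippet and host names below are made up for illustration:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects (href, text) pairs from anchor tags."""
    def __init__(self):
        super().__init__()
        self._href = None   # href of the <a> tag we are currently inside
        self._text = []     # text chunks seen inside that tag
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []
    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)
    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

page = """<a href="http://a.example/">first</a>
<a href="http://a.example/">second</a>
<a href="http://b.example/">only</a>"""

parser = LinkCollector()
parser.feed(page)

# Same grouping as in the answer: one entry per URL, texts appended.
counts = {}
for href, text in parser.links:
    entry = counts.setdefault(href, {"count": 0, "texts": []})
    entry["count"] += 1
    entry["texts"].append(text)

for href, data in counts.items():
    print(href, "-", ", ".join(data["texts"]), "- count:", data["count"])
```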
