Web crawler to get links from a news website

import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"
br = mechanize.Browser()
htmltext = br.open(url).read()

articletext = ""
soup = BeautifulSoup(htmltext)
for tag in soup.findAll('li', attrs={"data-section":"Business"}):
    articletext += tag.contents[0]
print articletext

3 Answers

User

Answer 1 · edited 2024-09-25 16:29:13

I believe you need to access the text inside each list item, like this:

for tag in soup.findAll('li', attrs={"data-section":"Business"}):
    articletext += tag.string

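Edit: General comments on getting links from a page

Probably the easiest data type for collecting a bunch of links and retrieving them later is a dictionary.

To get links from a page with BeautifulSoup, you can do the following: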

link_dictionary = {}
for link in soup.findAll('a'):
    link_dictionary[link.string] = link.get('href')

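This gives you a dictionary named link_dictionary, where each key is a string holding the text between the <a> and </a> tags, and each value is the value of the href attribute.

How to combine this with your earlier attempt

Now, if we combine this with the problem you had before, we can try the following: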

link_dictionary = {}
for tag in soup.findAll('li', attrs={"data-section":"Business"}):
    for link in tag.findAll('a'):
        link_dictionary[link.string] = link.get('href')
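
One thing to watch for: in BeautifulSoup, link.string is None when the <a> tag contains nested tags, so several links can end up sharing the key None. Calling link.get_text() instead avoids that. The following is a minimal sketch of the same dictionary-building idea on a small inline page, so it can be run without network access. It is written for Python 3 and deliberately uses the standard library's html.parser instead of BeautifulSoup. The class name SectionLinkCollector, the sample HTML, and the example.com URLs are all made up for illustration.

```python
from html.parser import HTMLParser


class SectionLinkCollector(HTMLParser):
    """Collect {link text: href} for <a> tags inside <li data-section=...>."""

    def __init__(self, section):
        super().__init__()
        self.section = section
        self.links = {}
        self._li_depth = 0   # > 0 while inside a matching <li>
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            # Enter a matching <li>, or track nesting once already inside one.
            if self._li_depth or attrs.get("data-section") == self.section:
                self._li_depth += 1
        elif tag == "a" and self._li_depth and attrs.get("href"):
            self._href = attrs["href"]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._li_depth:
            self._li_depth -= 1
        elif tag == "a" and self._href is not None:
            # Join all text fragments (including nested tags) and tidy spaces.
            key = " ".join("".join(self._text).split())
            self.links[key] = self._href
            self._href = None


# Hypothetical sample page; the real archive markup may differ.
SAMPLE_HTML = """
<ul>
  <li data-section="Business"><a href="http://example.com/b1">Markets <b>rally</b></a></li>
  <li data-section="Sport"><a href="http://example.com/s1">Cup final</a></li>
  <li data-section="Business"><a href="http://example.com/b2">Rupee steady</a></li>
</ul>
"""

collector = SectionLinkCollector("Business")
collector.feed(SAMPLE_HTML)
link_dictionary = collector.links
```

Note that a dictionary keyed by link text keeps only the last href when two links share the same text; a list of (text, href) pairs would keep both.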

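If this does not make sense, or you have more questions, experiment first and try to come up with a solution yourself before asking a new, clearer question.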

User

Answer 2 · edited 2024-09-25 16:29:13

You are using link_dictionary. If you are not using it for reading purposes, try the following code:

import re
import mechanize
from bs4 import BeautifulSoup

# url is the archive page URL from the question
br = mechanize.Browser()
htmltext = br.open(url).read()
soup = BeautifulSoup(htmltext)  # parse the archive page before searching it

for tag_li in soup.findAll('li', attrs={"data-section":"Op-Ed"}):
    for link in tag_li.findAll('a'):
        urlnew = link.get('href')
        # fetch and parse the linked article
        htmltextnew = br.open(urlnew).read()
        soupnew = BeautifulSoup(htmltextnew)
        articletext = ""
        for tag in soupnew.findAll('p'):
            articletext += tag.text
        # collapse runs of whitespace into single spaces
        print re.sub(r'\s+', ' ', articletext)
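
The loop above opens each href exactly as it appears in the page. If any of those hrefs are relative (I have not checked whether the archive page uses relative links, so treat that as an assumption), they need to be resolved against the page URL first. A minimal Python 3 sketch using urljoin from the standard library follows; the article paths are made up for illustration.

```python
from urllib.parse import urljoin

base_url = "http://www.thehindu.com/archive/web/2010/06/19/"

# Hypothetical href values as they might appear in <a> tags.
hrefs = [
    "/business/article123.ece",               # root-relative
    "http://www.thehindu.com/opinion/x.ece",  # already absolute
    "today.html",                             # relative to the current page
]

# urljoin leaves absolute URLs untouched and resolves the rest against base_url.
absolute_urls = [urljoin(base_url, href) for href in hrefs]
```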

Note: re is Python's regular expression module, so the script needs to import re.
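
To show what the re.sub call does on its own, here is a minimal Python 3 sketch. The helper name normalize_whitespace and the sample string are made up for illustration, and the trailing strip() is my addition. The re.M flag is not needed for this pattern, because it only changes how ^ and $ match.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of spaces, tabs and newlines into one space."""
    # A raw string keeps the backslash in \s from being treated as a string escape.
    return re.sub(r"\s+", " ", text).strip()


sample = "First  paragraph.\n\n\tSecond\nparagraph.  "
cleaned = normalize_whitespace(sample)
```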

User

Answer 3 · edited 2024-09-25 16:29:13

You may want to use the powerful XPath query language together with the faster lxml module. It is as simple as this:

import urllib2
from lxml import etree

url = 'http://www.thehindu.com/archive/web/2010/06/19/'
html = etree.HTML(urllib2.urlopen(url).read())

for link in html.xpath("//li[@data-section='Business']/a"):
    print '{} ({})'.format(link.text, link.attrib['href'])

Update for @data-section='Chennai'

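
A minimal sketch of how the query might be adapted to another section, assuming the only change needed is the value in the XPath predicate. It is written for Python 3 and runs offline on a small well-formed sample, so it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and cannot parse messy real-world HTML. With lxml, the same predicate would be passed to html.xpath(...) as in the code above. The helper section_links, the sample markup, and the example.com URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample; the real archive page may differ.
SAMPLE = """<ul>
  <li data-section="Business"><a href="http://example.com/b1">Markets</a></li>
  <li data-section="Chennai"><a href="http://example.com/c1">Metro update</a></li>
  <li data-section="Chennai"><a href="http://example.com/c2">Water supply</a></li>
</ul>"""


def section_links(root, section):
    """Return (text, href) pairs for <a> tags directly inside matching <li> tags."""
    # Assumes the section name contains no single quotes.
    query = ".//li[@data-section='{}']/a".format(section)
    return [(a.text, a.get("href")) for a in root.findall(query)]


root = ET.fromstring(SAMPLE)
chennai_links = section_links(root, "Chennai")
```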
