Web scraping to build a news database

import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"

br = mechanize.Browser()
htmltext = br.open(url).read()

articletext = ""
soup = BeautifulSoup(htmltext)
for tag in soup.findAll('li', attrs={"data-section":"Business"}):
    articletext += tag.contents[0]
print(articletext)

2 answers

User

#1 · edited 2024-09-25 16:32:14

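I suggest you check out […]. Try their tutorial with your own parameters, then experiment with it. They have a more developed web-crawling infrastructure than the mechanize module.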

User

#2 · edited 2024-09-25 16:32:14

Try the following code:

import re

import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"

br = mechanize.Browser()
htmltext = br.open(url).read()

# Parse the archive listing page.
soup = BeautifulSoup(htmltext, "html.parser")

# Follow every link listed under the Op-Ed section.
for tag_li in soup.findAll('li', attrs={"data-section": "Op-Ed"}):
    for link in tag_li.findAll('a'):
        urlnew = link.get('href')
        if not urlnew:
            continue  # skip anchors without an href
        htmltextnew = br.open(urlnew).read()
        soupnew = BeautifulSoup(htmltextnew, "html.parser")
        # Join the text of all <p> tags into one article body.
        articletext = ""
        for tag in soupnew.findAll('p'):
            articletext += tag.text
        # Collapse runs of whitespace into single spaces before printing.
        print(re.sub(r'\s+', ' ', articletext))

br.close()

The code calls re.sub, so the re module needs to be imported.
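
The code above only prints each article. Since the goal is a news database, here is a minimal sketch of one way the cleaned text could be stored, using only the standard library (re and sqlite3). The table name articles, its columns, and the helper names normalize_whitespace, open_database and save_article are assumptions made for illustration; they are not part of the original answer.

```python
import re
import sqlite3


def normalize_whitespace(text):
    """Collapse each run of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def open_database(path=":memory:"):
    """Open a SQLite database and create the (hypothetical) articles table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT PRIMARY KEY, section TEXT, body TEXT)"
    )
    return conn


def save_article(conn, url, section, raw_text):
    """Insert one article keyed by URL, overwriting any earlier copy."""
    conn.execute(
        "INSERT OR REPLACE INTO articles (url, section, body) VALUES (?, ?, ?)",
        (url, section, normalize_whitespace(raw_text)),
    )
    conn.commit()


if __name__ == "__main__":
    conn = open_database()  # in-memory database for the demo
    save_article(conn, "http://example.com/a1", "Op-Ed",
                 "First  paragraph.\n\nSecond one.")
    print(conn.execute("SELECT section, body FROM articles").fetchall())
```

In the scraping loop, save_article(conn, urlnew, "Op-Ed", articletext) could take the place of the print call. Keying the table on the URL means re-running the scraper overwrites rows instead of duplicating them.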
