Using regex to get information from an article

import urllib2
from bs4 import BeautifulSoup
import re
from time import *

url = "http://www.reuters.com/article/2014/02/26/us-afghanistan-usa-militants-idUSBREA1O1SV20140226"

# Parse HTML of article, aka making soup
soup = BeautifulSoup(urllib2.urlopen(url).read())

# txt is presumably a file object opened elsewhere (not shown in the question)

# Write the article author to the file
regex = '<p class="byline">(.+?)</p>'
pattern = re.compile(regex)
byline = re.findall(pattern, str(soup))
txt.write("Author: " + str(byline) + '\n' + '\n')

# Write the article date to the file
regex = '<span class="timestamp">(.+?)</span>'
pattern = re.compile(regex)
byline = re.findall(pattern, str(soup))
txt.write("Date: " + str(byline) + '\n' + '\n')

1 Answer

User

#1 · Posted on 2024-06-18 13:07:06

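You can use BeautifulSoup to get exactly what you need, in almost the same way you describe, just without the regex. Since you already know the tag name and class of the elements you are interested in, you can search for them directly with bs4's find: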

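# Make some soup (Python 2, as in the question)
soup = BeautifulSoup(urllib2.urlopen(url).read())

# Extract the byline and date text from their respective tags
try:
    byline = soup.find("p", {'class': 'byline'}).text
    date = soup.find("span", {'class': 'timestamp'}).text
except AttributeError:
    # find() returns None when no tag matches, so .text raises AttributeError
    print('byline or timestamp missing!')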

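Update: If you wrap the whole thing in a try/except block, you can handle the case where the byline is missing and define some alternative action to take instead.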
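As one possible form of that alternative action, here is a minimal sketch that falls back to a default string when a tag is absent. The helper name extract_field, the "unknown" default, and the sample HTML are all made up for illustration; it is written for Python 3 with the built-in "html.parser" and runs against an inline string so it needs no network access.

```python
from bs4 import BeautifulSoup


def extract_field(soup, tag, css_class, default="unknown"):
    """Return the stripped text of the first matching tag, or `default`.

    Hypothetical helper for illustration; not part of bs4 itself.
    """
    node = soup.find(tag, {"class": css_class})
    if node is None:
        # No matching tag: use the fallback instead of raising
        return default
    return node.get_text(strip=True)


# Made-up sample markup: it has a timestamp but deliberately no byline
sample_html = """
<html><body>
<span class="timestamp">Wed Feb 26, 2014</span>
<p>Article body without a byline.</p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")
byline = extract_field(soup, "p", "byline")
date = extract_field(soup, "span", "timestamp")

print("Author: " + byline)
print("Date: " + date)
```

Checking for None explicitly, rather than catching an exception around both lookups, means a missing byline does not stop the date from being extracted.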
