Python, BeautifulSoup or lxml: parsing image URLs from HTML using CSS tags

<div class="bpBoth"><a name="photo2"></a><img src="http://inapcache.boston.com/universal/site_graphics/blogs/bigpicture/shanghaifire_11_22/s02_25947507.jpg" class="bpImage" style="height:1393px;width:990px" /><br/><div onclick="this.style.display='none'" class="noimghide" style="margin-top:-1393px;height:1393px;width:990px"></div><div class="bpCaption"><div class="photoNum"><a href="#photo2">2</a></div>In this photo released by China's Xinhua news agency, spectators watch an apartment building on fire in the downtown area of Shanghai on Monday Nov. 15, 2010. (AP Photo/Xinhua) <a href="#photo2">#</a><div class="cf"></div></div></div>

#!/usr/bin/python
# RSS Feed Parser for the Big Picture Blog

# Import applicable libraries
import feedparser

# Import Feed for Parsing
d = feedparser.parse("http://feeds.boston.com/boston/bigpicture/index")

# Print feed name
print(d['feed']['title'])

# Determine number of posts and set range maximum
posts = len(d['entries'])

# Collect Post URLs
pointer = 0
while pointer < posts:
    e = d.entries[pointer]
    print(e.link)
    pointer = pointer + 1

3 answers

User

Answer 1 · edited 2024-06-30 07:40:52

Searching for tags with pyparsing is fairly intuitive:

from pyparsing import makeHTMLTags, withAttribute

imgTag,notused = makeHTMLTags('img')

# only retrieve <img> tags with class='bpImage'
imgTag.setParseAction(withAttribute(**{'class':'bpImage'}))

# html is assumed to hold the page source as a string
for img in imgTag.searchString(html):
    print(img.src)

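User

Answer 2 · edited 2024-06-30 07:40:52

The code you posted looks for all a elements with the bpImage class. But in your example the bpImage class is on the img element, not on the a element. You just need: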

soup.find("img", { "class" : "bpImage" })
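To make the one-liner above concrete, here is a minimal sketch that runs against a trimmed copy of the markup from the question. It assumes the `bs4` package (BeautifulSoup 4) is installed; the helper names `image_urls` and `image_urls_css` and the `SAMPLE_HTML` constant are made up for this illustration. Note that `find` returns only the first match, so the sketch uses `find_all` to collect every image.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (caption text omitted).
SAMPLE_HTML = """
<div class="bpBoth"><a name="photo2"></a>
<img src="http://inapcache.boston.com/universal/site_graphics/blogs/bigpicture/shanghaifire_11_22/s02_25947507.jpg" class="bpImage" style="height:1393px;width:990px" />
<div class="bpCaption"><div class="photoNum"><a href="#photo2">2</a></div></div></div>
"""


def image_urls(html):
    """Return the src of every <img> carrying the bpImage class (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns every match; find would return only the first one.
    imgs = soup.find_all("img", class_="bpImage")
    return [img["src"] for img in imgs if img.has_attr("src")]


def image_urls_css(html):
    """Same lookup written as a CSS selector (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    return [img["src"] for img in soup.select("img.bpImage") if img.has_attr("src")]


for url in image_urls(SAMPLE_HTML):
    print(url)
```

Both helpers should return the same list here; the CSS selector form is closer to what the question title asks for.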

User

Answer 3 · edited 2024-06-30 07:40:52

Using lxml, you can do the following:

import feedparser
import lxml.html as lh
import urllib.request  # Python 3 replacement for Python 2's urllib2

# Import Feed for Parsing
d = feedparser.parse("http://feeds.boston.com/boston/bigpicture/index")

# Print feed name
print(d['feed']['title'])

# Determine number of posts and set range maximum
posts = len(d['entries'])

# Collect Post URLs and print the src of every bpImage image in each post
for post in d['entries']:
    link = post['link']
    print('Parsing {0}'.format(link))
    doc = lh.parse(urllib.request.urlopen(link))
    imgs = doc.xpath('//img[@class="bpImage"]')
    for img in imgs:
        print(img.attrib['src'])
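One caveat about the XPath `//img[@class="bpImage"]`: it compares the whole class attribute, so it would miss an image whose class attribute lists more than one class. The sketch below shows a token-based alternative on an inline snippet, without any network access. The `example.com` URLs and the extra `wide` and `logo` classes are invented for the illustration and do not come from the Big Picture pages.

```python
import lxml.html as lh

# Hypothetical markup: the second image carries two classes.
SAMPLE_HTML = """
<div class="bpBoth">
  <img src="http://example.com/a.jpg" class="bpImage" />
  <img src="http://example.com/b.jpg" class="bpImage wide" />
  <img src="http://example.com/logo.png" class="logo" />
</div>
"""

doc = lh.fromstring(SAMPLE_HTML)

# Exact comparison: matches only when the attribute is exactly "bpImage".
exact = doc.xpath('//img[@class="bpImage"]/@src')

# Token comparison: pad the normalized attribute with spaces, then look
# for " bpImage " so that "bpImage wide" matches as well.
token = doc.xpath(
    '//img[contains(concat(" ", normalize-space(@class), " "), " bpImage ")]/@src'
)

print(exact)
print(token)
```

For the markup shown in the question the two expressions behave the same, since the image there has a single class.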
