How do I search HTML for links and print them with Python?

Posted 2024-09-30 20:28:01


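I'm trying to write Python code that searches HTML for an image link. The code I need to find is - . I need to extract the http://www.darlighting.co.uk/621-large_default/empire-double-wall-bracket-polished-chrome.jpg part, regardless of what the link actually says. Is there a way to do this, or should I be looking at a different approach? I have access to the standard Python modules and BeautifulSoup.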


3 answers
import http.client
from lxml import html

# CONNECTION
host = "www.darlighting.co.uk"  # host name only, without the scheme
path = "/"
conn = http.client.HTTPConnection(host)
conn.putrequest("GET", path)
# Set your request headers here
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)", "Cache-Control": "no-cache"}
for k, v in headers.items():
    conn.putheader(k, v)
conn.endheaders()
res = conn.getresponse()

source = None
if res.status == 200:
    source = res.read()
else:
    print(res.status)
    print(res.getheaders())

# EXTRACT (skipped when the request failed)
if source is not None:
    dochtml = html.fromstring(source)
    # iterlinks() yields (element, attribute, link, position) tuples
    for elem, att, link, pos in dochtml.iterlinks():
        if att == 'src':  # or 'href'
            print('elem: {0} || pos: {1} || attr: {2} || link: {3}'.format(elem, pos, att, link))

You can try lxml (http://lxml.de/) together with XPath (http://en.wikipedia.org/wiki/XPath).

For example, to find the images in an HTML page, you can do this:

import lxml.html
import requests

html = requests.get('http://www.google.com/').text
doc = lxml.html.document_fromstring(html)
images = doc.xpath('//img')  # select the elements you want; here, every image
if images:
    print(images[0].get('src'))  # src of the first img
else:
    print("Images not found")

I hope this helps.

Update: I added the ":" that was missing after else.
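
Where lxml is not installed, the same idea can be sketched in Python 3 with only the standard library's html.parser module. This is a minimal illustration rather than part of the answer above: the class name ImgSrcCollector and the sample markup are assumptions (the question's actual tag was not preserved), and only the image URL is taken from the question.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


# Assumed sample markup; only the URL comes from the question.
sample_html = (
    '<div><a href="/product"><img alt="lamp" src="http://www.darlighting.co.uk/'
    '621-large_default/empire-double-wall-bracket-polished-chrome.jpg"></a>'
    '<img alt="no source"></div>'
)

collector = ImgSrcCollector()
collector.feed(sample_html)
for src in collector.sources:
    print(src)
```

Because the collector looks only at the tag name and its src attribute, it does not depend on the link text or on what the URL happens to say. Images without a src are skipped.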

The Beautiful Soup documentation has a good "Quick Start" section: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start

from bs4 import BeautifulSoup as Soup
from urllib.request import urlopen

url = "http://www.darlighting.co.uk/"
html = urlopen(url).read()
soup = Soup(html, "html.parser")  # name the parser explicitly

# find the image tag with a specific source
the_image_tag = soup.find("img", src='/images/dhl_logo.png')
print(type(the_image_tag), the_image_tag)
# >>> <class 'bs4.element.Tag'> <img src="/images/dhl_logo.png"/>

# find all image tags
img_tags = soup.find_all("img")
for img_tag in img_tags:
    print(img_tag.get('src'))  # .get avoids a KeyError when src is missing
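
The src values collected this way may be relative paths such as /images/dhl_logo.png, whereas the question asks for a full URL. The Python 3 sketch below shows one way they could be resolved against the page URL with urllib.parse.urljoin and filtered by file extension. The helper name absolute_image_links, the extension list, and the sample src values are assumptions for illustration, not something the answer specifies.

```python
from urllib.parse import urljoin, urlparse

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")  # assumed list


def absolute_image_links(page_url, srcs):
    """Resolve each src against page_url and keep those that look like images."""
    links = []
    for src in srcs:
        full = urljoin(page_url, src)  # absolute URLs pass through unchanged
        if urlparse(full).path.lower().endswith(IMAGE_EXTENSIONS):
            links.append(full)
    return links


page_url = "http://www.darlighting.co.uk/"
# Assumed sample values standing in for the src attributes found on the page.
sample_srcs = [
    "/images/dhl_logo.png",
    "http://www.darlighting.co.uk/621-large_default/empire-double-wall-bracket-polished-chrome.jpg",
    "/js/tracking.js",
]
for link in absolute_image_links(page_url, sample_srcs):
    print(link)
```

Checking the extension on the parsed path rather than on the raw string means a query string after the file name does not hide the extension.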
