How do I search for links in HTML and print them using Python?

User

Answer 1 · edited 2024-09-30 20:28:01

import http.client  # Python 3 name of the old httplib module
import sys
from lxml import html

# CONNECTION
url = "www.darlighting.co.uk"  # host name only, without the scheme
path = "/"
conn = http.client.HTTPConnection(url)
conn.putrequest("GET", path)
# Add your request headers here
header = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)", "Cache-Control": "no-cache"}
for k, v in header.items():
    conn.putheader(k, v)
conn.endheaders()
res = conn.getresponse()

if res.status == 200:
    source = res.read()
else:
    # Nothing to parse: report the status and headers, then stop
    print(res.status)
    print(res.getheaders())
    sys.exit(1)

# EXTRACT
dochtml = html.fromstring(source)
# iterlinks() yields (element, attribute, link, position) for every link in the document
for elem, att, link, pos in dochtml.iterlinks():
    if att == 'src':  # use 'href' instead to get hyperlinks
        print('elem: {0} || pos: {1} || attr: {2} || link: {3}'.format(elem, pos, att, link))
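
The links that iterlinks() yields are as they are written in the page, so many of them are relative (for example /images/dhl_logo.png). A minimal Python 3 sketch of turning them into absolute URLs with the standard library's urljoin follows. The base URL is assumed to be the page that was fetched, and both sample_links and the helper name to_absolute are hypothetical, made up for illustration.

```python
from urllib.parse import urljoin

# Assumed base URL: the page the links were extracted from
BASE_URL = "http://www.darlighting.co.uk/"


def to_absolute(base_url, links):
    """Resolve each link against base_url; already-absolute links pass through unchanged."""
    return [urljoin(base_url, link) for link in links]


# Hypothetical sample of links as they might appear in the page source
sample_links = ["/images/dhl_logo.png", "css/site.css", "http://example.com/a.js"]

absolute_links = to_absolute(BASE_URL, sample_links)
for link in absolute_links:
    print(link)
```

In the loop above you could call urljoin(BASE_URL, link) on each link before printing it.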

User

Answer 2 · edited 2024-09-30 20:28:01

You could try lxml (http://lxml.de/) together with XPath (http://en.wikipedia.org/wiki/XPath).

For example, to find the images in an HTML page you can do this:

import lxml.html
import requests

html = requests.get('http://www.google.com/').text
doc = lxml.html.document_fromstring(html)
images = doc.xpath('//img')  # select every <img> element; change the XPath to match other tags
if images:
    print(images[0].get('src'))  # src attribute of the first image
else:
    print("Images not found")

I hope this helps.

Update: I added the ":" that was missing after else.
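
Since the question asks about links rather than images, the same XPath approach can select href attributes directly. This is a minimal Python 3 sketch, assuming lxml is installed; SAMPLE_HTML is a hypothetical inline page used so that the example runs without network access.

```python
import lxml.html

# Hypothetical inline page, so the sketch does not depend on a live site
SAMPLE_HTML = """
<html><body>
  <a href="/products">Products</a>
  <a name="top">Anchor without href</a>
  <a href="http://example.com/contact">Contact</a>
  <img src="/images/logo.png">
</body></html>
"""

doc = lxml.html.document_fromstring(SAMPLE_HTML)

# //a/@href selects the href attribute of every <a> element that has one
hrefs = doc.xpath('//a/@href')
for href in hrefs:
    print(href)
```

The <a> element without an href is skipped, because the XPath matches the attribute and not the element.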

User

Answer 3 · edited 2024-09-30 20:28:01

The Beautiful Soup documentation has a good "Quick Start" section: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start

from bs4 import BeautifulSoup as Soup
from urllib.request import urlopen  # Python 3 location of urlopen

url = "http://www.darlighting.co.uk/"
html = urlopen(url).read()
soup = Soup(html, "html.parser")  # name the parser explicitly to avoid a bs4 warning

# find image tag with specific source
the_image_tag = soup.find("img", src='/images/dhl_logo.png')
print(type(the_image_tag), the_image_tag)
# >>> <class 'bs4.element.Tag'> <img src="/images/dhl_logo.png"/>

# find all image tags
img_tags = soup.find_all("img")
for img_tag in img_tags:
    print(img_tag.get('src'))  # .get() returns None instead of raising if src is missing
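
The same find_all call also works for hyperlinks, which is what the question asks about. This is a minimal Python 3 sketch, assuming bs4 is installed; SAMPLE_HTML is a hypothetical inline page rather than the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical inline page, so the sketch runs without network access
SAMPLE_HTML = """
<html><body>
  <a href="/products">Products</a>
  <a name="top">Anchor without href</a>
  <a href="http://example.com/contact">Contact</a>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute
links = [a["href"] for a in soup.find_all("a", href=True)]
for link in links:
    print(link)
```

Filtering with href=True avoids a KeyError on anchors that have no href.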
