How to get a specific string from HTML using Python

Published 2024-09-26 21:48:17


import re
import urllib
web = "http://pic.haibao.com/piclist/2271"
page = urllib.urlopen(web)
html = page.read()
pic_pat =r'src=\("http:\/\/.*?.jpg)'
impat = re.compile(pic_pat)
keylist = impat.findall(html)

Part of the HTML I get:

 function getList(screen_index) {
        var boxes = [];
        var screen2 = "<li class=\"piclistli\"><div class=\"pic200\"><a href=\"http:\/\/pic.haibao.com\/pic\/12027963.htm\"><img width=\"310\" height=\"465\" src=\"http:\/\/cdn2.hbimg.cn\/store\/tuku\/310_999\/piccommon\/1218\/12188\/D5259EFE8B9999E8FA968CBD38.jpg\" alt=\"\u200b1\u6708\u7684\u7ebd\u7ea6\u4f9d\u7136\u51b7\u51bd\uff0c\u4f46\u578b\u4eba\u4eec\u5e76\u6ca1\u6709\u5929\u6c14\u7684\u6076\u52a3\u800c\u968f\u4fbf\u5957\u4ef6\u8863\u670d\u5c31\u51fa\u95e8\u3002\u5373\u4fbf\u662f\u904d\u5730\u79ef\u96ea\uff0c\u8fd8\u662f\u8981\u7a7f\u4e0a\u6709\u578b\u7684\u5927\u8863\u548c\u9774\u5b50\uff1b\u5929\u6c14\u7070\u6697\u65f6\uff0c\u8fd8\u662f\u8981\u7a7f\u4e0a\u9753\u4e3d\u7684\u8272\u5f69\u6210\u4e3a\u8857\u5934\u660e\u4eae\u7684\u98ce\u666f\u3002\u62a5\u53cb\u4eec\u9a6c\u4e0a\u6765\u7ffb\u7ffb\u770b\u5427\uff01\" \/><\/a><\/div>

I want all the strings that look like this:

http:\/\/cdn2.hbimg.cn\/store\/tuku\/310_999\/piccommon\/1218\/12188\/D5259EFE8B9999E8FA968CBD38.jpg

So I used pic_pat =r'src=\("http:\/\/.*?.jpg)', but the strings I get look like this:

src="http://cdn4.hbimg.cn/store/tuku/310_999/piccommon/1219/12191/D52582CA92C7F0F9E6FF938534.jpg"

How can I get

src=\"http:\/\/cdn2.hbimg.cn\/store\/tuku\/310_999\/piccommon\/1218\/12188\/D5259EFE8B9999E8FA968CBD38.jpg\"

as a string from the HTML?


2 answers

Try BeautifulSoup4:

from bs4 import BeautifulSoup as bs
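html_doc = bs(html, 'html.parser')  # html is the page source read in the question; name the parser explicitly
img_list = html_doc.find_all('img')
for image in img_list:
    print(image.get('src'))  # parenthesized form works in Python 2 and 3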
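
One thing to watch for with a tag-based approach: the markup shown in the question sits inside a JavaScript string, and an HTML parser generally treats the contents of a script element as plain text, so those escaped img tags may not show up in find_all('img'). Here is a minimal sketch of that behaviour using Python 3's standard-library html.parser in place of BeautifulSoup (a swapped-in technique, chosen so it runs without extra installs). The sample HTML, the example.com URLs, and the class name ImgSrcCollector are all made up for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collects the src attribute of every real <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


# Made-up sample: one real <img> tag, plus one <img> that exists only as
# escaped text inside a JavaScript string, as in the question.
sample = (
    '<html><body>'
    '<img src="http://example.com/a.jpg">'
    '<script>var screen2 = "<img src=\\"http:\\/\\/example.com\\/b.jpg\\" \\/>";</script>'
    '</body></html>'
)

collector = ImgSrcCollector()
collector.feed(sample)
collector.close()
print(collector.sources)  # prints ['http://example.com/a.jpg']
```

Only the real tag is reported; the one inside the script text is skipped, which is why a regular expression over the raw source can still be useful for the escaped strings.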

After change

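Switch to urllib2, a very handy library for fetching data from web pages (it exists only in Python 2; in Python 3 its functionality lives in urllib.request).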

import urllib2
from lxml import html
url = "Sample url"

html_code = urllib2.urlopen(url).read()    # Read the response body; fromstring() expects the HTML text, not the response object.
parsed_source = html.fromstring(html_code) # Parses the HTML source into an element tree, on which XPath can be applied.
link = parsed_source.xpath("//a/@href")    # Returns a list of href values found in the HTML source; adjust this XPath to match the HTML of the page you are scraping.

This is sample code showing how you could approach the problem; you will have to write your own XPath to extract the data you need.
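
If the goal is the escaped form exactly as it appears in the page source, a regular expression that matches the literal backslashes may be enough. The following is a rough Python 3 sketch, not a definitive answer: the variable names are my own, and page_source is a shortened copy of the snippet from the question standing in for the downloaded page. In a raw pattern string, \\ matches one literal backslash.

```python
import re

# Shortened stand-in for the downloaded page source (assumption: the real
# page contains many entries shaped like this one).
page_source = r'var screen2 = "<li class=\"piclistli\"><img width=\"310\" src=\"http:\/\/cdn2.hbimg.cn\/store\/tuku\/310_999\/piccommon\/1218\/12188\/D5259EFE8B9999E8FA968CBD38.jpg\" alt=\"\" \/><\/li>";'

# src=\" followed by an escaped URL ending in .jpg, then the closing \"
# The group captures only the URL, backslashes included.
escaped_src = re.compile(r'src=\\"(http:\\/\\/.*?\.jpg)\\"')

escaped_urls = escaped_src.findall(page_source)

# Drop the backslash in front of each slash to get ordinary URLs.
plain_urls = [url.replace('\\/', '/') for url in escaped_urls]

for escaped, plain in zip(escaped_urls, plain_urls):
    print(escaped)
    print(plain)
```

To keep the surrounding src=\"...\" text as well, iterate with escaped_src.finditer(page_source) and take match.group(0) instead of the captured group.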
