How do I ignore HTML tags in the text I want to extract?

import urllib2
import re

url = ['http://recipes.latimes.com/recipe-restaurant-1833s-bacon-cheddar-biscuits-maple-chile-butter/']
htmlfile = urllib2.urlopen('http://recipes.latimes.com/recipe-restaurant-1833s-bacon-cheddar-biscuits-maple-chile-butter/')
htmltext = htmlfile.read()

regex2 = '<p><span class="step_leadin">(.+?)</p>'
pattern2 = re.compile(regex2)
method = re.findall(pattern2,htmltext)
print method

2 Answers

User

Answer 1 · edited 2024-05-17 04:05:02

I believe heinst's answer is better, but since you insist on using regex, you can do it like this:

import re

html = '<p><span class="step_leadin">Step1</span>Carefully transfer the biscuits to a rimmed baking sheet, spacing them an inch or so apart</p>'

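# Replace every <...> tag with an empty string, leaving only the text.
# The parenthesized form of print works in both Python 2 and Python 3.
print(re.sub(r'<[^>]*?>', '', html))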
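To tie this back to the code in the question, here is a minimal sketch in Python 3 that first captures each step paragraph with the question's pattern and then strips the remaining tags. The helper name extract_steps and the second sample paragraph are made up for illustration. Unlike the one-liner above, it substitutes a space instead of an empty string so that "Step1" and the following sentence do not run together.

```python
import re

# Sample markup modeled on the snippet above; the second paragraph is invented.
html = (
    '<p><span class="step_leadin">Step1</span>Carefully transfer the biscuits.</p>'
    '<p><span class="step_leadin">Step2</span>Bake until <b>golden</b> brown.</p>'
)

# Same pattern as in the question: capture the inner HTML of each step paragraph.
step_pattern = re.compile(r'<p><span class="step_leadin">(.+?)</p>')
# Same tag pattern as in this answer.
tag_pattern = re.compile(r'<[^>]*?>')

def extract_steps(markup):
    """Return the text of each step paragraph with the tags removed."""
    steps = []
    for inner in step_pattern.findall(markup):
        # Replace each tag with a space, then collapse runs of whitespace.
        text = tag_pattern.sub(' ', inner)
        steps.append(' '.join(text.split()))
    return steps

print(extract_steps(html))
```

This only works while the markup keeps exactly the shape the pattern expects and each paragraph sits on one line, which is the main reason the other answer recommends a real parser.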

User

Answer 2 · edited 2024-05-17 04:05:02

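I strongly recommend against using regex to parse HTML, because HTML is not a regular language. Use an HTML/XML parser such as BeautifulSoup or a similar library instead. Here is an example of what you are trying to do, using BeautifulSoup: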

from bs4 import BeautifulSoup

html = '<p><span class="step_leadin">Step1</span>Carefully transfer the biscuits to a rimmed baking sheet, spacing them an inch or so apart</p>'

# Name the parser explicitly so bs4 does not have to guess one and warn about it.
bs = BeautifulSoup(html, 'html.parser')

for p in bs.find_all('p'):
    # .text returns the paragraph's text with all nested tags dropped.
    print(p.text)
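If installing a third-party package is not an option, the same parser-based idea can be sketched with the standard library's html.parser module in Python 3. This swaps HTMLParser in for BeautifulSoup. The class name ParagraphText is made up for illustration, and the sketch assumes <p> elements are not nested.

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collect the text inside each <p> element, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph strings
        self._chunks = None    # text pieces of the open <p>, or None outside one

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self._chunks = []

    def handle_data(self, data):
        # Keep text only while we are inside a <p> element.
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'p' and self._chunks is not None:
            self.paragraphs.append(''.join(self._chunks))
            self._chunks = None

html = '<p><span class="step_leadin">Step1</span>Carefully transfer the biscuits to a rimmed baking sheet, spacing them an inch or so apart</p>'

parser = ParagraphText()
parser.feed(html)
parser.close()
print(parser.paragraphs)
```

As with p.text, the text of the nested span is joined directly to the text that follows it, so "Step1" and "Carefully" come out without a space between them.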
