Python regular expressions on lis

import urllib2, re
from xml.dom.minidom import Document
from BeautifulSoup import BeautifulSoup as bs

osc = open('OSCTEST.html','r')
oscread = osc.read()
soup = bs(oscread)

doc = Document()
root = doc.createElement('root')
doc.appendChild(root)
countries = doc.createElement('countries')
root.appendChild(countries)

findtags1 = re.compile('<h1 class="title metadata_title content_perceived_text(.*?)`</h1>',
                       re.DOTALL | re.IGNORECASE).findall(soup)
findtags2 = re.compile('<span class="content_text">(.*?)</span>',
                       re.DOTALL | re.IGNORECASE).findall(soup)

for header in findtags1:
    title_elem = doc.createElement('title')
    countries.appendChild(title_elem)
    header_elem = doc.createTextNode(header)
    title_elem.appendChild(header_elem)

for item in findtags2:
    art_elem = doc.createElement('artikel')
    countries.appendChild(art_elem)
    s = item.replace('<P>','')
    t = s.replace('</P>','')
    text_elem = doc.createTextNode(t)
    art_elem.appendChild(text_elem)

print doc.toprettyxml()

1 Answer

User

#1 · Posted on 2024-10-03 21:31:38

Using BeautifulSoup to parse the HTML is a good idea, but this will not work:

re.compile('<h1 class="title metadata_title content_perceived_text(.*?)`</h1>',
           re.DOTALL | re.IGNORECASE).findall(soup)

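You are trying to run a regular expression against a BeautifulSoup object, but a regular expression only works on strings. Instead, you should use the findAll method on the soup.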
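A minimal sketch of what a findAll-based version might look like. Several things here are assumptions rather than anything from the post above: it uses the newer bs4 package (where the method is spelled find_all), the SAMPLE_HTML string is a made-up stand-in for OSCTEST.html, and build_xml is a hypothetical helper name. It matches the h1 tags on the single class metadata_title, since the full class attribute of the real page is not shown.

```python
from xml.dom.minidom import Document

from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 (bs4) is installed

# Hypothetical stand-in for the contents of OSCTEST.html
SAMPLE_HTML = """
<html><body>
<h1 class="title metadata_title content_perceived_text">Belgium</h1>
<span class="content_text"><P>First article.</P></span>
</body></html>
"""


def build_xml(html):
    """Build the <root><countries>...</countries></root> document from HTML."""
    soup = BeautifulSoup(html, "html.parser")

    doc = Document()
    root = doc.createElement("root")
    doc.appendChild(root)
    countries = doc.createElement("countries")
    root.appendChild(countries)

    # class_ with one class name matches tags that carry it among other classes
    for header in soup.find_all("h1", class_="metadata_title"):
        title_elem = doc.createElement("title")
        title_elem.appendChild(doc.createTextNode(header.get_text(strip=True)))
        countries.appendChild(title_elem)

    # get_text() drops the nested <P> tags, so no manual replace() is needed
    for span in soup.find_all("span", class_="content_text"):
        art_elem = doc.createElement("artikel")
        art_elem.appendChild(doc.createTextNode(span.get_text(strip=True)))
        countries.appendChild(art_elem)

    return doc


doc = build_xml(SAMPLE_HTML)
print(doc.toprettyxml())
```

With the old BeautifulSoup 3 import used in the question, the equivalent call would be soup.findAll(...); the class_ keyword is specific to bs4.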

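If you really do want to parse the document as text with regular expressions, then don't use BeautifulSoup at all; just read the document into a string and run the patterns on that. But I suggest you take some time to learn how BeautifulSoup works, because that is the best approach. For details, see the documentation.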
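For comparison, here is a rough sketch of the regex-only route, using nothing but the standard library. The html_text string is an invented stand-in for the result of open('OSCTEST.html').read(), and the patterns are adapted from the question on the assumption that the h1 class attribute starts with the three class names shown there.

```python
import re

# Hypothetical stand-in for open('OSCTEST.html').read()
html_text = """
<h1 class="title metadata_title content_perceived_text">Belgium</h1>
<span class="content_text"><P>First article.</P></span>
"""

# [^>]* skips whatever remains of the opening tag (closing quote, extra classes)
title_re = re.compile(
    r'<h1 class="title metadata_title content_perceived_text[^>]*>(.*?)</h1>',
    re.DOTALL | re.IGNORECASE,
)
article_re = re.compile(
    r'<span class="content_text">(.*?)</span>',
    re.DOTALL | re.IGNORECASE,
)

# findall() works here because html_text is a plain string, not a soup object
titles = [t.strip() for t in title_re.findall(html_text)]

# Strip <P> and </P> in one pass instead of two chained replace() calls
articles = [
    re.sub(r"</?p>", "", a, flags=re.IGNORECASE).strip()
    for a in article_re.findall(html_text)
]

print(titles)
print(articles)
```

This is fragile against small changes in the markup (attribute order, nested spans), which is the main reason the answer recommends the parser instead.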
