Processing HTML with NLTK

1 answer

User

#1 · Posted 2024-06-25 23:53:23

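Method A: If you only want to process HTML that lives in .html files, you can select those files with html_files = PlaintextCorpusReader(corpus_root, r'.*\.html'), then read each file as a string and build a DOM object from it with BeautifulSoup or lxml (or urllib, which you seem to prefer). For example: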

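from lxml import html    # lxml.html tolerates real-world HTML; etree.fromstring expects well-formed XML

for fileid in html_files.fileids():
     # raw() returns the file contents as one string
     dom = html.fromstring(html_files.raw(fileid))

     # any further manipulations, e.g.:
     content = dom.cssselect('#content')        # needs the cssselect package; can also select all links or whatever
     maintext = content[0].text    # only reliable when #content has no other embedded tags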

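Method B: If you are dealing with "mixed" files that contain both plain text and HTML parts, then I think the only way is to use a regular expression to extract the HTML parts based on the tags, e.g.: htm_part = re.findall('<.*?>.*?</.*?>', yourdoc, re.DOTALL). re.findall returns a list, which will usually hold many fragments when the file is long, so in the next step you can do ''.join(htm_part). Once you have extracted only the HTML parts, you can go on and try lxml or BeautifulSoup to create a DOM object and work with it further, as described above. If that breaks, the regexp is most likely the problem and you will have to fine-tune it. Hope this helps.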
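As a rough sketch of Method B, the snippet below applies that same pattern to a small made-up document. The helper name extract_html_parts and the sample string are assumptions for illustration only; the sample deliberately uses flat, non-nested tags.

```python
import re

# Same pattern as in the answer: an opening tag, lazily matched content,
# then a closing tag. re.DOTALL lets "." span line breaks.
TAG_PAIR = re.compile(r'<.*?>.*?</.*?>', re.DOTALL)


def extract_html_parts(doc):
    """Return every substring of doc that looks like <tag>...</tag>.

    Hypothetical helper, not part of NLTK or lxml.
    """
    return TAG_PAIR.findall(doc)


# Hypothetical mixed document: plain text around two flat HTML fragments.
yourdoc = "Plain intro.\n<p>First</p>\nplain middle\n<p>Second</p>\nplain end"

htm_part = extract_html_parts(yourdoc)   # list of matched fragments
html_only = ''.join(htm_part)            # single string for an HTML parser

print(htm_part)
print(html_only)
```

With nested tags the lazy match stops at the first closing tag it meets, so a fragment can be cut short and the following match can swallow plain text between tags. That is the kind of case where the pattern needs fine-tuning before the result is handed to lxml or BeautifulSoup.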
