Python method for extracting content (excluding navigation) from an HTML page

3 Answers

User

Answer 1 · edited 2024-05-16 18:15:48

Try the Beautiful Soup library for Python. It has very simple methods for extracting information from an HTML file.

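Extracting data from web pages only works reliably if people write their pages in a similar way. In practice there is an almost unlimited number of ways to produce a page that looks identical, let alone all the combinations that convey the same information.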

Is there a particular type of information you are trying to extract, or some other end goal?

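You could try extracting everything inside 'div' and 'p' tags and comparing the relative sizes of all the pieces of information on the page. The problem is that people tend to group information into collections of 'div's and 'p's (or at least they do if they write well-formed HTML!).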
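To make the size comparison concrete, here is a minimal sketch. It uses the standard library's html.parser instead of Beautiful Soup so it runs without extra installs, and it assumes reasonably well-formed HTML. The names BlockTextCollector, block_sizes and SAMPLE are made up for this illustration.

```python
from html.parser import HTMLParser


class BlockTextCollector(HTMLParser):
    """Collect the text that sits directly inside each <div> and <p>."""

    BLOCK_TAGS = {"div", "p"}

    def __init__(self):
        super().__init__()
        self.open_blocks = []  # stack of (tag, text parts) for open blocks
        self.blocks = []       # finished (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.open_blocks.append((tag, []))

    def handle_endtag(self, tag):
        # Assumes well-formed HTML: the closing tag matches the innermost block.
        if tag in self.BLOCK_TAGS and self.open_blocks:
            open_tag, parts = self.open_blocks.pop()
            text = " ".join(" ".join(parts).split())  # normalise whitespace
            if text:
                self.blocks.append((open_tag, text))

    def handle_data(self, data):
        # Text is credited to the innermost open <div> or <p> only.
        if self.open_blocks:
            self.open_blocks[-1][1].append(data)


def block_sizes(html):
    """Return (share of all text, tag, text) per block, largest first."""
    collector = BlockTextCollector()
    collector.feed(html)
    collector.close()
    total = sum(len(text) for _, text in collector.blocks) or 1
    sized = [(len(text) / total, tag, text) for tag, text in collector.blocks]
    return sorted(sized, reverse=True)


SAMPLE = """
<div id="nav"><p>Home</p><p>About</p></div>
<div id="main">
  <p>Beautiful Soup makes it easy to pull text out of an HTML document.</p>
  <p>Short note.</p>
</div>
"""

for share, tag, text in block_sizes(SAMPLE):
    print(f"{share:6.1%}  <{tag}>  {text[:40]}")
```

The output also shows the grouping problem mentioned above: the two paragraphs under "main" are measured separately, so neither is credited with the size of the whole section.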

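Maybe if you built a tree of how the information is related (the nodes would be the 'p' or 'div' or whatever other element, and each node would hold its associated text), you could run some analysis to identify the smallest 'p' or 'div' that encloses what appears to be the majority of the information?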
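A rough sketch of that tree idea, again with only the standard library. It assumes the page has no script or style text to skip, and that "majority" means at least 60% of all words. Node, TreeBuilder, smallest_dominant_node and the threshold are illustrative choices, not an established algorithm.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the tree.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.own_words = 0  # words directly inside this element

    def total_words(self):
        return self.own_words + sum(c.total_words() for c in self.children)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.own_words += len(data.split())


def smallest_dominant_node(root, threshold=0.6):
    """Descend for as long as one child still holds `threshold` of all words."""
    goal = threshold * root.total_words()
    node = root
    while True:
        heavy = [c for c in node.children if c.total_words() >= goal]
        if not heavy:
            return node
        node = heavy[0]


SAMPLE = """
<html><body>
<div id="nav"><p>Home</p><p>About</p></div>
<div id="main">
  <p>The first paragraph of the article has a fair number of words in it.</p>
  <p>The second paragraph adds a few more words to the same section.</p>
</div>
<div id="footer"><p>Copyright notice</p></div>
</body></html>
"""

builder = TreeBuilder()
builder.feed(SAMPLE)
builder.close()
best = smallest_dominant_node(builder.root)
print(best.tag, best.attrs)
```

On this sample the search stops at the "main" div, because neither of its paragraphs holds 60% of the words on its own.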

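[EDIT] If you can get the page into the tree structure I suggested, you could then use a points system similar to SpamAssassin's. Define some rules that attempt to classify the information. Some examples: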

+1 point for every 100 words
+1 point for every child element that has > 100 words
-1 point if the section name contains the word 'nav'
-2 points if the section name contains the word 'advert'
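The four example rules could be sketched roughly like this. The Section class is a made-up stand-in for a node of the tree, "section name" is assumed to mean something like the element's id or class, and the word counts are assumed to include the words of child sections.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    name: str    # assumed to be the element's id or class
    words: int   # words directly in this section
    children: list = field(default_factory=list)

    def total_words(self):
        return self.words + sum(c.total_words() for c in self.children)


def score(section):
    """Apply the four example rules to one section."""
    points = section.total_words() // 100              # +1 per 100 words
    points += sum(1 for child in section.children
                  if child.total_words() > 100)        # +1 per wordy child
    name = section.name.lower()
    if "nav" in name:
        points -= 1
    if "advert" in name:
        points -= 2
    return points


page = [
    Section("top-nav", 40),
    Section("advert-banner", 120),
    Section("article", 50, [Section("intro", 150), Section("body", 420)]),
]

ranked = sorted(page, key=score, reverse=True)
for section in ranked:
    print(score(section), section.name)
```

Adding a rule is just another line in score(), which is what makes this kind of system easy to grow.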

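If you have a large number of low-scoring rules that add up as you find more relevant-looking sections, I think this could evolve into a fairly powerful and robust technique.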

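[EDIT2] Looking at Readability, it seems to do pretty much exactly what I just suggested! Maybe it could be improved to try to understand tables better?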

User

Answer 2 · edited 2024-05-16 18:15:48

Take a look at templatemaker: http://www.holovaty.com/writing/templatemaker/

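It was written by one of the founders of Django. Basically, you feed it a few sample strings and it generates a template in which the parts that differ between the samples are marked as holes.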

Here is an example from the Google Code page:


# Import the Template class.
>>> from templatemaker import Template

# Create a Template instance.
>>> t = Template()

# Learn a Sample String.
>>> t.learn('<b>this and that</b>')

# Output the template so far, using the "!" character to mark holes.
# We've only learned a single string, so the template has no holes.
>>> t.as_text('!')
'<b>this and that</b>'

# Learn another string. The True return value means the template gained
# at least one hole.
>>> t.learn('<b>alex and sue</b>')
True

# Sure enough, the template now has some holes.
>>> t.as_text('!')
'<b>! and !</b>'
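For anyone who wants to see the idea without installing anything, here is a rough approximation using the standard library's difflib. This is not templatemaker's own algorithm or API. make_template, as_text and extract are hypothetical helpers, and the min_len cutoff for ignoring tiny accidental matches is an arbitrary choice.

```python
import re
from difflib import SequenceMatcher


def make_template(first, second, min_len=2):
    """Keep the text both samples share; None marks a hole."""
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    pieces = []
    pos_a = pos_b = 0
    for a, b, size in matcher.get_matching_blocks():
        if 0 < size < min_len:
            continue  # ignore accidental one-character matches
        if a > pos_a or b > pos_b:
            pieces.append(None)  # the samples differ here
        if size:
            pieces.append(first[a:a + size])
        pos_a, pos_b = a + size, b + size
    return pieces


def as_text(pieces, hole="!"):
    """Render the template with a marker character in place of each hole."""
    return "".join(hole if p is None else p for p in pieces)


def extract(pieces, text):
    """Return the hole contents of `text`, or None if it does not fit."""
    pattern = "".join("(.*?)" if p is None else re.escape(p) for p in pieces)
    match = re.fullmatch(pattern, text, flags=re.DOTALL)
    return match.groups() if match else None


template = make_template("<b>this and that</b>", "<b>alex and sue</b>")
print(as_text(template))
print(extract(template, "<b>bob and carol</b>"))
```

Once the template is known, the holes are where the page-specific content lives, so extract() returns just those parts for a new page that follows the same layout.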

User

Answer 3 · edited 2024-05-16 18:15:48

You can use the boilerpipe Web application to fetch and extract content on the fly.

(This is not specific to Python, as you only need to issue an HTTP GET request to a page on Google App Engine.)
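From Python, such a GET request might look like the sketch below. The service address and the "url" query parameter are placeholders I am assuming for illustration; check the web application's own page for the real endpoint and parameters before using this.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_extract_url(service_endpoint, page_url, **options):
    """Build the GET URL; the 'url' parameter name is an assumption."""
    query = urlencode({"url": page_url, **options})
    return f"{service_endpoint}?{query}"


def fetch_extracted_text(service_endpoint, page_url, timeout=10):
    """Ask the extraction service for the main text of `page_url`."""
    request_url = build_extract_url(service_endpoint, page_url)
    with urlopen(request_url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# Placeholder endpoint, not the real service address.
EXAMPLE_ENDPOINT = "https://extractor.example/extract"
print(build_extract_url(EXAMPLE_ENDPOINT, "http://example.com/article?id=1"))
```

Only the URL is built here; calling fetch_extracted_text() needs a real endpoint and network access.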

Cheers,

Christian
