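haufe.stripml Python package - PyPI

A Python extension for stripping markup from strings.

Detailed project description of haufe.stripml

Package haufe.stripml

A Python extension for stripping HTML markup from text.

This package simply removes HTML-like markup from text in a quick and brutal manner. It can easily be used to strip XML or SGML from text as well. It does not do any syntax checking.

The core functionality is implemented in the C++ programming language and is therefore faster than using SGMLParser or regular expressions for the task.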
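
For comparison, the regular-expression approach mentioned above might look like the following minimal sketch. It is written for Python 3 and is not part of haufe.stripml; the names TAG_RE and strip_tags_regex are hypothetical and exist only for illustration.

```python
import re

# Hypothetical pattern, not taken from haufe.stripml: matches anything that
# looks like a tag, a comment, a doctype or a processing instruction.
# A lone "<" followed by a space is deliberately left alone.
TAG_RE = re.compile(r"</?[A-Za-z!?][^>]*>")


def strip_tags_regex(text):
    """Remove tag-like markup from text with a single regular expression."""
    return TAG_RE.sub("", text)


print(strip_tags_regex("<b>foo</b>"))
print(strip_tags_regex("foo <i>bar</i>."))
```

This simple pattern only deletes the tags themselves. It keeps the text between script tags and makes no claim to match the extension's behaviour on malformed input.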

Copyright

haufe.stripml is (C) Tobias Rodaebel & Haufe Mediengruppe, D-79111 Freiburg, Germany

License

This package is released under the LGPL 3. See LICENSE.txt.

Installation

Use easy_install:

easy_install haufe.stripml

Testing

haufe.stripml can be tested by typing:

python setup.py test -m haufe.stripml.tests

Credits

Thanks to Gottfried Ganssauge for translating it to the C++ programming language.

First we want to test the stripml method.

>>> from haufe.stripml import stripml
>>> stripml.__doc__
'stripml(s) -> string'

The only argument is a string.

>>> stripml('foo')
'foo'
>>> type(stripml('foo')) == type('')
True

The stripml method supports unicode as well.

>>> stripml(u'bar')
u'bar'
>>> type(stripml(u'foo')) == type(u'')
True

Try passing an integer as the first argument. This should raise a TypeError.

>>> try:
...     stripml(10)
... except TypeError, strerror:
...     print strerror
String or unicode string required.

Empty script

>>> stripml ('<script>')
''
>>> stripml (u'<script>')
u''
>>> stripml ('<script></script>')
''
>>> stripml (u'<script></script>')
u''

Try some long element names

>>> stripml ('<some-very-long-element-name-longer-than-foreseeable>')
''
>>> stripml (u'<some-very-long-element-name-longer-than-foreseeable>')
u''

Now we try some silly HTML

>>> stripml('<b>foo</b>')
'foo'
>>> stripml('foo <i>bar</i>.')
'foo bar.'
>>> stripml('''<font size = 12><b>Really <i>big</i> string
... </b></font>''')
'Really big string\n'

...and now in Unicode.

>>> stripml(u'<b>foo</b>')
u'foo'
>>> stripml(u'foo <i>bar</i>.')
u'foo bar.'
>>> stripml(u'''<font size = 12><b>Really <i>big</i> string
... </b></font>''')
u'Really big string\n'

Sometimes we have script tags, and nobody needs their content

>>> stripml('''We have a script in here <script language="JavaScript"
... type="text/javascript">alert('Hello, World!');</script>, dude.''')
'We have a script in here , dude.'

The same with Unicode.

>>> stripml(u'''We have a script in here <script language="JavaScript"
... type="text/javascript">alert('Hello, World!');</script>, dude.''')
u'We have a script in here , dude.'

On the other hand, the content of scrip tags (without the trailing 't') should not be stripped

>>> stripml('<scrip>KEEP THIS</scrip>')
'KEEP THIS'
>>> stripml(u'<scrip>KEEP THIS</scrip>')
u'KEEP THIS'

And neither should the content of scripting tags

>>> stripml('<scripting>KEEP THIS</scripting>')
'KEEP THIS'
>>> stripml(u'<scripting>KEEP THIS</scripting>')
u'KEEP THIS'

What about a stray </script> tag

>>> stripml('KEEP <script>DO NOT KEEP THIS</script></script>THIS')
'KEEP THIS'
>>> stripml(u'KEEP <script>DO NOT KEEP THIS</script></script>THIS')
u'KEEP THIS'
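
The script-related rules exercised above (drop the content of script elements, keep the content of scrip and scripting elements, discard a stray closing tag) can be approximated in pure Python. The following is a minimal Python 3 sketch built on regular expressions, not the C++ implementation shipped with the package; SCRIPT_RE, TAG_RE and strip_markup_sketch are hypothetical names.

```python
import re

# Hypothetical patterns, not taken from haufe.stripml.
# "\b" after "script" stops the pattern from matching <scripting>, and
# <scrip> cannot match because the final "t" is required.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.DOTALL)
# Anything else that looks like a tag, comment, doctype or processing
# instruction. A lone "<" followed by a space is left alone.
TAG_RE = re.compile(r"</?[A-Za-z!?][^>]*>")


def strip_markup_sketch(text):
    """Drop complete script elements first, then every remaining tag."""
    without_scripts = SCRIPT_RE.sub("", text)
    return TAG_RE.sub("", without_scripts)


print(strip_markup_sketch("<scrip>KEEP THIS</scrip>"))
print(strip_markup_sketch("KEEP <script>DO NOT KEEP THIS</script></script>THIS"))
```

Removing script elements in a first pass keeps the second pattern simple. The stray closing tag left over after that pass is deleted like any other tag.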

A longer string.

>>> result = stripml(u'''
... <?xml version="1.0" encoding="utf-8"?>
... <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
... <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US" lang="en-US">
... <head>
... <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
... <meta name="generator" content="" />
... <meta name="keywords" content="" />
... <meta name="description" content="" />
... <title>Test document</title>
... <script language="JavaScript" type="text/javascript">
... var foo=1;
... function getFoo() {
...     return foo;
... }
... </script>
... </head>
... <body onLoad="alert('Hello, World!');">
...   <h1>Test document</h1>
...   <p>This document is<br /> <i>only for testing</i>!</p>
...   <script>getFoo();</script>
... </body>
... </html>
... ''')
>>> result.strip()
u'Test document\n\n\n\n  Test document\n  This document is only for testing!'
>>> type(result)
<type 'unicode'>

Single "less than" or "greater than" characters are passed through.

>>> stripml(u'<strong>hundred < thousand < million.</strong>')
u'hundred < thousand < million.'
>>> stripml(u'<strong>thousand > hundred.</strong>')
u'thousand > hundred.'
>>> stripml('<strong>hundred < thousand < million.</strong>')
'hundred < thousand < million.'
>>> stripml('<strong>thousand > hundred.</strong>')
'thousand > hundred.'

Let's see whether a very long string is handled well.

>>> s = 5000 * u'<p>This is <span>a span within a paragraph.</span><!-- And this is a comment --></p>\n'
>>> stripml(s) == 5000 * u'This is a span within a paragraph.\n'
True

We should have a look at entities and encodings.

>>> stripml(u'In Stra&szlig;e und &Uuml;berf&uuml;hrung haben wir Umlaute.')
u'In Stra&szlig;e und &Uuml;berf&uuml;hrung haben wir Umlaute.'
>>> stripml('In Stra&szlig;e und &Uuml;berf&uuml;hrung haben wir Umlaute.')
'In Stra&szlig;e und &Uuml;berf&uuml;hrung haben wir Umlaute.'
>>> print stripml(u'In Straße und Überführung haben wir Umlaute.').encode('ISO-8859-1') == u'In Straße und Überführung haben wir Umlaute.'.encode('ISO-8859-1')
True
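
As the examples above show, stripml leaves entities such as &szlig; untouched. If the decoded characters are needed, a separate step can resolve them after stripping. The following minimal sketch uses html.unescape from the Python 3 standard library; it is not part of haufe.stripml, and decode_entities is a hypothetical helper name.

```python
import html


def decode_entities(text):
    """Replace named and numeric character references with their characters."""
    return html.unescape(text)


stripped = "In Stra&szlig;e und &Uuml;berf&uuml;hrung haben wir Umlaute."
print(decode_entities(stripped))
```

Running the decoding step after the markup has been removed avoids turning an escaped &lt; back into a character that a later stripping pass could mistake for the start of a tag.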

Changes

1.2.2 (2012-11-07)

Visual Studio 2010 support by M. Honeck

1.2.1 (2008-03-20)

Added nose test runner support

1.2.0 (2007-10-23)

First public release.
Added license.
