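<p>This may not be the fastest approach, but pyparsing recognizes HTML tags in most of their forms. The code below inverts the typical scan: it creates a scanner that matches any single character, then configures that scanner to skip HTML opening and closing tags, along with the common HTML <code>'&xxx;'</code> entities. pyparsing's <code>scanString</code> method returns a generator that yields the matched tokens and the start and end location of each match, so it is easy to build a list that maps every character outside the tags to its original location. From there, the rest is pretty much just <code>''.join</code> and indexing into the list. See the comments in the code below:</p>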
<pre><code>test = "<p>This &nbsp;is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>"
from pyparsing import Word, printables, anyOpenTag, anyCloseTag, commonHTMLEntity
non_tag_text = Word(printables+' ', exact=1).leaveWhitespace()
non_tag_text.ignore(anyOpenTag | anyCloseTag | commonHTMLEntity)
# use scanString to get all characters outside of tags, and build list
# of (char,loc) tuples
char_locs = [(t[0], loc) for t,loc,endloc in non_tag_text.scanString(test)]
# imagine a world without HTML tags...
untagged = ''.join(ch for ch, loc in char_locs)
# look for our string in the untagged text, then index into the char,loc list
# to find the original location
search_str = 'kind of a'
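match_idx = untagged.find(search_str)
# str.find returns -1 on a miss, which would silently index the last entry
if match_idx < 0:
    raise ValueError('search string not found in untagged text')
orig_loc = char_locs[match_idx][1]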
# print the test string, and mark where we found the matching text
print(test)
print(' '*orig_loc + '^')
"""
Should look like this:
<p>This &nbsp;is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>
                 ^
"""
</code></pre>
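<p>For comparison, the same position-mapping idea can be sketched with only the standard library. This is a stand-in, not the pyparsing approach above: it assumes a simple regular expression is good enough to recognize tags and entities in a small, well-formed snippet, and the names <code>SKIP</code>, <code>build_char_locs</code> and <code>find_in_untagged</code> are made up for illustration. It also returns -1 when the search string is absent instead of indexing blindly.</p>

```python
import re

# Assumption: tags look like <...> and entities like &name; or &#123;.
# This is a rough pattern for illustration, not a real HTML parser.
SKIP = re.compile(r"<[^>]*>|&#?\w+;")


def build_char_locs(html):
    """Return (char, original_index) pairs for text outside tags and entities."""
    char_locs = []
    pos = 0
    while pos < len(html):
        m = SKIP.match(html, pos)
        if m:
            # jump over the whole tag or entity
            pos = m.end()
            continue
        char_locs.append((html[pos], pos))
        pos += 1
    return char_locs


def find_in_untagged(html, search_str):
    """Return the original index where search_str starts, or -1 if absent."""
    char_locs = build_char_locs(html)
    untagged = "".join(ch for ch, _ in char_locs)
    idx = untagged.find(search_str)
    if idx < 0 or not char_locs:
        return -1
    return char_locs[idx][1]


test = "<p>This &nbsp;is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>"
orig_loc = find_in_untagged(test, "kind of a")
# print the test string, and mark where the match starts in the original
print(test)
print(" " * orig_loc + "^")
```

<p>The regex version is easier to drop into a script with no dependencies, but pyparsing's tag expressions cope with far more variation in real markup (attributes containing <code>&gt;</code>, unusual spacing, and so on), so treat this only as a way to see the mapping step in isolation.</p>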