How can I perform a markup-independent text string search in an HTML file?

2 Answers

User

Answer 1 · edited 2024-10-02 04:36:00

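{cd1> has been deprecated because of this. The correct solution is to strip the markup yourself while keeping track of the positions, so that you have a mapping for correcting the results returned by LT. When using LT from Java, AnnotatedText supports this, but the algorithm should be simple enough to port. (Full disclosure: I am the maintainer of LT.)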
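A minimal sketch of the "strip the markup but keep the positions" idea, using only the Python standard library. This is not the AnnotatedText algorithm itself; the names `TAG_RE`, `strip_markup`, and `to_original_span` are hypothetical, and the tag pattern is a deliberate simplification that ignores comments, `<script>` bodies, and `>` inside attribute values.

```python
import re

# Simplified tag pattern (an assumption of this sketch): it does not handle
# comments, <script> bodies, or '>' inside attribute values.
TAG_RE = re.compile(r"<[^>]*>")


def strip_markup(html):
    """Return (plain_text, offsets), where offsets[i] is the index in
    html of the character plain_text[i]."""
    chars = []
    offsets = []
    pos = 0
    for m in TAG_RE.finditer(html):
        # Keep everything between the previous tag and this one.
        for i in range(pos, m.start()):
            chars.append(html[i])
            offsets.append(i)
        pos = m.end()
    # Keep whatever follows the last tag.
    for i in range(pos, len(html)):
        chars.append(html[i])
        offsets.append(i)
    return "".join(chars), offsets


def to_original_span(offsets, start, end):
    """Map a non-empty half-open span [start, end) in the plain text
    to a half-open span in the original html."""
    return offsets[start], offsets[end - 1] + 1


html = "<p>This is kin<b>d</b> o<i>f</i> a test.</p>"
plain, offsets = strip_markup(html)

# Pretend a checker reported this span against the plain text.
start = plain.find("kind of a")
orig_start, orig_end = to_original_span(offsets, start, start + len("kind of a"))

print(plain)
print(html[orig_start:orig_end])
```

Any position reported against `plain` can be corrected through `offsets` in the same way before it is shown to the user.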

User

Answer 2 · edited 2024-10-02 04:36:00

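This may not be the fastest approach, but pyparsing can recognize most forms of HTML tags. The code below inverts the typical scan: it creates a scanner that matches any single character, then configures that scanner to skip HTML opening and closing tags as well as common HTML '&xxx;' entities. pyparsing's scanString method returns a generator that yields the matched tokens together with the start and end location of each match, so it is easy to build a list that maps every character outside the tags to its original location. From there, the rest is pretty much just ''.join and indexing into the list. See the comments in the code below: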

test = "<p>This &nbsp;is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>"

from pyparsing import Word, printables, anyOpenTag, anyCloseTag, commonHTMLEntity

non_tag_text = Word(printables+' ',  exact=1).leaveWhitespace()
non_tag_text.ignore(anyOpenTag | anyCloseTag | commonHTMLEntity)

# use scanString to get all characters outside of tags, and build list
# of (char,loc) tuples
char_locs = [(t[0], loc) for t,loc,endloc in non_tag_text.scanString(test)]

# imagine a world without HTML tags...
untagged = ''.join(ch for ch, loc in char_locs)

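# look for our string in the untagged text, then index into the char,loc list
# to find the original location
search_str = 'kind of a'
idx = untagged.find(search_str)
# find() returns -1 on a miss, which would silently index the last character
if idx < 0:
    raise ValueError('search string not found: %r' % search_str)
orig_loc = char_locs[idx][1]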

# print the test string, and mark where we found the matching text
print(test)
print(' '*orig_loc + '^')

"""
Should look like this:

<p>This &nbsp;is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>
                 ^
"""
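The code above reports only where the first match starts. As a hedged extension, the sketch below takes a list of (char, loc) tuples in the same shape as `char_locs` and returns the full original span of every occurrence. To keep it runnable without pyparsing, the list is built by a simplified regex stand-in (`SKIP_RE`, an assumption of this sketch, not part of the answer); `build_char_locs` and `find_all_spans` are hypothetical helper names.

```python
import re

# Stand-in for the pyparsing scanner (an assumption of this sketch): skip
# tags and '&xxx;' entities with a simplified regex, keep all other characters.
SKIP_RE = re.compile(r"<[^>]*>|&#?\w+;")


def build_char_locs(html):
    """Return a list of (char, loc) tuples for characters outside markup."""
    char_locs = []
    pos = 0
    for m in SKIP_RE.finditer(html):
        char_locs.extend((html[i], i) for i in range(pos, m.start()))
        pos = m.end()
    char_locs.extend((html[i], i) for i in range(pos, len(html)))
    return char_locs


def find_all_spans(char_locs, search_str):
    """Return half-open (start, end) spans in the original text, one for
    every occurrence of search_str in the untagged text."""
    if not search_str:
        return []
    untagged = "".join(ch for ch, _ in char_locs)
    spans = []
    idx = untagged.find(search_str)
    while idx != -1:
        first = char_locs[idx][1]
        last = char_locs[idx + len(search_str) - 1][1]
        spans.append((first, last + 1))
        idx = untagged.find(search_str, idx + 1)
    return spans


test = "<p>This &nbsp;is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>"
spans = find_all_spans(build_char_locs(test), "kind of a")

# Underline each match, including any tags that fall inside it.
for start, end in spans:
    print(test)
    print(" " * start + "^" * (end - start))
```

The end of a span is taken from the last matched character rather than from `start + len(search_str)`, because tags inside the match make the original span longer than the search string.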
