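<p>Regex is safe to use here if you know the HTML provider and what the markup inside looks like.</p>
<p>In that case, just use alternation and named capture groups.</p>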
<pre><code>telephone[^>]*>(?P<Telephone>[^<]+)|streetAddress[^>]*>(?P<Address>[^<]+)|Pages[^>]*>(?P<Pages>[^<]+)
</code></pre>
<p>See the <a href="https://regex101.com/r/mK0lF7/5" rel="nofollow">demo</a>.</p>
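<p>As a minimal sketch of how the alternation behaves in Python 3: exactly one named group takes part in each match, so <code>lastgroup</code> tells you which branch fired. The sample markup and the names <code>html</code>, <code>pattern</code> and <code>found</code> are assumptions made for illustration, not the asker's real input.</p>

```python
import re

# Hypothetical markup, assumed for illustration only; the attribute values
# simply contain the keywords the pattern searches for.
html = (
    '<span itemprop="telephone">9440717256</span>'
    '<span itemprop="streetAddress">H.No. 3-11-62, RTC Colony</span>'
    '<span class="Pages">Lal Bahadur Nagar</span>'
)

# Same pattern as above, split across lines for readability.
pattern = re.compile(
    r'telephone[^>]*>(?P<Telephone>[^<]+)'
    r'|streetAddress[^>]*>(?P<Address>[^<]+)'
    r'|Pages[^>]*>(?P<Pages>[^<]+)'
)

found = {}
for m in pattern.finditer(html):
    # Only one alternative matches at a time, so lastgroup names
    # the single group that captured something in this match.
    found[m.lastgroup] = m.group(m.lastgroup)

print(found)
```

<p>Keying on <code>lastgroup</code> avoids checking all three groups for <code>None</code> on every match.</p>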
<p>If <code>></code> is not serialized, you can use the following more generic regex (<strong>edit</strong>: now in verbose form), which is included in the code further down:</p>
<p><a href="http://ideone.com/wh44E7" rel="nofollow">Sample demo on IDEONE</a></p>
<p>Here is the regex part of the code, pasted in:</p>
<pre><code># Python 2 syntax (ur'' literals, print statements)
import re

p = re.compile(ur'''telephone[^<]*> # Looking for telephone
(?P<Telephone>[^<]+) # Capture telephone (all text up to the next tag)
|
streetAddress[^<]*> # Looking for streetAddress
(?P<Address>[^<]+) # Capture address (all text up to the next tag)
|
Pages[^<]*> # Looking for Pages
(?P<Pages>[^<]+) # Capture Pages (all text up to the next tag)''', re.IGNORECASE | re.VERBOSE)
test_str = "YOUR STRING"
print filter(None, [x.group("Telephone") for x in re.finditer(p, test_str)])
print filter(None, [x.group("Address") for x in re.finditer(p, test_str)])
print filter(None, [x.group("Pages") for x in re.finditer(p, test_str)])
</code></pre>
<p>Output (the results are doubled because I duplicated the input string with the nodes in a different order):</p>
<pre><code>[u'9440717256', u'9440717256']
[u'H.No. 3-11-62, RTC Colony', u'H.No. 3-11-62, RTC Colony']
[u'Lal Bahadur Nagar', u'Lal Bahadur Nagar']
</code></pre>
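<p>The code above is Python 2. A rough Python 3 port might look like the sketch below; since <code>filter</code> returns an iterator in Python 3, the empty results are dropped inside the list comprehension instead. The <code>sample</code> string and the helper <code>values</code> are hypothetical, made up to mirror the doubled input described above.</p>

```python
import re

pattern = re.compile(r'''
    telephone[^<]*>          # find "telephone", then skip to the end of the tag
    (?P<Telephone>[^<]+)     # capture the text up to the next tag
    |
    streetAddress[^<]*>      # find "streetAddress", then skip to the end of the tag
    (?P<Address>[^<]+)       # capture the text up to the next tag
    |
    Pages[^<]*>              # find "Pages", then skip to the end of the tag
    (?P<Pages>[^<]+)         # capture the text up to the next tag
''', re.IGNORECASE | re.VERBOSE)

# Hypothetical input: the same three nodes twice, in a different order.
sample = (
    '<span itemprop="telephone">9440717256</span>'
    '<span itemprop="streetAddress">H.No. 3-11-62, RTC Colony</span>'
    '<span class="Pages">Lal Bahadur Nagar</span>'
    '<span class="Pages">Lal Bahadur Nagar</span>'
    '<span itemprop="telephone">9440717256</span>'
    '<span itemprop="streetAddress">H.No. 3-11-62, RTC Colony</span>'
)


def values(name, text):
    """Return every non-empty capture of the named group, in document order."""
    return [m.group(name) for m in pattern.finditer(text) if m.group(name)]


print(values('Telephone', sample))
print(values('Address', sample))
print(values('Pages', sample))
```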