Regex for optional words in Python

2 answers

User

Answer 1 · edited 2024-10-16 20:51:26

Since your input is not valid HTML and may change over time, you can use an HTML parser such as BeautifulSoup. Simple selectors will keep working on input like this.

from bs4 import BeautifulSoup

# Fragment taken from the question; note that it is not well-formed HTML.
h = """<em phone="**telephone**">9440717256</em></div></div></li><li class="row"><i class="icon-sm icon-address"></i><div class="profile-details"><strong>Address</strong><div class="profile-child"><address itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress" class="data-item"><span itemprop="**streetAddress**">H.No. 3-11-62, RTC Colony</span>, <span>Vastu Colony, </span><span class="text-black" itemprop="addressLocality"><a href="/hyderabad/lal-bahadur-nagar/allcategory.aspx" title="**Pages**">Lal Bahadur Nagar</a></span>"""
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(h, "html.parser")

Edit: Now that you have told us you need the text of the elements that carry specific attribute values, you can use a function as a filter.

Output:
9440717256 H.No. 3-11-62, RTC Colony Lal Bahadur Nagar
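
The idea of filtering elements by their attribute values can be sketched as follows. This is only an illustration, not the answer's original code: it swaps in the standard library's `html.parser` for BeautifulSoup so that it runs without third-party packages (in BeautifulSoup itself, the equivalent is passing a function to `find_all`). The names `AttrTextCollector`, `WANTED` and `sample` are hypothetical, and `sample` is a shortened version of the fragment above.

```python
from html.parser import HTMLParser

# Assumption: these are the attribute values of interest, taken from the question.
WANTED = ("telephone", "streetAddress", "Pages")


class AttrTextCollector(HTMLParser):
    """Collect the text that directly follows a start tag with a wanted attribute value."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.capturing = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        self.capturing = any(
            key in (value or "") for _, value in attrs for key in self.wanted
        )

    def handle_endtag(self, tag):
        # Stop capturing as soon as any element closes.
        self.capturing = False

    def handle_data(self, data):
        if self.capturing and data.strip():
            self.texts.append(data.strip())


# Hypothetical, shortened sample based on the fragment in the answer.
sample = (
    '<em phone="**telephone**">9440717256</em>'
    '<span itemprop="**streetAddress**">H.No. 3-11-62, RTC Colony</span>, '
    '<span>Vastu Colony, </span>'
    '<a href="/hyderabad/lal-bahadur-nagar/allcategory.aspx" '
    'title="**Pages**">Lal Bahadur Nagar</a>'
)

collector = AttrTextCollector(WANTED)
collector.feed(sample)
collector.close()
print(" ".join(collector.texts))
```

The sketch matches on a substring of the attribute value because the values in the sample are wrapped in `**`. Stricter input would allow an exact comparison instead.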

User
Answer 2 · edited 2024-10-16 20:51:26

Regex is safe if you know the HTML provider and what the markup looks like.
In that case, just use alternation and named capturing groups:
telephone[^>]*>(?P<Telephone>[^<]+)|streetAddress[^>]*>(?P<Address>[^<]+)|Pages[^>]*>(?P<Pages>[^<]+)
See the demo.
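
As a minimal sketch of how the alternation with named groups behaves, the snippet below applies the pattern above to a shortened sample. The `sample` string and the `found` dictionary are hypothetical names introduced for illustration. Because each alternative contains exactly one named group, `match.lastgroup` tells which alternative matched.

```python
import re

# The pattern from the answer, split across lines for readability.
pattern = re.compile(
    r"telephone[^>]*>(?P<Telephone>[^<]+)"
    r"|streetAddress[^>]*>(?P<Address>[^<]+)"
    r"|Pages[^>]*>(?P<Pages>[^<]+)"
)

# Hypothetical, shortened sample based on the fragment in the question.
sample = (
    '<em phone="**telephone**">9440717256</em>'
    '<span itemprop="**streetAddress**">H.No. 3-11-62, RTC Colony</span>, '
    '<span>Vastu Colony, </span>'
    '<a href="/hyderabad/lal-bahadur-nagar/allcategory.aspx" '
    'title="**Pages**">Lal Bahadur Nagar</a>'
)

found = {}
for match in pattern.finditer(sample):
    # lastgroup is the name of the single group in the alternative that matched.
    found[match.lastgroup] = match.group(match.lastgroup)

print(found)
```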
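If `>` is not serialized (that is, it can appear unescaped before the end of the tag), you can use a more general regex that excludes `<` instead of `>` after the keyword (edit: now in verbose form). It appears in the code pasted below.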
Sample demo on IDEONE
Pasting the regex part of the code:

import re

p = re.compile(r'''
    telephone[^<]*>           # Look for telephone
    (?P<Telephone>[^<]+)      # Capture telephone (all text up to the next tag)
    |
    streetAddress[^<]*>       # Look for streetAddress
    (?P<Address>[^<]+)        # Capture address (all text up to the next tag)
    |
    Pages[^<]*>               # Look for Pages
    (?P<Pages>[^<]+)          # Capture Pages (all text up to the next tag)
    ''', re.IGNORECASE | re.VERBOSE)

test_str = "YOUR STRING"

# Each match fills exactly one named group; skip the matches where the group is None.
print([m.group("Telephone") for m in p.finditer(test_str) if m.group("Telephone")])
print([m.group("Address") for m in p.finditer(test_str) if m.group("Address")])
print([m.group("Pages") for m in p.finditer(test_str) if m.group("Pages")])

Output (the results are doubled because I duplicated the input string with a different node order):

['9440717256', '9440717256']
['H.No. 3-11-62, RTC Colony', 'H.No. 3-11-62, RTC Colony']
['Lal Bahadur Nagar', 'Lal Bahadur Nagar']
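
To illustrate why the more general pattern excludes `<` rather than `>`, here is a small sketch that compares the two variants on a tag whose attribute value contains a literal `>`. The `sample` string and the names `strict` and `general` are hypothetical and exist only for this comparison.

```python
import re

# Variant from the first regex: stops at the first ">" after the keyword.
strict = re.compile(r"telephone[^>]*>(?P<Telephone>[^<]+)")
# More general variant: backtracks to the last ">" before the next "<".
general = re.compile(r"telephone[^<]*>(?P<Telephone>[^<]+)")

# Hypothetical input with an unescaped ">" inside an attribute value.
sample = '<em phone="telephone" title="a > b">9440717256</em>'

strict_value = strict.search(sample).group("Telephone")
general_value = general.search(sample).group("Telephone")

print(repr(strict_value))   # includes leftover attribute text
print(repr(general_value))  # only the element text
```

With the first variant, the capture starts right after the `>` inside the attribute value, so it picks up part of the tag. The second variant only captures the text between the tag's closing `>` and the next `<`.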
