Python regular expressions on a list of strings

Posted 2024-10-01 15:47:55


I am trying to extract a URL from each string in a list. Sample list:

import re
p = ['<img class="alignnone size-full wp-image-2087" src="http://www.sample.com/test.jpg" alt="0wCR41v" width="540" height="720" srcset="http://www.sample.com/test-225x300.jpg 225w, http://www.sample.com/test.jpg 540w" sizes="(max-width: 540px) 100vw, 540px" />', '<img class="alignnone size-large wp-image-2133" src="http://www.sample.com/test2.jpg" alt="NtAboHF" width="583" height="1024" srcset="http://www.happyfridaygents.com/wp-content/uploads/2016/04/NtAboHF-768x1349.jpg 768w, http://www.sample.com/test2.jpg 583w, http://www.happyfridaygents.com/wp-content/uploads/2016/04/NtAboHF.jpg 828w" sizes="(max-width: 583px) 100vw, 583px" />']

I want to extract the http://www.sample.com/test.jpg part that follows src=.

If p were just a single string, I could use findall:

[code snippet missing from the source]
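For the single-string case, the call could look like the sketch below. The original snippet is not preserved, so the pattern and the variable name `s` are assumptions; the negated character class `[^"]+` is just one way to stop at the closing quote.

```python
import re

# Assumption: one <img> tag as a plain string, shortened from the sample list.
s = ('<img class="alignnone" src="http://www.sample.com/test.jpg" alt="0wCR41v" '
     'srcset="http://www.sample.com/test-225x300.jpg 225w" />')

# Capture everything between src=" and the next double quote.
# srcset="..." is not picked up, because the pattern needs '=' right after 'src'.
urls = re.findall(r'src="([^"]+)"', s)
print(urls)
```

`findall` returns a list even for a single match, which is why a list of strings ends up needing a loop or a comprehension.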

But how do I loop over this list and return a list of all the URLs in p?


3 answers

Here is a solution using BeautifulSoup:

>>> p = ['<img class="alignnone size-full wp-image-2087" src="http://www.sample.com/test.jpg" alt="0wCR41v" width="540" height="720" srcset="http://www.sample.com/test-225x300.jpg 225w, http://www.sample.com/test.jpg 540w" sizes="(max-width: 540px) 100vw, 540px" />', '<img class="alignnone size-large wp-image-2133" src="http://www.sample.com/test2.jpg" alt="NtAboHF" width="583" height="1024" srcset="http://www.happyfridaygents.com/wp-content/uploads/2016/04/NtAboHF-768x1349.jpg 768w, http://www.sample.com/test2.jpg 583w, http://www.happyfridaygents.com/wp-content/uploads/2016/04/NtAboHF.jpg 828w" sizes="(max-width: 583px) 100vw, 583px" />']

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(''.join(p), 'html.parser')
>>> src_links = [img['src'] for img in soup.find_all('img')]

>>> src_links
['http://www.sample.com/test.jpg', 'http://www.sample.com/test2.jpg']

If you want to use a regex instead:

[code snippet missing from the source]
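A regex version that works over the whole list might look like the sketch below. The answer's own snippet is missing, so this is not necessarily what it showed; the helper name `extract_src` and the shortened sample strings are assumptions made for illustration.

```python
import re

# Compile once, reuse for every string in the list.
SRC_RE = re.compile(r'src="([^"]+)"')

def extract_src(tags):
    """Return a flat list of every src value found in a list of HTML strings."""
    return [url for tag in tags for url in SRC_RE.findall(tag)]

# Shortened versions of the two sample strings.
p = [
    '<img src="http://www.sample.com/test.jpg" alt="0wCR41v" width="540" />',
    '<img src="http://www.sample.com/test2.jpg" alt="NtAboHF" width="583" />',
]
print(extract_src(p))
```

The nested comprehension flattens the per-string results, so the caller gets one list of URLs rather than a list of lists.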

Is this what you are after?

import re
p = ['<img class="alignnone size-full wp-image-2087" src="http://www.sample.com/test.jpg" alt="0wCR41v" width="540" height="720" srcset="http://www.sample.com/test-225x300.jpg 225w, http://www.sample.com/test.jpg 540w" sizes="(max-width: 540px) 100vw, 540px" />', '<img class="alignnone size-large wp-image-2133" src="http://www.sample.com/test2.jpg" alt="NtAboHF" width="583" height="1024" srcset="http://www.happyfridaygents.com/wp-content/uploads/2016/04/NtAboHF-768x1349.jpg 768w, http://www.sample.com/test2.jpg 583w, http://www.happyfridaygents.com/wp-content/uploads/2016/04/NtAboHF.jpg 828w" sizes="(max-width: 583px) 100vw, 583px" />']
outList = [re.findall('src="(.+)" alt', pp)[0] for pp in p]
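One caveat with indexing `[0]` is that it raises `IndexError` for any string that has no match. A possible variant, assuming it is acceptable to skip such strings silently, uses `re.search` and keeps only the hits. The second list item and the name `out_list` are hypothetical.

```python
import re

# Non-greedy version of the pattern used in the answer.
pattern = re.compile(r'src="(.+?)" alt')

p = [
    '<img src="http://www.sample.com/test.jpg" alt="0wCR41v" />',
    '<p>no image here</p>',  # hypothetical item without a src attribute
]

# search() returns None when nothing matches, so filter those out.
matches = (pattern.search(s) for s in p)
out_list = [m.group(1) for m in matches if m]
print(out_list)
```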

How about a loop:

>>> pe = re.compile('src="(.+)" alt')
>>> for img in p:
...     print(pe.findall(img))
... 
['http://www.sample.com/test.jpg']
['http://www.sample.com/test2.jpg']
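The pattern `src="(.+)" alt` works on the sample because each string holds exactly one tag. As a rough illustration of what can go wrong otherwise, the sketch below runs it on a hypothetical string with two tags and compares it with a pattern bounded by the closing quote.

```python
import re

# Hypothetical input: two <img> tags in a single string.
html = '<img src="a.jpg" alt="x"><img src="b.jpg" alt="y">'

# Greedy .+ runs on to the last '" alt' in the string.
greedy = re.findall(r'src="(.+)" alt', html)

# [^"]+ cannot cross a double quote, so each src is captured separately.
bounded = re.findall(r'src="([^"]+)"', html)

print(greedy)
print(bounded)
```

Here `greedy` holds a single string that spans both tags, while `bounded` holds the two file names. This matters if the strings are ever joined before matching.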
