Extracting links from source code with regex; Python

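User

Reply 1 · edited 2024-09-28 17:03:36

Use BeautifulSoup to find the meta tags whose content attribute matches, then replace the .ece links with their .html counterparts: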

from bs4 import BeautifulSoup
import re

html = """
    <meta content="http://www.telegraaf.nl/telesport/voetbal/buitenlands/article22178882.ece" />
    <meta content="http://www.telegraaf.nl/telesport/voetbal/buitenlands/22178882/__Wenger_vreest_het_ergste__.html" />
"""

# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')
# lookup table mapping each URL prefix to its full .html link
html_links = {
    el['content'].rpartition('/')[0]: el['content']
    for el in soup.find_all('meta', content=re.compile(r'\.html$'))
}
# find all .ece links, strip the ending so it matches a prefix, then
# set the meta content to the looked-up .html link
for el in soup.find_all('meta', content=re.compile(r'\.ece$')):
    url = re.sub(r'article(\d+)\.ece$', r'\1', el['content'])
    el['content'] = html_links[url]

print(soup)
# both <meta> tags now carry the .html URL:
# <meta content="http://www.telegraaf.nl/telesport/voetbal/buitenlands/22178882/__Wenger_vreest_het_ergste__.html"/>

User

Reply 2 · edited 2024-09-28 17:03:36

(.*?)(http:\/\/.*\/.*?\.)(ece)

Try this one. Replace the match with $2html.

See the demo:

http://regex101.com/r/nA6hN9/24
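In Python the same substitution can be done with re.sub. This is only a sketch: the sample URL is taken from the first reply, and the `$2` from the demo is written as `\g<2>` because that is how Python's re module spells a group reference in a replacement string. Note that whatever group 1 captures in front of the URL is dropped by this replacement.

```python
import re

# Pattern from the reply above, written as a raw string.
pattern = re.compile(r'(.*?)(http:\/\/.*\/.*?\.)(ece)')

ece_url = "http://www.telegraaf.nl/telesport/voetbal/buitenlands/article22178882.ece"

# \g<2> is the Python equivalent of $2; "html" replaces the "ece" ending.
html_url = pattern.sub(r'\g<2>html', ece_url)
print(html_url)
```

This only swaps the file extension, so it helps when the .html page lives at the same path as the .ece page.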

User

Reply 3 · edited 2024-09-28 17:03:36

Here is a very simple regex to get you started.

This one will extract all the links:

\<meta content="(http:\/\/www\.telegraaf\.nl.*)"

This one will match only the html links:

\<meta content="(http:\/\/www\.telegraaf\.nl.*\.html)"
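As a quick check, the two patterns can be run against the sample markup from the first reply. This is a sketch, not part of the original answer; the variable names are made up for illustration. Because `.*` is greedy, it assumes each meta tag sits on its own line.

```python
import re

html = """
    <meta content="http://www.telegraaf.nl/telesport/voetbal/buitenlands/article22178882.ece" />
    <meta content="http://www.telegraaf.nl/telesport/voetbal/buitenlands/22178882/__Wenger_vreest_het_ergste__.html" />
"""

# The two patterns from this reply, written as raw strings.
ALL_LINKS = re.compile(r'\<meta content="(http:\/\/www\.telegraaf\.nl.*)"')
HTML_LINKS = re.compile(r'\<meta content="(http:\/\/www\.telegraaf\.nl.*\.html)"')

all_links = ALL_LINKS.findall(html)    # every telegraaf.nl link in a meta tag
html_links = HTML_LINKS.findall(html)  # only the links ending in .html

print(len(all_links), len(html_links))
```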

To use it with what you already have, you could do something like this:

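import re
from urllib.request import urlopen  # Python 3 replacement for urllib2

# ece_url_list is your existing list of .ece URLs
pattern = re.compile(r'\<meta content="(http:\/\/www\.telegraaf\.nl.*\.html)"')
replacements = dict()
for url in ece_url_list:
    response = urlopen(url)
    # read() returns bytes in Python 3; this assumes the pages are UTF-8
    html = response.read().decode('utf-8', errors='replace')
    replacements[url] = pattern.findall(html)[0]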

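Note: this assumes that every source page always contains an html link in this meta tag, and it expects only one. If a page has none, the [0] lookup raises an IndexError.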
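If that assumption might not hold, the lookup can be made tolerant of pages without a match. A minimal sketch, assuming you would rather get None than an exception; `first_html_link` is a hypothetical helper name, and the pattern is the one from this reply.

```python
import re

HTML_LINK = re.compile(r'\<meta content="(http:\/\/www\.telegraaf\.nl.*\.html)"')

def first_html_link(html):
    """Return the first .html link found in a meta tag, or None if there is none."""
    match = HTML_LINK.search(html)
    return match.group(1) if match else None

page_with_link = '<meta content="http://www.telegraaf.nl/telesport/voetbal/buitenlands/22178882/__Wenger_vreest_het_ergste__.html" />'
page_without_link = '<meta content="http://www.telegraaf.nl/telesport/voetbal/buitenlands/article22178882.ece" />'

print(first_html_link(page_with_link))
print(first_html_link(page_without_link))
```

In the loop above you could then skip any URL for which the helper returns None instead of indexing into an empty list.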
