Can't parse odd-looking website addresses from some identical links

Posted 2024-10-06 12:15:18


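I'm trying to extract website addresses from a few identical-looking pages. I wrote a regular expression to parse them, but the pattern I defined is clearly a poor one. How can I get only the website address from the p tag under the post-content class on each page?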

I tried:

import re
import requests
from bs4 import BeautifulSoup

links = [
    'https://colegios.es/2012/santisimo-rosario-mosen-rubi-avila/',
    'https://colegios.es/2012/cra-el-valle-villarejo-del-valle/',
    'https://colegios.es/2012/ceip-las-canadas-trescasas/',
    'https://colegios.es/2012/cra-el-barranco-san-esteban-del-valle/'
]

def get_website(link):
    res = requests.get(link,headers={'User-Agent':'Mozilla/5.0'})
    soup = BeautifulSoup(res.text,"html5lib")
    text = soup.select_one('.post-content > p').get_text(strip=True, separator='\n')
    website = re.findall(r'\s+(.*)\n\[', text)[0]
    print(website)

if __name__ == '__main__':
    for link in links:
        get_website(link)

The output I get:

www3.planalfa.es/stmorosario
centros1.pntic.mec.es/elvalle/webCra
Dirección: Las Pozas, 17 40194 Trescasas Segovia
Tel. 920 383 556 05005887@educa.jcyl.es   centros1.pntic.mec.es/cp.el.barranco

Expected output:

www3.planalfa.es/stmorosario
centros1.pntic.mec.es/elvalle/webCra

centros1.pntic.mec.es/cp.el.barranco

1 answer
User
#1 · Posted 2024-10-06 12:15:18

I'm sure this approach will break before long, but here it is:

import re
import requests
from bs4 import BeautifulSoup

links = [
    'https://colegios.es/2012/santisimo-rosario-mosen-rubi-avila/',
    'https://colegios.es/2012/cra-el-valle-villarejo-del-valle/',
    'https://colegios.es/2012/ceip-las-canadas-trescasas/',
    'https://colegios.es/2012/cra-el-barranco-san-esteban-del-valle/'
]

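def get_website(session, link):
    res = session.get(link, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(res.text, "html5lib")
    # Split the first paragraph on <br/> and take the second-to-last piece,
    # which is where the website appears on these pages.
    y = str(soup.select_one('.post-content p')).split('<br/>')[-2]
    # A piece containing 'Dirección' is the address line: no website listed.
    if 'Dirección' not in y:
        # Collapse runs of whitespace, then keep the last token.
        y = re.sub(r'\s{2,}', ' ', y).strip()
        website = y.split(' ')[-1]
        print(website)

if __name__ == '__main__':
    with requests.Session() as session:
        for link in links:
            get_website(session, link)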
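
A possibly less position-dependent variant, offered only as a sketch: the regex in the question captures the whole line before `\n[`, so a line holding phone, e-mail and website comes back intact, and a page with no website returns its address line. Instead of relying on where the website sits, one could filter the tokens by shape. The helper name `pick_website` and the pattern `SITE_RE` are hypothetical, and the pattern encodes my assumption about what a website address looks like on these pages; the sample strings are the lines printed in the question.

```python
import re

# Assumption: a "website" is a whitespace-delimited token shaped like
# host.tld or host.tld/path, optionally prefixed with http:// or https://.
SITE_RE = re.compile(
    r'(?:https?://)?(?:[\w-]+\.)+[a-z]{2,}(?:/\S*)?',
    re.IGNORECASE,
)


def pick_website(text):
    """Return the last website-like token in text, or None if there is none."""
    for token in reversed(text.split()):
        # E-mail addresses contain '@', so they never fully match SITE_RE.
        if SITE_RE.fullmatch(token):
            return token
    return None


if __name__ == '__main__':
    # Lines taken from the output shown in the question.
    samples = [
        'www3.planalfa.es/stmorosario',
        'centros1.pntic.mec.es/elvalle/webCra',
        'Dirección: Las Pozas, 17 40194 Trescasas Segovia',
        'Tel. 920 383 556 05005887@educa.jcyl.es   centros1.pntic.mec.es/cp.el.barranco',
    ]
    for sample in samples:
        print(pick_website(sample))
```

Inside `get_website`, the helper could be applied to the candidate line or piece of the paragraph in place of the positional `split(' ')[-1]`. It returns `None` for a page that lists no website, which the caller can print as an empty line.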
