If a site has two links, one that I want and one that I don't, how do I extract just the specific one?

Posted 2024-09-29 01:21:11


<td> <input type="hidden" name="ctl00$ContentPlaceHolder1$dlstCollege$ctl01$hdnInstituteId" id="ContentPlaceHolder1_dlstCollege_hdnInstituteId_1" value="866 " /> <a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_1" href="CollegeDetailedInformation.aspx?Inst=866 ">A.N.A INSTITUTE OF PHARMACEUTICAL SCIENCES & RESEARCH,BAREILLY (866)</a> <br /> <b>Location:</b> <span id="ContentPlaceHolder1_dlstCollege_lblAddress_1">13.5 km Bareilly - Delhi road, near rubber factory agras road ,Bareilly</span> <br /> <b>Course:</b> <span id="ContentPlaceHolder1_dlstCollege_lblCourse_1">B.Pharm,</span> <br /> <b>Category:</b> <span id="ContentPlaceHolder1_dlstCollege_lblInstituteType_1">Private</span> <br /> <b>Web Address:</b> <a id="lnkBtnWebURL" href='' target="_blank"></a> <br /> </td> </tr> <tr> <td> <input type="hidden" name="ctl00$ContentPlaceHolder1$dlstCollege$ctl02$hdnInstituteId" id="ContentPlaceHolder1_dlstCollege_hdnInstituteId_2" value="486 " /> <a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_2" href="CollegeDetailedInformation.aspx?Inst=486 ">A.N.A.COLLEGE OF ENGINEERING & MANAGEMENT,BAREILLY (486)</a> <br /> <b>Location:</b> <span id="ContentPlaceHolder1_dlstCollege_lblAddress_2">13.5 Km. NH-24, Bareilly-Delhi Highway, Near Rubber Factory, Bareilly</span> <br /> <b>Course:</b> <span id="ContentPlaceHolder1_dlstCollege_lblCourse_2">B.Tech,M.Tech,</span> <br /> <b>Category:</b> <span id="ContentPlaceHolder1_dlstCollege_lblInstituteType_2">Private</span> <br /> <b>Web Address:</b> <a id="lnkBtnWebURL" href='http://www.anacollege.org/index.html' target="_blank">http://www.anacollege.org/index.html</a> <br /> </td> </tr>

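I want to extract one specific URL from this site (for example, CollegeDetailedInformation.aspx?Inst=866), but each entry in the markup has two anchor tags, and I don't want the second one (for example, http://www.anacollege.org/index.html).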


import requests
from bs4 import BeautifulSoup

res = requests.get('https://erp.aktu.ac.in/WebPages/KYC/CollegeList.aspx?City=&CType=&Cu=&Br=&Inst=&IType=')
soup = BeautifulSoup(res.content, 'html.parser')

table = soup.find("table", attrs={'class': 'table table-bordered table-responsive'})

pagelink = []
# Skip the first anchor, then collect the href of every remaining <a> in the table
for anchor in table.findAll('a')[1:]:
    link = anchor['href']
    print(link)
    url = 'https://erp.aktu.ac.in/WebPages/KYC/' + link
    pagelink.append(url)
print(pagelink)

I wrote this code, but it extracts every link:

CollegeDetailedInformation.aspx?Inst=486  
http://www.anacollege.org/index.html
CollegeDetailedInformation.aspx?Inst=602  
http://www.aashlarbschool.com
CollegeDetailedInformation.aspx?Inst=032  
http://www.abes.ac.in
CollegeDetailedInformation.aspx?Inst=290  
http://www.abesit.in
CollegeDetailedInformation.aspx?Inst=913  
http://www.abesitpharmacy.in
CollegeDetailedInformation.aspx?Inst=643  
http://www.vitsald.com
CollegeDetailedInformation.aspx?Inst=1036 
http://www.abss.edu.in

How can I fix this? I only need the CollegeDetailedInformation.aspx?Inst=? links.


3 Answers

You can use a CSS selector to find only the links you need: a[href*=CollegeDetailedInformation]

import requests
from bs4 import BeautifulSoup

res = requests.get('https://erp.aktu.ac.in/WebPages/KYC/CollegeList.aspx?City=&CType=&Cu=&Br=&Inst=&IType=')
soup = BeautifulSoup(res.content, 'html.parser')


table = soup.find("table", attrs = {'class':'table table-bordered table-responsive'})

allAnchor = table.select("a[href*=CollegeDetailedInformation]")

pagelink = []
for anchor in allAnchor:
    link = anchor['href']
    # print(link)
    url = 'https://erp.aktu.ac.in/WebPages/KYC/'+ link
    pagelink.append(url)

print(pagelink)

The output will be:

['https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=968  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=866  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=486  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=602  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=032  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=290  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=913  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=643  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=1036 ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=312  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=986  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=686  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=805  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=225  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=799  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=041  ',
'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=952  ',

and so on....
]
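
To try the selector without hitting the live site, here is a minimal sketch that runs it against a trimmed copy of the HTML from the question. The names SAMPLE_HTML, BASE_URL, and extract_detail_links are made up for illustration, and the strip() and urljoin() calls are an added assumption for cleaning up the trailing spaces in the href values; they are not part of the answer above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://erp.aktu.ac.in/WebPages/KYC/"

# Trimmed copy of the markup from the question (illustrative fixture only).
SAMPLE_HTML = """
<table class="table table-bordered table-responsive">
<tr><td>
<a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_1" href="CollegeDetailedInformation.aspx?Inst=866 ">A.N.A INSTITUTE (866)</a>
<a id="lnkBtnWebURL" href="" target="_blank"></a>
</td></tr>
<tr><td>
<a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_2" href="CollegeDetailedInformation.aspx?Inst=486 ">A.N.A.COLLEGE (486)</a>
<a id="lnkBtnWebURL" href="http://www.anacollege.org/index.html" target="_blank">http://www.anacollege.org/index.html</a>
</td></tr>
</table>
"""


def extract_detail_links(html, base_url=BASE_URL):
    """Return absolute URLs for anchors whose href contains CollegeDetailedInformation."""
    soup = BeautifulSoup(html, "html.parser")
    # [href*="..."] matches any anchor whose href contains the given substring.
    anchors = soup.select('a[href*="CollegeDetailedInformation"]')
    # strip() drops the trailing spaces found in the site's href values.
    return [urljoin(base_url, a["href"].strip()) for a in anchors]


if __name__ == "__main__":
    for url in extract_detail_links(SAMPLE_HTML):
        print(url)
```

With this fixture, the two college website anchors (id lnkBtnWebURL) are skipped and only the two detail-page URLs are returned.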

The anchor elements that link to the college detail pages have an id attribute that starts with ContentPlaceHolder1_dlstCollege_. So pass that as the attrs argument of find_all():

import re

for anchor in table.findAll('a', attrs={"id": re.compile("^ContentPlaceHolder1_dlstCollege_.*")}):
    ...

You can also pass it to find_all() as the id keyword argument:

for anchor in table.findAll('a', id=re.compile("^ContentPlaceHolder1_dlstCollege_.*")):
    ...

The regular expression can be made more specific, for example "^ContentPlaceHolder1_dlstCollege_hlpkInstituteName_.*", which should match only the links on the college names.

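(I would remove the [1:] slice you put at the end, since it was probably there to skip a link at the start that you didn't want. If it turns out you still need it, add it back.)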
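
As a rough, self-contained sketch of the id-based filter, the snippet below applies the more specific pattern to a trimmed copy of the HTML from the question instead of the live page. SAMPLE_HTML, NAME_LINK_ID, and college_hrefs are hypothetical names used only for this example, and the strip() call is an added assumption to remove the trailing spaces in the href values.

```python
import re

from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (illustrative fixture only).
SAMPLE_HTML = """
<table class="table table-bordered table-responsive">
<tr><td>
<a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_1" href="CollegeDetailedInformation.aspx?Inst=866 ">A.N.A INSTITUTE (866)</a>
<a id="lnkBtnWebURL" href="" target="_blank"></a>
</td></tr>
<tr><td>
<a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_2" href="CollegeDetailedInformation.aspx?Inst=486 ">A.N.A.COLLEGE (486)</a>
<a id="lnkBtnWebURL" href="http://www.anacollege.org/index.html" target="_blank">http://www.anacollege.org/index.html</a>
</td></tr>
</table>
"""

# Matches only the ids of the anchors wrapped around the college names.
NAME_LINK_ID = re.compile(r"^ContentPlaceHolder1_dlstCollege_hlpkInstituteName_")


def college_hrefs(html):
    """Return the href of every anchor whose id matches NAME_LINK_ID."""
    soup = BeautifulSoup(html, "html.parser")
    # Passing a compiled regex as id= keeps only tags whose id matches it.
    return [a["href"].strip() for a in soup.find_all("a", id=NAME_LINK_ID)]


if __name__ == "__main__":
    print(college_hrefs(SAMPLE_HTML))
```

Filtering on id rather than href keeps working even if a college's own website URL happened to contain the text CollegeDetailedInformation, at the cost of depending on the ids the page generates.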

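I don't know Python, but the general approach is to fill an array inside the for loop, then check each entry for the substring you are filtering on and keep the entries where it is found.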

Initialize an empty array outside the loop (if Python allows an empty one), populate it in the loop, then do something like in_array (in PHP) for your filter: CollegeDetailedInformation.aspx?Inst=?.

This should be a good start while the Python gurus come along to help.
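
In Python, the approach described in this answer might look like the sketch below, which uses the in operator for the substring check rather than anything like PHP's in_array. The function name keep_matching and the sample links list (copied from the output in the question) are assumptions made for illustration.

```python
def keep_matching(links, needle="CollegeDetailedInformation.aspx?Inst="):
    """Return the links that contain needle, with surrounding whitespace removed."""
    matches = []  # empty list created outside the loop
    for link in links:
        if needle in link:  # plain substring test
            matches.append(link.strip())
    return matches


# A few hrefs taken from the output shown in the question.
links = [
    "CollegeDetailedInformation.aspx?Inst=486  ",
    "http://www.anacollege.org/index.html",
    "CollegeDetailedInformation.aspx?Inst=602  ",
    "http://www.aashlarbschool.com",
]

if __name__ == "__main__":
    print(keep_matching(links))
```

This filters after extraction, so it works on any list of strings, whereas the selector and id approaches avoid collecting the unwanted anchors in the first place.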
