<td>
<input type="hidden" name="ctl00$ContentPlaceHolder1$dlstCollege$ctl01$hdnInstituteId" id="ContentPlaceHolder1_dlstCollege_hdnInstituteId_1" value="866 " />
<a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_1" href="CollegeDetailedInformation.aspx?Inst=866 ">A.N.A INSTITUTE OF PHARMACEUTICAL SCIENCES & RESEARCH,BAREILLY (866)</a>
<br />
<b>Location:</b>
<span id="ContentPlaceHolder1_dlstCollege_lblAddress_1">13.5 km Bareilly - Delhi road, near rubber factory agras road ,Bareilly</span>
<br />
<b>Course:</b>
<span id="ContentPlaceHolder1_dlstCollege_lblCourse_1">B.Pharm,</span>
<br />
<b>Category:</b>
<span id="ContentPlaceHolder1_dlstCollege_lblInstituteType_1">Private</span>
<br />
<b>Web Address:</b>
<a id="lnkBtnWebURL" href='' target="_blank"></a>
<br />
</td>
</tr>
<tr>
<td>
<input type="hidden" name="ctl00$ContentPlaceHolder1$dlstCollege$ctl02$hdnInstituteId" id="ContentPlaceHolder1_dlstCollege_hdnInstituteId_2" value="486 " />
<a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_2" href="CollegeDetailedInformation.aspx?Inst=486 ">A.N.A.COLLEGE OF ENGINEERING & MANAGEMENT,BAREILLY (486)</a>
<br />
<b>Location:</b>
<span id="ContentPlaceHolder1_dlstCollege_lblAddress_2">13.5 Km. NH-24, Bareilly-Delhi Highway, Near Rubber Factory, Bareilly</span>
<br />
<b>Course:</b>
<span id="ContentPlaceHolder1_dlstCollege_lblCourse_2">B.Tech,M.Tech,</span>
<br />
<b>Category:</b>
<span id="ContentPlaceHolder1_dlstCollege_lblInstituteType_2">Private</span>
<br />
<b>Web Address:</b>
<a id="lnkBtnWebURL" href='http://www.anacollege.org/index.html' target="_blank">http://www.anacollege.org/index.html</a>
<br />
</td>
</tr>
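I want to extract one specific URL from this website (for example: CollegeDetailedInformation.aspx?Inst=866), but each entry in this markup contains two anchor tags, and I do not want one of them (for example: http://www.anacollege.org/index.html).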
import requests
from bs4 import BeautifulSoup

res = requests.get('https://erp.aktu.ac.in/WebPages/KYC/CollegeList.aspx?City=&CType=&Cu=&Br=&Inst=&IType=')
soup = BeautifulSoup(res.content, 'html.parser')
table = soup.find("table", attrs={'class': 'table table-bordered table-responsive'})
pagelinks = []
# Walks every anchor in the table (skipping the first), so both link types are collected
for anchor in table.find_all('a')[1:]:
    link = anchor['href']
    print(link)
    url = 'https://erp.aktu.ac.in/WebPages/KYC/' + link
    pagelinks.append(url)
print(pagelinks)
I wrote this code, but it extracts every link:
CollegeDetailedInformation.aspx?Inst=486
http://www.anacollege.org/index.html
CollegeDetailedInformation.aspx?Inst=602
http://www.aashlarbschool.com
CollegeDetailedInformation.aspx?Inst=032
http://www.abes.ac.in
CollegeDetailedInformation.aspx?Inst=290
http://www.abesit.in
CollegeDetailedInformation.aspx?Inst=913
http://www.abesitpharmacy.in
CollegeDetailedInformation.aspx?Inst=643
http://www.vitsald.com
CollegeDetailedInformation.aspx?Inst=1036
http://www.abss.edu.in
How can I solve this? I only need the CollegeDetailedInformation.aspx?Inst=? links.
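You can use a CSS selector to find only the links you need: a[href*=CollegeDetailedInformation]
The output will then contain only the CollegeDetailedInformation.aspx?Inst=... links.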
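As a minimal sketch of the CSS selector approach, the snippet below runs the selector against a trimmed copy of the markup from the question instead of the live page. The names SAMPLE_HTML, BASE_URL, and detail_links are made up for this illustration, and stripping the trailing space from each href is an assumption based on the sample markup.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Trimmed-down copy of the markup from the question (stands in for res.content).
SAMPLE_HTML = """
<table class="table table-bordered table-responsive">
<tr><td>
<a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_1" href="CollegeDetailedInformation.aspx?Inst=866 ">A.N.A INSTITUTE (866)</a>
<a id="lnkBtnWebURL" href='' target="_blank"></a>
</td></tr>
<tr><td>
<a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_2" href="CollegeDetailedInformation.aspx?Inst=486 ">A.N.A.COLLEGE (486)</a>
<a id="lnkBtnWebURL" href='http://www.anacollege.org/index.html' target="_blank">http://www.anacollege.org/index.html</a>
</td></tr>
</table>
"""

# Base of the listing page, taken from the URL in the question.
BASE_URL = "https://erp.aktu.ac.in/WebPages/KYC/"


def detail_links(html):
    """Return absolute URLs for the college detail links only."""
    soup = BeautifulSoup(html, "html.parser")
    # [href*="..."] matches anchors whose href contains the given substring.
    anchors = soup.select('a[href*="CollegeDetailedInformation"]')
    # The hrefs in the sample carry a trailing space, so strip before joining.
    return [urljoin(BASE_URL, a["href"].strip()) for a in anchors]


pagelinks = detail_links(SAMPLE_HTML)
print(pagelinks)
```

Against the real page, passing res.content to the same function should work the same way, provided the live markup matches the sample.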
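The anchor elements that link to the college detail pages have an id attribute that starts with ContentPlaceHolder1_dlstCollege_. You can therefore pass a regular expression matching that prefix as the id keyword argument to find_all(). The regular expression can be made more specific, such as "^ContentPlaceHolder1_dlstCollege_hlpkInstituteName_.*", which should match only the links on the college names. (I would remove the [1:] you put at the end, since it was probably there to filter out a link you did not want at the start. If it turns out to be needed, add it back.)

I don't know Python, but the general approach is to fill an array inside the for loop, then look for the entries that contain the substring you are filtering on, pick out their indexes, and take whatever is at those indexes.
This should be a good starting point until the Python experts weigh in.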
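The two suggestions above could be sketched roughly as follows. This is only an illustration run against a trimmed copy of the markup from the question; SAMPLE_HTML, name_id, by_id, and by_substring are names invented here, and the live page may differ from the sample.

```python
import re

from bs4 import BeautifulSoup

# Trimmed-down copy of the markup from the question (stands in for res.content).
SAMPLE_HTML = """
<table class="table table-bordered table-responsive">
<tr><td>
<a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_1" href="CollegeDetailedInformation.aspx?Inst=866 ">A.N.A INSTITUTE (866)</a>
<a id="lnkBtnWebURL" href='' target="_blank"></a>
</td></tr>
<tr><td>
<a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_2" href="CollegeDetailedInformation.aspx?Inst=486 ">A.N.A.COLLEGE (486)</a>
<a id="lnkBtnWebURL" href='http://www.anacollege.org/index.html' target="_blank">http://www.anacollege.org/index.html</a>
</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Approach 1: pass a compiled regular expression as the id keyword argument.
# find_all() keeps only the anchors whose id matches the pattern.
name_id = re.compile(r"^ContentPlaceHolder1_dlstCollege_hlpkInstituteName_")
by_id = [a["href"].strip() for a in soup.find_all("a", id=name_id)]

# Approach 2: loop over every anchor and keep the hrefs that contain
# the substring of interest.
by_substring = [
    a["href"].strip()
    for a in soup.find_all("a", href=True)
    if "CollegeDetailedInformation" in a["href"]
]

print(by_id)
print(by_substring)
```

Both lists contain the relative links only; prefixing them with https://erp.aktu.ac.in/WebPages/KYC/ as in the question's code gives the full URLs.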