How do I select a specific tag in this HTML?

Posted 2024-06-03 03:36:38


How can I select all the course titles on this page?

http://bulletin.columbia.edu/columbia-college/departments-instruction/african-american-studies/#coursestext

For example, I am trying to get every line that looks like this:

AFAS C1001 Introduction to African-American Studies. 3 points.

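My code iterates over all the school's course pages starting from the main page below, so I should be able to grab every title like the one above: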

http://bulletin.columbia.edu/columbia-college/departments-instruction/  

for page in main_page:
    sub_abbrev = page.find("div", {"class": "courseblock"})

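I have this code, but I can't figure out how to select all of the first-child ('strong') tags. I'm using the latest Python and Beautiful Soup 4 for web scraping. Let me know if anything else is needed. Thanks.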


1 answer
Answer 1 · posted 2024-06-03 03:36:38

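Iterate over the elements with the courseblock class, then, for each course, get the element with the courseblocktitle class. A working example using the select() and select_one() methods: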

import requests
from bs4 import BeautifulSoup


url = "http://bulletin.columbia.edu/columbia-college/departments-instruction/african-american-studies/#coursestext"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

for course in soup.select(".courseblock"):
    title = course.select_one("p.courseblocktitle").get_text(strip=True)
    print(title)

Prints:

AFAS C1001 Introduction to African-American Studies.3 points.
AFAS W3030 African-American Music.3 points.
AFAS C3930 (Section 3) Topics in the Black Experience: Concepts of Race and Racism.4 points.
AFAS C3936 Black Intellectuals Seminar.4 points.
AFAS W4031 Protest Music and Popular Culture.3 points.
AFAS W4032 Image and Identity in Contemporary Advertising.4 points.
AFAS W4035 Criminal Justice and the Carceral State in the 20th Century United States.4 points.
AFAS W4037 (Section 1) Third World Studies.4 points.
AFAS W4039 Afro-Latin America.4 points.
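
Since the question loops over many department pages, the same extraction could be wrapped in a small helper. The following is only a sketch: the function name extract_titles and the sample markup are made up for illustration and are not taken from the Columbia site. It also skips any courseblock that has no courseblocktitle, because select_one() returns None in that case and calling get_text() on None would raise an AttributeError.

```python
from bs4 import BeautifulSoup


def extract_titles(html):
    """Return the title text of every .courseblock that contains a p.courseblocktitle."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for course in soup.select(".courseblock"):
        title_tag = course.select_one("p.courseblocktitle")
        if title_tag is None:
            # No title inside this block: skip it rather than fail on None.
            continue
        titles.append(title_tag.get_text(strip=True))
    return titles


# Hypothetical markup, invented for this example.
sample = """
<div class="courseblock"><p class="courseblocktitle"><strong>AFAS W3030 African-American Music.</strong></p></div>
<div class="courseblock"><p class="courseblockdesc">A block without a title.</p></div>
"""

print(extract_titles(sample))
```

With a helper like this, each page fetched in the question's loop could be passed in as response.content.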

A good follow-up question from @double_j:

In the OP's example, he has a space before the points. How would you keep that? That's how the data shows on the site, even though it's not really in the source code.

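I thought about using the separator argument of the get_text() method, but that would also add an extra space before the last dot. Instead, I would join the texts of the strong elements via str.join():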

for course in soup.select(".courseblock"):
    title = " ".join(strong.get_text() for strong in course.select("p.courseblocktitle > strong"))
    print(title)
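
To see why the separator argument falls short, the three approaches can be compared offline. This is a sketch under an assumption: the markup below is a guess at how a title paragraph is structured (two strong elements, the second wrapping an em), chosen because it is consistent with the output shown earlier; it was not copied from the live page.

```python
from bs4 import BeautifulSoup

# Assumed structure of one course title; not copied from the real page.
SAMPLE_HTML = """
<div class="courseblock">
  <p class="courseblocktitle"><strong>AFAS C1001 Introduction to African-American Studies.</strong> <strong><em>3 points</em>.</strong></p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
course = soup.select_one(".courseblock")
title_tag = course.select_one("p.courseblocktitle")

# 1. strip=True alone: every text node is stripped and concatenated with no gap.
stripped = title_tag.get_text(strip=True)

# 2. separator=" ": a space goes between ALL text nodes, including "3 points" and ".".
separated = title_tag.get_text(" ", strip=True)

# 3. Join per strong element: a space appears only between the strong elements.
joined = " ".join(
    strong.get_text() for strong in course.select("p.courseblocktitle > strong")
)

print(repr(stripped))
print(repr(separated))
print(repr(joined))
```

Under this assumed markup, only the third variant yields "Studies. 3 points." with a single space and no gap before the final dot.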
