Scraping information from a challenging website with no helpful HTML structure

Posted 2024-06-26 11:13:10


I need to scrape some information from a very challenging website.

Here is an example:

<div class="overview">
        <span class="course_titles">Courses:</span> 
        <a href="/schools/courses/173/" class="course_name">Math101</a> (Math; Monday; Room 10);
        <a href="/schools/student/1388/" class="coursestudent_name">Mark</a> 17, 
        <a href="/schools/student/1401/" class="coursestudent_name">Alex</a> 18, ), 

        <a href="/schools/courses/2693/" class="course_name">English101</a> (English; Thursdays; Room 12); 
        <a href="/schools/student/1403/" class="coursestudent_name">Sarah</a> 16, 
        <a href="/schools/student/1411/" class="coursestudent_name">Nancy</a> 17, 
        <a href="/schools/student/1390/" class="coursestudent_name">Casey</a> 17 ), 
</div>

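Each course has specific students, and each student's age follows their name (those stray characters are really in the page).

I need to scrape each course together with its own students and their ages.

Unfortunately, apart from the all-encompassing div class, there is no inherent hierarchy. I tried scraping by "coursestudent_name" with BeautifulSoup and then adding every item that has the "coursestudent_name" class, but that way every course ends up with all of the students.

I wish I could change the website, but I can't. Does anyone know how I can get the right students for each course?

Thank you!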


3 Answers

You can use a little regex to get the student ages, since they are not inside any HTML tag:

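import re
from bs4 import BeautifulSoup

# `html` is the markup string from the question.
soup = BeautifulSoup(html, "html.parser")
allA = soup.find("div", {"class" : "overview"}).find_all("a")

classInfo = {}
currentClass = None
for item in allA:
    if item['class'] == ['course_name']:
        # A course link starts a new group.
        classInfo[item.text] = []
        currentClass = item.text
    else:
        # The age is the number right after the student's closing </a> in the raw HTML.
        age = int(re.search(re.escape(item.text) + r"</a> (\d+)", html).group(1))
        classInfo[currentClass].append((item.text, age))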


print(classInfo)

This outputs (dictionary key order may vary with the Python version):

{'English101': [('Sarah', 16), ('Nancy', 17), ('Casey', 17)], 'Math101': [('Mark', 17), ('Alex', 18)]}
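One caveat with looking the age up by name in the whole page: `re.search` returns the first match, so two students with the same name in different courses would both get the first one's age. Below is a minimal regex-only sketch that instead walks the anchors in document order. It assumes the `class` attribute is always the last attribute before the link text, as in the sample; the names `page`, `ANCHOR` and `students_by_course` are made up for this illustration.

```python
import re

# Sample markup copied from the question.
page = '''<div class="overview">
        <span class="course_titles">Courses:</span>
        <a href="/schools/courses/173/" class="course_name">Math101</a> (Math; Monday; Room 10);
        <a href="/schools/student/1388/" class="coursestudent_name">Mark</a> 17,
        <a href="/schools/student/1401/" class="coursestudent_name">Alex</a> 18, ),

        <a href="/schools/courses/2693/" class="course_name">English101</a> (English; Thursdays; Room 12);
        <a href="/schools/student/1403/" class="coursestudent_name">Sarah</a> 16,
        <a href="/schools/student/1411/" class="coursestudent_name">Nancy</a> 17,
        <a href="/schools/student/1390/" class="coursestudent_name">Casey</a> 17 ),
</div>'''

# One pattern for both kinds of anchor. The optional last group only
# captures something when digits directly follow the closing tag,
# which in the sample happens for student links only.
ANCHOR = re.compile(
    r'class="(course_name|coursestudent_name)">([^<]+)</a>\s*(\d+)?'
)


def students_by_course(markup):
    """Group (name, age) pairs under the most recent course anchor."""
    courses = {}
    current = None
    for kind, text, age in ANCHOR.findall(markup):
        if kind == "course_name":
            current = text
            courses[current] = []
        elif current is not None and age:
            courses[current].append((text, int(age)))
    return courses


print(students_by_course(page))
```

Because each age is taken from the text right next to its own anchor, repeated names are not a problem. Regexes over HTML stay fragile, though, so this is only reasonable while the page keeps this exact shape.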

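It would help if you could edit your question to tell us exactly what you are looking for. In the meantime, here is a basic example of how to get the data from this page.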

from bs4 import BeautifulSoup
import re

html = '''<div class="overview">
        <span class="course_titles">Courses:</span> 
        <a href="/schools/courses/173/" class="course_name">Math101</a> (Math; Monday; Room 10);
        <a href="/schools/student/1388/" class="coursestudent_name">Mark</a> 17, 
        <a href="/schools/student/1401/" class="coursestudent_name">Alex</a> 18, ), 

        <a href="/schools/courses/2693/" class="course_name">English101</a> (English; Thursdays; Room 12); 
        <a href="/schools/student/1403/" class="coursestudent_name">Sarah</a> 16, 
        <a href="/schools/student/1411/" class="coursestudent_name">Nancy</a> 17, 
        <a href="/schools/student/1390/" class="coursestudent_name">Casey</a> 17 ), 
</div>'''

soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all('a')

dict_courseinfo = {}
dict_key = ''
stu_lst = []

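for n, link in enumerate(all_links):
    if link.get('class')[0] == 'course_name':
        # A new course starts: store the students collected for the previous one.
        if n > 0:
            dict_courseinfo[dict_key] = stu_lst
            stu_lst = []
        dict_key = str(link.text)
    else:
        # The age is the number right after the student's closing </a> in the raw HTML.
        age = int(re.search(re.escape(link.text) + r"</a> (\d+)", html).group(1))
        stu_lst.append((str(link.text), age))

# Store the students of the last course.
dict_courseinfo[dict_key] = stu_lst

print(dict_courseinfo)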

This outputs:

{'Math101': [('Mark', 17), ('Alex', 18)], 'English101': [('Sarah', 16), ('Nancy', 17), ('Casey', 17)]}

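You don't need a regex. Just select the student anchor tags to get the names, call next_sibling to get the text that follows each one, then split and strip that text to get the age. Finding the course_name anchor that precedes each coursestudent_name anchor gives you the matching course as well: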

h = """<div class="overview">
        <span class="course_titles">Courses:</span>
        <a href="/schools/courses/173/" class="course_name">Math101</a> (Math; Monday; Room 10);
        <a href="/schools/student/1388/" class="coursestudent_name">Mark</a> 17,
        <a href="/schools/student/1401/" class="coursestudent_name">Alex</a> 18, ),

        <a href="/schools/courses/2693/" class="course_name">English101</a> (English; Thursdays; Room 12);
        <a href="/schools/student/1403/" class="coursestudent_name">Sarah</a> 16,
        <a href="/schools/student/1411/" class="coursestudent_name">Nancy</a> 17,
        <a href="/schools/student/1390/" class="coursestudent_name">Casey</a> 17 ),
</div>"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(h, "html.parser")


# For each student link: the preceding course link, the student's name,
# and the age taken from the text node that follows the link.
data = [
    [
        a.find_previous("a", "course_name").text,
        a.text,
        a.next_sibling.split()[0].strip(","),
    ]
    for a in soup.select("div.overview a.coursestudent_name")
]
print(data)

This outputs:

[['Math101', 'Mark', '17'], ['Math101', 'Alex', '18'], ['English101', 'Sarah', '16'], ['English101', 'Nancy', '17'], ['English101', 'Casey', '17']]
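
Since the question asks for each course with its own students, the flat [course, name, age] rows can be regrouped per course afterwards. A small sketch, assuming rows in exactly the shape shown above (ages still strings); `rows` and `group_rows` are names invented for this example:

```python
from collections import defaultdict

# Rows in the [course, name, age] shape produced above.
rows = [
    ["Math101", "Mark", "17"],
    ["Math101", "Alex", "18"],
    ["English101", "Sarah", "16"],
    ["English101", "Nancy", "17"],
    ["English101", "Casey", "17"],
]


def group_rows(rows):
    """Collect (name, age) tuples per course, converting ages to int."""
    grouped = defaultdict(list)
    for course, name, age in rows:
        grouped[course].append((name, int(age)))
    return dict(grouped)


print(group_rows(rows))
```

This gives the same course-to-students mapping as the other two answers, without running a regex over the raw HTML.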
