How to scrape all child tags of a specific tag from the Google Scholar website

Posted 2024-09-28 17:30:43


I am trying to scrape data from a Google Scholar profile, for example this site.

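While browsing the site, I want to scrape the names of all co-authors from the div tags with class gsc_vcd_value, but I can't do that directly, so I am walking down the tags step by step. My actual problem is this: down to the div tag with id gs_md_cita-d-bdy, I can scrape everything contained in a given tag (that is, all of its child tags). But when I try the same thing for the div tag with id gs_md_cita-l, I only get the tag itself back, without any of its child tags. What am I missing?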

When I print r_tag, I get:

<div class="gs_md_bdy" id="gs_md_cita-d-bdy"><style>
#gs_md_cita-d{width:90%;max-width:1000px;}
.gs_el_ph #gs_md_cita-d{width:100%;max-width:none;}
#gs_md_cita-d .gs_md_prg{min-height:600px;}
#gs_md_cita-title,#gs_md_cita-b-edit,#gs_md_cita-b-trash,#gs_md_cita-b-upload,#gs_md_cita-b-rstr,#gs_md_cita-b-delf,#gs_md_cita-b-save{display:none;}
.gs_md_cita-view #gs_md_cita-b-edit,.gs_md_cita-view #gs_md_cita-b-trash,.gs_md_cita-view.gs_md_cita-allow_upload #gs_md_cita-b-upload,.gs_md_cita-upload #gs_md_cita-title,.gs_md_cita-trash #gs_md_cita-b-rstr,.gs_md_cita-trash #gs_md_cita-b-delf,.gs_md_cita-edit #gs_md_cita-b-save{display:inline-block;}
#gs_md_cita-b-trash,#gs_md_cita-b-upload,#gs_md_cita-b-delf{margin-left:16px;}
</style><div aria-live="assertive" id="gs_md_cita-l"></div></div>...

and so on: basically everything inside the tag, including its child tags. But when I print s_tag, I get:

[<div aria-live="assertive" id="gs_md_cita-l"></div>]

[<div aria-live="assertive" id="gs_md_cita-l"></div>]

[<div aria-live="assertive" id="gs_md_cita-l"></div>]

Each iteration prints only the tag itself.
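The behaviour in the output above can be reproduced offline. This is a minimal sketch, assuming the downloaded HTML really contains an empty gs_md_cita-l div, as the printed output suggests. The shortened markup and the describe helper are hypothetical; only the ids come from the question. It shows that a lookup returns the tag itself, and that an empty tag has no child tags to return.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened copy of the printed markup; only the ids are from the question.
HTML = (
    '<div class="gs_md_bdy" id="gs_md_cita-d-bdy">'
    '<style>#gs_md_cita-d{width:90%;}</style>'
    '<div aria-live="assertive" id="gs_md_cita-l"></div>'
    '</div>'
)


def describe(tag):
    """Return the names of a tag's direct child tags (text nodes are skipped)."""
    return [child.name for child in tag.find_all(recursive=False)]


page = BeautifulSoup(HTML, "html.parser")

# ids are meant to be unique within a page, so one find() per id is enough here.
outer = page.find("div", id="gs_md_cita-d-bdy")
inner = page.find("div", id="gs_md_cita-l")

outer_children = describe(outer)  # child tags of the outer div
inner_children = describe(inner)  # the inner div is empty in this markup

print(outer_children)
print(inner_children)
print(inner)
```

If the real page matches this snippet, the child tags are simply not present in the HTML that was downloaded, which would explain why only the tag itself is printed.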

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup


class Scraper():
    def __init__(self, url, maxP):
        self.url= url
        self.maxP = maxP
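    def f(self):
        # Paging limit: the first multiple of 100 above maxP (assumed fallback: 1000).
        pageSize = 1000
        for i in range(0,1000,100):
            if (self.maxP<i):
                pageSize=i
                break

        for j in range(0, pageSize, 100):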

            S_url=self.url + "&cstart=" + str(j) +"&pagesize=100"

            my_url = uReq(S_url)
            page_html = my_url.read()
            my_url.close()

            page_soup = soup(page_html, "lxml")

            aTag = page_soup.findAll('td', {'class': 'gsc_rsb_std'})

            Titles = page_soup.findAll('td', {'class': 'gsc_a_t'})

            Citations = page_soup.findAll('td', {'class': 'gsc_a_c'})

            Years = page_soup.findAll('td', {'class': 'gsc_a_y'})

            info_page = page_soup.findAll('a', {'class' : 'gsc_a_at'})

            for author in info_page:
                Author_names_link = author["data-href"]
                user=Author_names_link[53:65]
                n_input=Author_names_link[-12:]

                n_author_url = ("https://scholar.google.com.au/citations?user=" + user
                                + "&hl=en#d=gs_md_cita-d&u=%2Fcitations%3Fview_op%3Dview_citation"
                                + "%26hl%3Den%26user%3D" + user
                                + "%26citation_for_view%3D" + user + "%3A" + n_input
                                + "%26tzom%3D-330")

                author_url=uReq(n_author_url)

                n_page=author_url.read()

                author_url.close()

                n_page_soup=soup(n_page, "html.parser")

                n_tag=n_page_soup

                m_tag=n_tag.findAll('div', {'id': 'gs_top'})

                for i in m_tag:

                    p_tag=i.findAll('div', {'data-h': '800'})

                    for j in p_tag:

                        q_tag=j.findAll('div', {'id': 'gs_md_cita-d'})

                        for k in q_tag:

                            r_tag=k.findAll('div', {'id': 'gs_md_cita-d-bdy'})

                            for l in r_tag:

                                s_tag=l.findAll('div', {'id': 'gs_md_cita-l'})
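
The URL assembled by hand in the loop above carries a percent-encoded inner URL after the # sign. As a rough sketch, the same shape can be built with urllib.parse.quote. The build_citation_url helper and the USER_ID / CITATION_ID placeholders are hypothetical; only the URL layout is taken from the code above. The part after # is a URL fragment, which an HTTP client keeps to itself rather than sending to the server, and urldefrag makes that split visible.

```python
from urllib.parse import quote, urldefrag


def build_citation_url(user, citation_id):
    """Assemble the same URL shape as the loop above (both arguments are placeholders)."""
    inner = ("/citations?view_op=view_citation&hl=en"
             f"&user={user}&citation_for_view={user}:{citation_id}&tzom=-330")
    # safe="" makes quote() percent-encode '/', '?', '&', '=' and ':' as in the hand-written string.
    return ("https://scholar.google.com.au/citations"
            f"?user={user}&hl=en#d=gs_md_cita-d&u={quote(inner, safe='')}")


url = build_citation_url("USER_ID", "CITATION_ID")
requested, fragment = urldefrag(url)

print(requested)  # the part that is actually requested from the server
print(fragment)   # the part that stays on the client side
```

Under these assumptions, the request only covers the profile page itself, so the citation dialog addressed by the fragment may never be part of the response that BeautifulSoup parses.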
