Can I use Python with Selenium to get data from a <b> tag inside an <a> tag?

Posted 2024-09-28 19:31:47


Can I use Python with Selenium to get the data in a <b> tag nested under an <a> tag?

If so, how? Can you show me a solution?

Here is the structure of the HTML:

...
<div class = "cont_inner">
  <div class = "wrap_tit_ mg_tit">
    <a href = "href="https://cp.news.search.daum.net/p/97048679" class"f_link_b" onclick="smartLog(this, "dc=NNS&d=26DQnlvsWTMHk5CtBf&pg=6&r=2&p=4&rc=10&e1=163cv75CcAF31EvlGD&e3=0&ext=dsid=26DQnlvsWTMHk5CtBf", event, {"cpid": {"value": "163cv75CcAF31EvlGD"}});" target = "_blank">

        "하남지역자활센터,"
        <b>보건복지부</b>
        "간이평가 우수기관"
    </a>
</div>

I want to get data like this:


"하남지역자활센터, 보건복지부 간이평가우수기관"

This is what my code currently returns:

[['"하남지역자활센터, , 간이평가 우수기관"']]

Here is the source code I use to scrape the data from the site:

import lxml.html
import requests
from selenium import webdriver


class crwaler_daum:
    def __init__(self):
        self.title = []
        self.body = []
        self.url = input("please enter url for crawling data")
        self.page = input('please enter number of page to get data')

    def get_title(self):
        return self.title

    def set_title(self, title):
        self.title.append(title)

    def get_body(self):
        return self.body

    def set_body(self, body):
        self.body.append(body)

    def crwaling_title(self):
        title_list = []
        chrome_driver = webdriver.Chrome('D:/바탕 화면/인턴/python/crwaler/news_crawling/chromedriver.exe')
        url = self.url
        # the static HTML is fetched once with requests and parsed with lxml
        response = requests.get(url, verify=False)
        root = lxml.html.fromstring(response.content)
        chrome_driver.get(url)

        for i in range(int(self.page) + 1):
            # a/text() only returns the anchor's direct text nodes,
            # so the content of the nested <b> tag goes missing
            for j in root.xpath('//*[@id="clusterResultUL"]/li'):
                title_list.append(j.xpath('div[2]/div/div[1]/a/text()'))
            chrome_driver.get('https://search.daum.net/search?w=news&DA=PGD&enc=utf8&cluster=y&cluster_page=3&q=%EB%B3%B4%EA%B1%B4%EB%B3%B5%EC%A7%80%EB%B6%80&p={}'.format(i))

        print(title_list)


2 Answers

lxml has a built-in method, .text_content(), which "returns the text content of the element, including the text content of its children, with no markup." After calling it you still have to massage the string into exactly the shape you want. I hope the code below makes this clearer. It may not be very practical, since I'm a Python beginner myself, but it solves the problem for now.

import lxml.html

html = '''
<div class = "cont_inner">
    <div class = "wrap_tit_ mg_tit">
        <a href = "href="https://cp.news.search.daum.net/p/97048679" class"f_link_b" onclick="smartLog(this, "dc=NNS&d=26DQnlvsWTMHk5CtBf&pg=6&r=2&p=4&rc=10&e1=163cv75CcAF31EvlGD&e3=0&ext=dsid=26DQnlvsWTMHk5CtBf", event, {"cpid": {"value": "163cv75CcAF31EvlGD"}});" target = "_blank">
            "하남지역자활센터,"
            <b>보건복지부</b>
            "간이평가 우수기관"
        </a>
</div>'''


my_html = lxml.html.fromstring(html)
a_element = my_html.xpath('//div[@class="wrap_tit_ mg_tit"]/a')
print(a_element[0].text_content())

"""
Prints:

            "하남지역자활센터,"
            보건복지부
            "간이평가 우수기관"

"""


def prettify_string(string):
    # drop newlines and the literal quote characters, then split on spaces
    string = string.replace("\n", "").replace("\"", "").split(" ")
    # throw away the empty strings left behind by the indentation
    while "" in string:
        string.remove("")
    # rejoin the remaining words with single spaces
    string = " ".join(string)
    return string


print(prettify_string(str(a_element[0].text_content())))

"""
Prints:
하남지역자활센터, 보건복지부 간이평가 우수기관
"""

I haven't used an lxml crawler before, but you can do it with BeautifulSoup:

from bs4 import BeautifulSoup
from selenium import webdriver

chrome_driver = webdriver.Chrome('your chromedriver path')

chrome_driver.get('your url')

# Selenium renders the page; hand the finished HTML to BeautifulSoup
html = chrome_driver.page_source

soup = BeautifulSoup(html, 'html.parser')
b_tag = soup.find_all('b')  # every <b> element on the page
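
find_all('b') only collects the <b> elements themselves, so you would still need to stitch them back together with the surrounding anchor text. It is easier to select the whole <a> and let get_text() flatten it. A minimal sketch, assuming the anchor really carries the class f_link_b (the class attribute in the posted HTML is garbled, so adjust the selector to the real page):

# assumes 'soup' from the snippet above
for a in soup.select('a.f_link_b'):
    # get_text() returns all descendant text, <b> included; split()/join
    # collapses whitespace, and replace() strips the literal quotes
    # that appear in the pasted snippet
    print(' '.join(a.get_text().replace('"', '').split()))
    # -> 하남지역자활센터, 보건복지부 간이평가 우수기관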

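Finally, since the title asks about Selenium itself: a second parser is not strictly required, because a Selenium WebElement's .text property already returns the element's visible text with child tags such as <b> flattened in. A minimal sketch, assuming Selenium 4 and the same (possibly garbled) f_link_b class on the anchor:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # Selenium 4 can locate the driver on its own

driver.get('your url')

# .text includes the visible text of child elements, so the <b>
# content comes along with the rest of the title automatically
for a in driver.find_elements(By.CSS_SELECTOR, 'a.f_link_b'):
    print(' '.join(a.text.split()))

driver.quit()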