How to scrape an HTML tag's attribute with BeautifulSoup (bs4) (I don't want the text)

Posted 2024-10-03 09:18:09


<div class="tioTrivia lightblue bottomRight show sticky" data-login-url="http://www.ntvspor.net/uyelik/giris?returnUrl=/haber/futbol/131009/uniteda-yeni-arjantinli?utm_source=ntvspor%26utm_medium=oyun%26utm_campaign=iste_oyun" data-article-url="/haber/futbol/131009/uniteda-yeni-arjantinli?utm_source=ntvspor&utm_medium=oyun&utm_campaign=iste_oyun&ref=isteoyun" data-profile-url="http://www.ntvspor.net/uyelik/profil" data-content-class="trivia-widget-position" data-start-place="bottom-right" data-show-points="true" data-article-id="Tivibu,Manşet,Futbol,Futbol,Spor Toto Süper Lig,Beşiktaş,Gençlerbirliği" style="transition: opacity 0.5s ease-in-out 0s, right 0.5s ease 0s; top: 832px;">

This HTML is my target. I want to scrape this part:

data-article-id="Tivibu,Manşet,Futbol,Futbol,Spor Toto Süper Lig,Beşiktaş,Gençlerbirliği"

Specifically, I need this value:

"Tivibu,Manşet,Futbol,Futbol,Spor Toto Süper Lig,Beşiktaş,Gençlerbirliği"

I wrote this function, but it does not return anything:

def read_tags(self, news_url):
    try:
        self.checkRequests(news_url)
        tag = self.soup.find("div", {'class':'tioTrivia lightblue bottomRight show sticky'})
        if tag:
            tag = tag.get_text().encode(encoding='utf-8')
            return tag.lower()
        return
    except Exception, e:
        self.insertErrorLog('ntvspor.read_title', news_url, e)

2 Answers

It is as simple as:

for t in soup.select('.tioTrivia'):
    print t.get('data-article-id')

With your code and the sample HTML, tag.get_text() returns an empty string because the div tag has no inner text.

Why not read the value of the data-article-id attribute directly from the matched tag?

from bs4 import BeautifulSoup

soup = BeautifulSoup('''<div class="tioTrivia lightblue bottomRight show sticky" data-login-url="http://www.ntvspor.net/uyelik/giris?returnUrl=/haber/futbol/131009/uniteda-yeni-arjantinli?utm_source=ntvspor%26utm_medium=oyun%26utm_campaign=iste_oyun" data-article-url="/haber/futbol/131009/uniteda-yeni-arjantinli?utm_source=ntvspor&utm_medium=oyun&utm_campaign=iste_oyun&ref=isteoyun" data-profile-url="http://www.ntvspor.net/uyelik/profil" data-content-class="trivia-widget-position" data-start-place="bottom-right" data-show-points="true" data-article-id="Tivibu,Manşet,Futbol,Futbol,Spor Toto Süper Lig,Beşiktaş,Gençlerbirliği" style="transition: opacity 0.5s ease-in-out 0s, right 0.5s ease 0s; top: 832px;">''')
data = soup.find('div', class_='tioTrivia').get('data-article-id', '')
data = data.encode('utf8')

>>> data
'Tivibu,Man\xc5\x9fet,Futbol,Futbol,Spor Toto S\xc3\xbcper Lig,Be\xc5\x9fikta\xc5\x9f,Gen\xc3\xa7lerbirli\xc4\x9fi'
>>> print data
Tivibu,Manşet,Futbol,Futbol,Spor Toto Süper Lig,Beşiktaş,Gençlerbirliği
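
The session above uses Python 2 syntax (the print statement and a byte-string result). A minimal Python 3 sketch of the same idea might look like the following. The function name read_article_id, the shortened sample HTML, and the explicit "html.parser" argument are assumptions made for illustration and are not part of the original question or answer.

```python
from bs4 import BeautifulSoup

# Shortened sample (assumption): only the attributes relevant here are kept.
SAMPLE_HTML = (
    '<div class="tioTrivia lightblue bottomRight show sticky" '
    'data-article-id="Tivibu,Manşet,Futbol,Futbol,Spor Toto Süper Lig,'
    'Beşiktaş,Gençlerbirliği"></div>'
)


def read_article_id(html):
    """Return data-article-id of the first div with class tioTrivia, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("div", class_="tioTrivia")
    if tag is None:
        return None
    # The value is stored in an attribute of the tag, not in its text content.
    return tag.get("data-article-id")


# get_text() only sees text nodes inside the div, and there are none.
inner_text = BeautifulSoup(SAMPLE_HTML, "html.parser").div.get_text()
article_id = read_article_id(SAMPLE_HTML)

print(repr(inner_text))
print(article_id)
```

In Python 3 the attribute value is already a str, so the encode('utf8') step is only needed if bytes are explicitly required downstream.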

Also, you do not need to specify every value of the class attribute. In this case tioTrivia should be enough, since the others (lightblue bottomRight show sticky) are purely presentational.
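
As a rough illustration of the class-matching point, the sketch below compares a few ways of locating the div. The one-line HTML and the "demo" attribute value are made up for the example, and Python 3 with the built-in "html.parser" is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup used only for this comparison.
html = (
    '<div class="tioTrivia lightblue bottomRight show sticky" '
    'data-article-id="demo"></div>'
)
soup = BeautifulSoup(html, "html.parser")

# Matching on a single class is enough to find the tag.
by_one_class = soup.find("div", class_="tioTrivia")

# A CSS selector can also require that the attribute is present.
by_css = soup.select_one("div.tioTrivia[data-article-id]")

# A multi-class string is compared against the exact attribute value,
# so the same classes in a different order do not match.
reordered = soup.find(
    "div", class_="lightblue tioTrivia bottomRight show sticky"
)

print(by_one_class["data-article-id"])
print(by_css["data-article-id"])
print(reordered)
```

Searching on one stable class is less brittle than the full class string, which stops matching as soon as the page adds, removes, or reorders a class.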
