How do I scrape "title" attributes when there are several of them under the same HTML tree?

import re, requests
from bs4 import BeautifulSoup

nyaa_link = 'https://nyaa.si/'
request = requests.get(nyaa_link, headers={'User-Agent': 'Mozilla/5.0'})
source = request.content
soup = BeautifulSoup(source, 'lxml')

#GETTING TORRENT NAMES
title = []
rows = soup.findAll("td", colspan="2")
for row in rows:
    title.append(row.content)

#GETTING MAGNET LINKS
magnets = []
for link in soup.findAll('a', attrs={'href': re.compile("^magnet")}):
    magnets.append(link.get('href'))
print(magnets)

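2 answers

User

Answer #1 · edited 2024-09-27 19:32:39

You need to extract the title from the link inside each table data cell. Because every <td> here contains an <a>, you can simply call td.find('a')['title'].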

import re, requests
from bs4 import BeautifulSoup

nyaa_link = 'https://nyaa.si/'
request = requests.get(nyaa_link, headers={'User-Agent': 'Mozilla/5.0'})
source = request.content
soup = BeautifulSoup(source, 'lxml')

#GETTING TORRENT NAMES
title = []
rows = soup.findAll("td", colspan="2")
for row in rows:
    #UPDATED CODE
    desired_title = row.find('a')['title']
    if 'comment' not in desired_title:
        title.append(desired_title)

#GETTING MAGNET LINKS
magnets = []
for link in soup.findAll('a', attrs={'href': re.compile("^magnet")}):
    magnets.append(link.get('href'))
print(magnets)

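User

Answer #2 · edited 2024-09-27 19:32:39

So I have solved the problem and found a solution.

The problem was this line: if 'comment' not in desired_title:

It only handles rows whose first link title does not contain "comment". The problem was the way I was scraping the page's HTML structure: if a torrent has comments, the comments link appears in the HTML before the link that carries the title. As a result, my code skipped torrents with comments entirely.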
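To make the failure concrete, here is a small offline sketch that reproduces it. The markup in SAMPLE_HTML is hand-written for illustration and is only an assumption about the cell layout described above (a comments link placed before the title link); it is not copied from the live site. It uses the built-in 'html.parser' so that no network access or lxml install is needed.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption, not taken from the live page):
# the second cell has a comments link placed before the title link.
SAMPLE_HTML = """
<table>
  <tr><td colspan="2"><a href="/view/1" title="Torrent A">Torrent A</a></td></tr>
  <tr><td colspan="2">
    <a href="/view/2#comments" class="comments" title="3 comments">3</a>
    <a href="/view/2" title="Torrent B">Torrent B</a>
  </td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = soup.find_all("td", colspan="2")

# find('a') returns only the FIRST link in each cell.
first_titles = [row.find("a")["title"] for row in rows]

# The filter from answer #1 drops every cell whose first link is the comments link,
# so "Torrent B" never reaches the result.
kept = [t for t in first_titles if "comment" not in t]

print(first_titles)  # ['Torrent A', '3 comments']
print(kept)          # ['Torrent A']
```

The second torrent is lost because its title sits on the second link in the cell, which find('a') never looks at.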

Here is a working solution:

import re, requests
from bs4 import BeautifulSoup

nyaa_link = 'https://nyaa.si/?q=test'
request = requests.get(nyaa_link)
source = request.content
soup = BeautifulSoup(source, 'lxml')

#GETTING TORRENT NAMES
title = []
n = 0
rows = soup.findAll("td", colspan="2")
for row in rows:
    if 'comment' in row.find('a')['title']:
        desired_title = row.findAll('a', title=True)[1].text
        print(desired_title)
        title.append(desired_title)
        n = n+1
    else:
        desired_title = row.find('a')['title']
        title.append(desired_title)
        print(row.find('a')['title'])
        print('\n')
#print(title)

#GETTING MAGNET LINKS
magnets = []
for link in soup.findAll('a', attrs={'href': re.compile("^magnet")}):
    magnets.append(link.get('href'))
#print(magnets)

#GETTING NUMBER OF MAGNET LINKS AND TITLES
print('Number of rows', len(rows))
print('Number of magnet links', len(magnets))
print('Number of titles', len(title))
print('Number of removed', n)

Thanks to CannedScientist for providing some of the code needed for the solution.
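As a possible simplification, the two branches can be collapsed if one assumes the title link is always the last <a> with a title attribute in each cell, which matches the layout described in this answer but has not been verified against every page. The helper name extract_titles and the sample markup below are hypothetical, written only for this sketch; unlike the code above, it reads the title attribute in both cases instead of using .text for rows with comments.

```python
from bs4 import BeautifulSoup


def extract_titles(html):
    """Return one title per <td colspan="2"> cell, read from its last titled link.

    Assumption: when a comments link is present it comes BEFORE the title link,
    so the last <a title=...> in the cell is the one that carries the torrent name.
    """
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for cell in soup.find_all("td", colspan="2"):
        links = cell.find_all("a", title=True)
        if links:  # skip cells that contain no titled link at all
            titles.append(links[-1]["title"])
    return titles


# Hand-written sample markup (an assumption, not taken from the live page).
SAMPLE_HTML = """
<table>
  <tr><td colspan="2"><a href="/view/1" title="Torrent A">Torrent A</a></td></tr>
  <tr><td colspan="2">
    <a href="/view/2#comments" class="comments" title="3 comments">3</a>
    <a href="/view/2" title="Torrent B">Torrent B</a>
  </td></tr>
</table>
"""

print(extract_titles(SAMPLE_HTML))  # ['Torrent A', 'Torrent B']
```

To try it against the real page, pass request.content from the code above to extract_titles and compare the number of titles with the number of magnet links.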
