How do I scrape the subheadings from this link?

Published 2024-06-28 15:18:32



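I built a web scraper that scrapes data (the tables) from pages like this one: https://www.techpowerup.com/gpudb/2/

The problem is that, for some reason, my program only scrapes the values and not the subheadings. For example (click the link), it scrapes "R420", "130nm", "160000000", and so on, but not "GPU Name", "Process Size", "Transistors", and so on.

What should I add to the code so that it scrapes the subheadings too? Here is my code: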

import csv
import requests
import bs4

url = "https://www.techpowerup.com/gpudb/2"
# Assumption: `count` is never defined in this snippet; the id at the end of the URL is used here.
count = 2

# obtain HTML and parse through it
response = requests.get(url)
html = response.content
soup = bs4.BeautifulSoup(html, "lxml")
tables = soup.find_all("table")

# reading every value in every row in each table and making a matrix
tableMatrix = []
for table in tables:
    list_of_rows = []
    for row in table.find_all('tr'):
        list_of_cells = []
        for cell in row.find_all('td'):
            # BeautifulSoup decodes &nbsp; to '\xa0', so strip that character
            text = cell.text.replace('\xa0', '')
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)
    tableMatrix.append((list_of_rows, list_of_cells))

# (YOU CAN PROBABLY IGNORE THIS) placeHolder used to avoid duplicate data from appearing in list
placeHolder = 0
excelTable = []
for table in tableMatrix:
    for row in table:
        if placeHolder == 0:
            for entry in row:
                excelTable.append(entry)
            placeHolder = 1
        else:
            placeHolder = 0
    excelTable.append('\n')

for value in excelTable:
    print(value)
    print('\n')

# create the output file and write the values into a csv
with open(str(count) + '.csv', 'w', newline='') as fl:
    writer = csv.writer(fl)
    for values in excelTable:
        writer.writerow(values)


1 answer

Answer 1 · Posted 2024-06-28 15:18:32

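If you inspect the page source, you will see that those cells are header cells, so they use TH tags rather than TD tags. You will probably need to update your loop so that it includes the TH cells along with the TD cells.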
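As a rough sketch of that change: BeautifulSoup's `find_all` accepts a list of tag names, so a single call can return the `th` and `td` cells of a row in document order. The `SAMPLE_HTML` string and the `extract_rows` helper below are hypothetical stand-ins for illustration; the real page's markup may be laid out differently, and `html.parser` is used only so the snippet runs without `lxml`.

```python
import bs4

# Hypothetical sample markup standing in for one table on the page;
# the real page's HTML may be structured differently.
SAMPLE_HTML = """
<table>
  <tr><th>GPU Name</th><td>R420</td></tr>
  <tr><th>Process Size</th><td>130nm</td></tr>
  <tr><th>Transistors</th><td>160000000</td></tr>
</table>
"""


def extract_rows(html):
    """Return each table row as a list of cell texts, header cells included."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.find_all("tr"):
        # A list of tag names matches both <th> and <td>, in document order.
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        rows.append(cells)
    return rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

In the question's code, the equivalent change would be replacing the `'td'` argument in the innermost loop with `['th', 'td']`.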
