Web scraping the contents of a table

URL='someurl.com'
def datascrape(url):
    page=requests.get(url)
    print ("requesting page")
    soup = BeautifulSoup(page.content, "html.parser")
    return(soup)

soup=datascrape(URL)

results = {}
for row in soup.findAll('tr'):
    aux = row.findAll('td')
    try:
        if "Status" in (aux.stripped_strings):
            key=(aux[0].strings)
            value=(aux[1].string)
            results[key] = value
    except:
        pass
print (results)

2 answers

User

#1 · edited 2024-09-27 19:23:33

I'm not sure why you're using findAll() rather than find_all(), since I'm still fairly new to web scraping, but I think this will give you the result you're after.

import requests
from bs4 import BeautifulSoup

URL='http://sitem.herts.ac.uk/aeru/bpdb/Reports/2070.html'

def datascrape(url):
    # Download the page and parse it into a BeautifulSoup tree
    page=requests.get(url)
    print("requesting page")
    soup = BeautifulSoup(page.content, "html.parser")
    return soup

soup=datascrape(URL)

results = {}
table_rows = soup.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    # Stripped text of every cell in this row
    row = [i.text.strip() for i in td]
    # Keep rows that mention "Status" and have both a key cell and a value cell
    if len(row) >= 2 and any("Status" in i for i in row):
        results[row[0]] = row[1]
print(results)
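
If you want to check the row filter without hitting the site, here is a minimal offline sketch of the same idea. The sample markup and the helper name rows_mentioning are made up for illustration and are not taken from the real page, so the actual table may be laid out differently.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, NOT copied from the real page
SAMPLE_HTML = """
<table>
  <tr><th>Property</th><th>Value</th></tr>
  <tr><td>Name</td><td>Example substance</td></tr>
  <tr><td> Status </td><td> Not applicable </td></tr>
  <tr><td>Status only</td></tr>
</table>
"""

def rows_mentioning(html, keyword):
    """Map first cell -> second cell for every row whose cells mention keyword."""
    soup = BeautifulSoup(html, "html.parser")
    found = {}
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # Rows with fewer than two <td> cells (headers, spacers) are skipped
        if len(cells) >= 2 and any(keyword in cell for cell in cells):
            found[cells[0]] = cells[1]
    return found

sample_results = rows_mentioning(SAMPLE_HTML, "Status")
print(sample_results)
```

The length check does the job the try block was meant to do: a row with a single cell can never raise an IndexError on the second cell.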

Hope this helps!

User

#2 · edited 2024-09-27 19:23:33

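If you are only after the Status and Not Applicable pair, you can use positional nth-of-type CSS selectors. This relies on that row sitting in the same position on every page you scrape.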

import requests
from bs4 import BeautifulSoup

url ='https://sitem.herts.ac.uk/aeru/bpdb/Reports/2070.htm'
page=requests.get(url)
# The "lxml" parser requires the lxml package to be installed
soup = BeautifulSoup(page.content, "lxml")
# Cells of the first row of the second table among its siblings
tdCells = [item.text.strip() for item in soup.select('table:nth-of-type(2) tr:nth-of-type(1) td')]
results = {tdCells[0] : tdCells[1]}
print(results)
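
To see how much the selector depends on position, here is a small offline sketch. The markup and the helper name first_row_pair are hypothetical, and it uses the built-in "html.parser" instead of "lxml" so nothing extra needs installing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two sibling tables, NOT the real page
TWO_TABLES_HTML = """
<div>
  <table><tr><td>Header</td><td>ignored</td></tr></table>
  <table>
    <tr><td>Status</td><td>Not applicable</td></tr>
    <tr><td>Other</td><td>value</td></tr>
  </table>
</div>
"""

def first_row_pair(html, table_position):
    """Return {first cell: second cell} from row 1 of the table at table_position."""
    soup = BeautifulSoup(html, "html.parser")
    selector = f"table:nth-of-type({table_position}) tr:nth-of-type(1) td"
    cells = [td.get_text(strip=True) for td in soup.select(selector)]
    # An empty dict signals that the expected position did not match
    if len(cells) < 2:
        return {}
    return {cells[0]: cells[1]}

print(first_row_pair(TWO_TABLES_HTML, 2))
print(first_row_pair(TWO_TABLES_HTML, 1))
print(first_row_pair(TWO_TABLES_HTML, 3))
```

Changing the table position returns a different row or nothing at all, which is why this approach only works while the page layout stays the same.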
