Parsing the output of a scraped web page with pandas and bs4: how can I make the output more readable?

Posted 2024-09-30 18:15:23


I want to scrape this page.

I wrote this code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

res = requests.get("http://yadamp.unisa.it/showItem.aspx?yadampid=18")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0] 
df = pd.read_html(str(table))
print(df[0].to_json(orient='records'))

But the result is not what I want. The output is:

[{"0":"ID","1":"18","2":"NAME","3":"Colutellin-A Blast NCBI-PROT","4":null,"5":null},{"0":"LENGTH","1":"7","2":"DISULFIDE  BRIDGE","3":null,"4":"View PDB  \/\/ Small molecules can be embedded in the page  var glmol02 = new GLmol('glmol02');","5":null},{"0":"SEQUENCE","1":"VISIIPV","2":null,"3":null,"4":null,"5":null},{"0":"HELICITY","1":"85.70","2":"INSTAB. INDEX","3":"31.97","4":"FLEXIBILITY","5":"5.43"},{"0":"a HYD. MOM.","1":"16.35","2":"b HYD. MOM.","3":"9.04","4":"c HYD. MOM","5":"1.37"},{"0":"a MEAN HYD.  MOM.","1":"2.34","2":"b MEAN HYD.  MOM.","3":"1.29","4":"c MEAN HYD.  MOM.","5":"0.20"},{"0":"CHARGE pH5","1":"0.00","2":"CHARGE pH7","3":"0.00","4":"CHARGE pH9","5":"-0.17"},{"0":"\u0394 CHARGE pH5-pH9","1":"0.17","2":"ISOELECTRIC POINT","3":"5.49","4":"BOMAN INDEX","5":"-2.78"},{"0":"\u0394G","1":"-368","2":"CPP","3":"-027","4":"MLP","5":"-006"},{"0":"MOLECULAR VOLUME","1":null,"2":"POLARITY","3":null,"4":null,"5":null},{"0":"MIC E. coli","1":null,"2":"MIC P. aeruginosa","3":null,"4":"MIC S. typhimurium","5":null},{"0":"MIC S. aureus","1":null,"2":"MIC M. luteus","3":null,"4":"MIC B. subtilis","5":null},{"0":"MIC C. albicans","1":null,"2":"OTHER","3":"S.sclerotiorum = 30.86; B.cinerea = 10.29","4":null,"5":null},{"0":"MIC OTHER  gram+","1":null,"2":null,"3":null,"4":null,"5":null},{"0":"MIC OTHERgram-","1":null,"2":null,"3":null,"4":null,"5":null},{"0":"PHYLUM","1":"Ascomycota","2":"CLASS","3":"Sordariomycetes","4":"ORDER","5":"Glomerellales"},{"0":"FAMILY","1":"Glomerellaceae","2":"GENUS","3":"Colletotrichum","4":"SPECIES","5":"Colletotrichum dematium"},{"0":"DATE","1":"2008","2":null,"3":null,"4":null,"5":null},{"0":"TITLE PAPER","1":"Colutellin A, an immunosuppressive peptide from Colletotrichum dematium","2":null,"3":null,"4":null,"5":null}]

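As you can see, this list is hard to work with: I have to loop over a list of dictionaries and then join each label with the value stored under the next key. I would like the output to look more like: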

ID 18
Name Colutellin-A
Helicity 85.7

and so on, just something more readable. Can someone point out which part of the code I should change to improve it?

Thanks


1 answer
User
#1 · posted 2024-09-30 18:15:23

You can use pandas read_html() to fetch the table and then work with it as a pandas DataFrame; see the code below:

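import pandas as pd

url = 'http://yadamp.unisa.it/showItem.aspx?yadampid=18'
# read_html() returns a list of DataFrames, one per matching <table>
table = pd.read_html(url, attrs={
    'class': 'table table-responsive'}, header=0)
# table[0] is already a DataFrame, so it can be printed directly
print(table[0])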
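
To get the "label value" listing asked for in the question, one option is to flatten each row into pairs. This is only a minimal sketch, and it assumes the table always alternates label, value, label, value across its columns, as the JSON output in the question suggests. The helper names is_missing and label_value_pairs are made up for this illustration, and the sample rows are copied from the question's output so that the snippet runs without network access.

```python
def is_missing(cell):
    """True for None and for float NaN (how pandas marks empty cells)."""
    return cell is None or (isinstance(cell, float) and cell != cell)


def label_value_pairs(rows):
    """Flatten rows laid out as label, value, label, value, ... into pairs.

    Hypothetical helper: assumes even-positioned cells hold labels and the
    cell right after each label holds its value.
    """
    pairs = []
    for row in rows:
        cells = list(row)
        # cells[0::2] are the labels, cells[1::2] the matching values
        for label, value in zip(cells[0::2], cells[1::2]):
            if is_missing(label):
                continue  # empty slot, nothing to report
            pairs.append((str(label).strip(),
                          None if is_missing(value) else value))
    return pairs


# Sample rows copied from the output shown in the question
rows = [
    ["ID", "18", "NAME", "Colutellin-A Blast NCBI-PROT", None, None],
    ["HELICITY", "85.70", "INSTAB. INDEX", "31.97", "FLEXIBILITY", "5.43"],
    ["DATE", "2008", None, None, None, None],
]

for label, value in label_value_pairs(rows):
    print(label, "" if value is None else value)
```

With a real DataFrame, the rows can be obtained with df.values.tolist() before calling the helper. Cells that hold page residue rather than data (such as the "View PDB ..." script text in the LENGTH row) would still show up as labels, so they may need to be filtered out separately.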
