How to parse tables with no classes and keep the results grouped

Posted 2024-06-19 19:13:51


I am trying to parse the URL http://www.trimslabs.com/mic/300.htm to get the IUPAC names, MIC values, and organism strains. To some extent I can do that, although I don't know how to keep the results grouped. This is what I have so far:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

myurl = 'http://www.trimslabs.com/mic/300.htm'
uClient = uReq(myurl)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")

# grab IUPACs
tables = page_soup.findAll("table")
for i in range(1, 454, 3):
    IUPACs = tables[i].find(text="IUPAC").findNext('td').get_text(",", strip=True)
    print(IUPACs)
for i in range(455, 661, 3):
    IUPACs_two = tables[i].find(text="IUPAC").findNext('td').get_text(",", strip=True)
    print(IUPACs_two)

# grab organism names
organism_list = page_soup.findAll("i")
for org in organism_list:
    print(org.text)

# get the MIC numbers
for org in organism_list:
    numbers = org.findNext('td').get_text(",", strip=True)
    print(numbers)

This prints most of what I want, but I completely lose track of which antibiotic (IUPAC name) each value belongs to. Realizing that each antibiotic has 3 tables, I also tried the following:

chem_tables = []
name_tables = []
org_tables = []
results_tables = []
for i in range(0, 451, 3):
    # three tables per compound: structure, names, and activity data
    chem_tables.append(tables[i])
    name_tables.append(tables[i + 1].find(text="IUPAC").findNext('td').get_text(",", strip=True))
    org_tables.append(tables[i + 2].findAll("i"))
    # findAll returns a ResultSet, which has no findNext; call it per tag instead
    results_tables.append([i_tag.findNext('td') for i_tag in tables[i + 2].findAll("i")])

This is nice because now chem_tables[0], org_tables[0], and name_tables[0] all refer to the same drug, but I cannot for the life of me figure out how to get the individual organism names out of org_tables without losing the information about which drug they belong to.
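One way to keep that association, assuming three parallel lists like the name_tables / org_tables / results_tables built above, is to zip them into one record per drug. This is a minimal sketch with hypothetical sample data standing in for the parsed values:

```python
# Hypothetical stand-ins for the parallel lists built from the page
name_tables = ["drug-A-IUPAC", "drug-B-IUPAC"]
org_tables = [["S. aureus", "E. faecalis"], ["B. pumilus"]]
results_tables = [["1-2", "2-4"], [">16"]]

# zip keeps index i of each list together, so every record still
# knows which drug its organisms and MICs belong to
records = [
    {"IUPAC": name, "Organism": orgs, "MIC": mics}
    for name, orgs, mics in zip(name_tables, org_tables, results_tables)
]

print(records[0])
# {'IUPAC': 'drug-A-IUPAC', 'Organism': ['S. aureus', 'E. faecalis'], 'MIC': ['1-2', '2-4']}
```

Because the grouping lives in the record itself, iterating over the organisms of one drug no longer loses the drug's name.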

I have been racking my brain over this for two days. Any help would be greatly appreciated.


1 Answer

Posted 2024-06-19 19:13:51

I would do it like this:

1) Find the IUPAC cell

2) Get its value

3) From the IUPAC cell, find the nearest following table

4) Find all table rows, skipping the first two and the last one (no useful data)

5) For each row, find all the font tags in the second cell to get the Organism names

6) Get the value of each row's third cell to get the MIC

7) Store each value from 5) into a list

8) Split 6) on commas and store into a list

9) Put everything into a dictionary

Sample code:

from bs4 import BeautifulSoup
import requests

response = requests.get('http://www.trimslabs.com/mic/300.htm')

soup = BeautifulSoup(response.content, "html.parser")

MicDatabase = []

for IUPAC in soup.find_all(text="IUPAC"):
    Value = IUPAC.find_next('td').get_text(",", strip=True)

    Organisms = []
    MICs = []
    # skip the first two rows and the last one (no useful data)
    for tr in IUPAC.find_next('table').find_all("tr")[2:-1]:
        td = tr.find_all("td")[1:]

        # accumulate inside the loop so every row is kept, not just the last
        Organisms.extend(o.get_text(" ", strip=True) for o in td[0].find_all("font"))
        MICs.extend(td[1].get_text(",", strip=True).split(','))

    MicDatabase.append(
        {
            "IUPAC": Value,
            "ActivityData": {"Organism": Organisms, "MIC": MICs}
        })

Output:

[{'ActivityData': {'MIC': [u'2-4', u'1-2', u'1-2', u'1-2', u'2-4', u'2-4', u'2-4', u'1-2', u'>16', u'2-4', u'1-2', u'0.25 - 0.5', u'0.25 - 0.5'], 'Organism': [u'B. pumilus ATCC 14348', u'S. epidermidis ATCC 155', u'E. faecalis ATCC 35550', u'S. aureus ATCC 25923', u'S. aureus ATCC 9144', u'S. aureus ATCC 14154', u'S. aureus ATCC 29213', u'S. aureus ATCC 700699', u'(methicillin-resistant)', u'S. aureus NRS 119', u'(linezolid-resistant)', u'E.faecalis ATCC 14506', u'E.faecalis ATCC 700802', u'(vancomycin-resistant)', u'S.pyogenes ATCC 14289', u'S.pneumoniae ATCC 700904', u'(penicillin-resistant)']}, 'IUPAC': u'2-[(S)-3-(3-Fluoro-4-morpholin-4-yl-phenyl)-2-oxo-oxazolidin-5-yl]-acetamide'}...
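Once a list shaped like MicDatabase exists, it is easy to query. A hedged sketch, pairing each organism with its MIC and indexing by IUPAC name (the sample data below is illustrative, not taken from the page):

```python
# Illustrative stand-in for the MicDatabase list built above
mic_database = [
    {"IUPAC": "drug-A-IUPAC",
     "ActivityData": {"Organism": ["S. aureus ATCC 25923", "E. faecalis ATCC 35550"],
                      "MIC": ["1-2", "2-4"]}},
]

# Build a lookup: IUPAC name -> {organism: MIC}
by_iupac = {
    entry["IUPAC"]: dict(zip(entry["ActivityData"]["Organism"],
                             entry["ActivityData"]["MIC"]))
    for entry in mic_database
}

print(by_iupac["drug-A-IUPAC"]["S. aureus ATCC 25923"])  # -> 1-2
```

Note that zip pairs organisms with MICs positionally, so this only works if the two lists stay the same length per entry.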
