Extracting a table from a web page with tags at different nesting levels

1 answer

Answer 1 · posted 2024-06-01 20:35:48

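First, \u2122 is just the ASCII-friendly representation of the ™ Unicode character. If you print() the string, you will see the character itself instead of the escape sequence. So don't worry about it!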
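A quick way to convince yourself, sketched in Python 3 (where ascii() gives the escaped form that Python 2's repr() shows by default); the variable names are only for illustration:

```python
# The escape sequence and the literal character denote the same string.
label = 'CutSmart\u2122 Buffer'

print(label)         # prints the trademark sign itself
print(ascii(label))  # prints the ASCII-safe form containing the \u2122 escape

# Comparing against the literal character confirms they are identical.
same = (label == 'CutSmart™ Buffer')
print(same)
```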

Next, your code does not work for me:

tree.xpath('//*[@id="form1"]/div[2]/div/div/section[@class="chart"]/table/tbody/tr')

returns a list, and a list has no xpath() method, so the following cannot work:

rows.xpath('//td/a/text()')

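So I don't understand how you got any results. Even if it did work, there is something about XPath that you missed: a leading // makes the search start from the root of the document. That is why you get the contents of every a tag inside every td in the page, rather than only those inside the tr you are on.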

If you use a relative XPath instead, the following works:

>>> rows[0].xpath('td/a')
[<Element a at 0x2e3ff50>, <Element a at 0x2e3ff00>]
>>> rows[0].xpath('td/a/text()')
['AatII', u'CutSmart\u2122 Buffer']
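To make the difference concrete, here is a minimal sketch on a made-up two-row table (the markup and variable names are mine, not taken from the page in the question). It assumes lxml is installed and shows that xpath() returns a plain list, and how an absolute and a relative query behave when called on a single row:

```python
from lxml import html

# Hypothetical two-row table, just to contrast the two query styles.
doc = html.fromstring("""
<table>
  <tr><td><a href="#1">AatII</a></td><td><a href="#b1">CutSmart Buffer</a></td></tr>
  <tr><td><a href="#2">AbaSI</a></td><td><a href="#b2">NEBuffer 4</a></td></tr>
</table>
""")

rows = doc.xpath('//tr')             # xpath() returns a plain Python list
has_xpath = hasattr(rows, 'xpath')   # only elements have .xpath(), lists do not

# '//' restarts from the document root, even when called on a single row.
absolute = rows[0].xpath('//td/a/text()')

# A relative path only looks below the row it is called on.
relative = rows[0].xpath('td/a/text()')

print(has_xpath)
print(absolute)
print(relative)
```

The absolute query returns the link texts of both rows, the relative one only those of the first row.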

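But this approach is too generic: it gives you no way to keep the elements you are interested in lined up with their columns. Sadly, there is no automatic way to do that.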

So you need to look at the HTML and decide, cell by cell, what you want: the alt of the images in one td, the content of the span in another:

<tr>
    <td>
        <a href="/products/r0117-aatii">AatII</a>
    </td>
    <td>
        <img class="product-icon" longdesc="This enzyme is purified from a recombinant source." alt="recombinant" src="/~/media/Icons/icon_recomb.gif">
        <img class="product-icon" longdesc="This enzyme is capable of digesting 1 µg of DNA in 5 minutes." alt="timesaver 5min" src="/~/media/Icons/icon_timesaver5.gif">
        <img class="product-icon" longdesc="Cleavage with this restriction enzyme is blocked when the substrate DNA is methylated by CpG methylase." alt="cpg" src="/~/media/Icons/icon_cpg.gif">
    </td>
    <td>GACGT/C</td>
    <td>
        <a href="/products/b7204-cutsmart-buffer">CutSmart™ Buffer</a>
    </td>
    <td>10</td>
    <td>50*</td>
    <td>50</td>
    <td>100</td>
    <td>
        <span style="color:red;">80°C</span>
    </td>
    <td>37°C</td>
    <td>B </td>
    <td>
        <img width="10" height="10" alt="Not Sensitive" src="/~/media/Icons/Not Sensitive.gif">
    </td>
    <td>
        <img width="10" height="10" alt="Not Sensitive" src="/~/media/Icons/Not Sensitive.gif">
    </td>
    <td>
        <img width="10" height="10" alt="Blocked" src="/~/media/Icons/Blocked.gif">
    </td>
    <td>λ DNA</td>
    <td></td>
</tr>

Here is how to get the values of interest from the linked document:

>>> for row in rows: print row[0].xpath('a/text()'), [img.attrib['alt'] for img in row[1].xpath('img')], row[2].text, row[3].xpath('a/text()'), row[4].text, row[5].text, row[6].text, row[7].text, row[8].xpath('span/text()'), row[9].text, [img.attrib['alt'] for img in row[10].xpath('img')], [img.attrib['alt'] for img in row[11].xpath('img')], [img.attrib['alt'] for img in row[12].xpath('img')], row[13].text, row[14].text
['AatII'] ['recombinant', 'timesaver 5min', 'cpg'] GACGT/C [u'CutSmart\u2122 Buffer'] 10 50* 50 100 [u'80\xb0C'] 37°C [] ['Not Sensitive'] ['Not Sensitive'] None λ DNA
['AbaSI'] ['recombinant'] None ['NEBuffer 4'] 25 50 50 100 [] 25°C [] ['Not Sensitive'] ['Not Sensitive'] None None
['Acc65I'] ['recombinant', 'timesaver 5min', 'dcm', 'cpg'] G/GTACC ['NEBuffer 3.1'] 10 75* 100 25 [] 37°C [] ['Not Sensitive'] ['Blocked by Some Combinations of Overlapping'] None pBC4 DNA
...

That gets all the fields.
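The per-cell decisions still have to be made by looking at the markup, but if every cell turns out to be either image-only or text-only, they can be folded into one small helper. This is only a sketch under that assumption: cell_value is a hypothetical name, the row is a shortened copy of the sample above, and lxml is assumed to be installed.

```python
from lxml import html

# Shortened copy of the sample row (fewer cells, same kinds of markup).
row = html.fromstring("""
<table><tr>
  <td><a href="/products/r0117-aatii">AatII</a></td>
  <td><img alt="recombinant" src="a.gif"><img alt="cpg" src="b.gif"></td>
  <td>GACGT/C</td>
  <td><span style="color:red;">80°C</span></td>
  <td></td>
</tr></table>
""").xpath('//tr')[0]


def cell_value(td):
    """Hypothetical helper: alt texts if the cell holds images, else its text."""
    alts = td.xpath('img/@alt')
    if alts:
        return [str(alt) for alt in alts]
    # text_content() also collects text nested inside <a> or <span>.
    text = td.text_content().strip()
    return text or None


# Iterating over 'td' relative to the row keeps the cells in page order.
values = [cell_value(td) for td in row.xpath('td')]
print(values)
```

Because the cells are visited in order, position i in values always corresponds to column i of the table.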

Finally, to make the result easy to reuse, I would do this:

 enzimes = [{ 'enzime'                     : row[0].xpath('a/text()'),
              'attributes'                 : [img.attrib['alt'] for img in row[1].xpath('img')],
              'Supplied NEBuffer'          : row[2].text,
              '% Activity in NEBuffer 1.1' : row[3].xpath('a/text()'),
              '% Activity in NEBuffer 2.1' : row[4].text,
              '% Activity in NEBuffer 3.1' : row[5].text,
              'CutSmart'                   : row[6].text,
              'Heat Inac.'                 : row[7].text,
              'Incu. Temp.'                : row[8].xpath('span/text()')[0] if len(row[8].xpath('span/text()')) > 0 else row[8].text,
              'Diluent'                    : row[9].text,
              'Dam'                        : [img.attrib['alt'] for img in row[10].xpath('img')],
              'Dcm'                        : [img.attrib['alt'] for img in row[11].xpath('img')],
              'CpG'                        : [img.attrib['alt'] for img in row[12].xpath('img')],
              'Unit Substrate'             : row[13].text,
              'Note'                       : row[14].text
            } for row in rows]

For the first row, the result is:

>>> import pprint
>>> pprint.pprint(enzimes[0])
{'% Activity in NEBuffer 1.1': [u'CutSmart\u2122 Buffer'],
 '% Activity in NEBuffer 2.1': '10',
 '% Activity in NEBuffer 3.1': '50*',
 'CpG': ['Not Sensitive'],
 'CutSmart': '50',
 'Dam': [],
 'Dcm': ['Not Sensitive'],
 'Diluent': u'37\xb0C',
 'Heat Inac.': '100',
 'Incu. Temp.': u'80\xb0C',
 'Note': u'\u03bb DNA',
 'Supplied NEBuffer': 'GACGT/C',
 'Unit Substrate': None,
 'attributes': ['recombinant', 'timesaver 5min', 'cpg'],
 'enzime': ['AatII']}
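Compared with the sample row above, the labels in this output look shifted by one column: 'GACGT/C' is the recognition sequence rather than a buffer name, and the sample row has 16 cells while the comprehension reads only 15. One way to keep labels and cells aligned is to list the columns once, in page order, and zip them with the cells. The sketch below assumes the column order of the sample row; the 'Sequence' label and the row_to_dict helper are my own names, and lxml is assumed to be installed.

```python
from lxml import html

# (label, kind) per cell, in page order. 'Sequence' is an assumed label for
# the third cell; the other labels are the ones used in the answer.
COLUMNS = [
    ('enzime', 'text'), ('attributes', 'alts'), ('Sequence', 'text'),
    ('Supplied NEBuffer', 'text'),
    ('% Activity in NEBuffer 1.1', 'text'),
    ('% Activity in NEBuffer 2.1', 'text'),
    ('% Activity in NEBuffer 3.1', 'text'),
    ('CutSmart', 'text'), ('Heat Inac.', 'text'), ('Incu. Temp.', 'text'),
    ('Diluent', 'text'), ('Dam', 'alts'), ('Dcm', 'alts'), ('CpG', 'alts'),
    ('Unit Substrate', 'text'), ('Note', 'text'),
]


def row_to_dict(row):
    """Hypothetical helper: pair each <td> with its label, in order."""
    cells = row.xpath('td')
    if len(cells) != len(COLUMNS):
        # Fail loudly instead of silently shifting every label.
        raise ValueError('expected %d cells, got %d' % (len(COLUMNS), len(cells)))
    record = {}
    for (label, kind), td in zip(COLUMNS, cells):
        if kind == 'alts':
            record[label] = [str(alt) for alt in td.xpath('img/@alt')]
        else:
            record[label] = td.text_content().strip() or None
    return record


# Compact copy of the sample row above (image attributes trimmed to alt).
SAMPLE = """<table><tr>
<td><a href="/products/r0117-aatii">AatII</a></td>
<td><img alt="recombinant"><img alt="timesaver 5min"><img alt="cpg"></td>
<td>GACGT/C</td>
<td><a href="/products/b7204-cutsmart-buffer">CutSmart™ Buffer</a></td>
<td>10</td><td>50*</td><td>50</td><td>100</td>
<td><span style="color:red;">80°C</span></td>
<td>37°C</td><td>B </td>
<td><img alt="Not Sensitive"></td>
<td><img alt="Not Sensitive"></td>
<td><img alt="Blocked"></td>
<td>λ DNA</td><td></td>
</tr></table>"""

record = row_to_dict(html.fromstring(SAMPLE).xpath('//tr')[0])
print(record['Sequence'], record['Supplied NEBuffer'], record['CpG'])
```

The length check is the point of the design: if the page gains or loses a column, the helper raises instead of quietly attaching values to the wrong labels.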

HTH
