Web scraping

<td class="nw">1. FC Köln</td>
<td class="nw">Hamburger SV</td>
<td class="nw">3 - 7 - 10</td>
<td class="kicktipp-tippabgabe ">
  <input name="spieltippForms[401969217].tippAbgegeben" id="spieltippForms_401969217_tippAbgegeben" value="true" type="hidden"/>
  <input id="spieltippForms_401969217_heimTipp" name="spieltippForms[401969217].heimTipp" type="tel" value="2" size="2" maxlength="3"/>:
  <input id="spieltippForms_401969217_gastTipp" name="spieltippForms[401969217].gastTipp" type="tel" value="2" size="2" maxlength="3"/>
</td>
</tr>
<tr>
  <td class="nw kicktipp-time">26.08.17 15:30</td>
  <td class="nw">Bayer 04 Leverkusen</td>
  <td class="nw">1899 Hoffenheim</td>
  <td class="nw">6 - 3 - 10</td>
  <td class="kicktipp-tippabgabe ">
    <input name="spieltippForms[401969218].tippAbgegeben" id="spieltippForms_401969218_tippAbgegeben" value="true" type="hidden"/>
    <input id="spieltippForms_401969218_heimTipp" name="spieltippForms[401969218].heimTipp" type="tel" value="2" size="2" maxlength="3"/>:
    <input id="spieltippForms_401969218_gastTipp" name="spieltippForms[401969218].gastTipp" type="tel" value="2" size="2" maxlength="3"/>
  </td>
</tr>
<tr>
  <td class="nw kicktipp-time"/>
  ...

2 answers

User

Answer 1 · edited 2024-09-30 05:27:55

Another way to get the ids that contain the numbers is with code like this:

>>> from lxml import html
>>> tree = html.parse('table.htm')
>>> tree.xpath('.//input[contains(@id,"_heimTipp")]/@id')
['spieltippForms_401969217_heimTipp', 'spieltippForms_401969218_heimTipp']

I don't know how much the id values can vary, so it is hard to say how best to process them. It may be as simple as this:

>>> ids = tree.xpath('.//input[contains(@id,"_heimTipp")]/@id')
>>> numbers = [int(i.split('_')[1]) for i in ids]
>>> numbers
[401969217, 401969218]
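
If the ids turn out to vary more than the two shown here, a regular expression that insists on the expected shape may be safer than `split('_')[1]`, which raises an error as soon as an id has no second part or a non-numeric one. This is only a sketch: it assumes every wanted id looks like `spieltippForms_<digits>_heimTipp`, and the helper name `extract_match_ids` and the malformed sample id are made up for illustration.

```python
import re

# Assumed id shape, taken from the sample HTML: spieltippForms_<digits>_heimTipp
ID_PATTERN = re.compile(r'^spieltippForms_(\d+)_heimTipp$')


def extract_match_ids(ids):
    """Return the numeric part of every id that fits the assumed shape."""
    numbers = []
    for value in ids:
        m = ID_PATTERN.match(value)
        if m:  # ids that do not fit the pattern are skipped, not raised on
            numbers.append(int(m.group(1)))
    return numbers


ids = [
    'spieltippForms_401969217_heimTipp',
    'spieltippForms_401969218_heimTipp',
    'somethingElse_heimTipp',  # hypothetical malformed id, ignored below
]
print(extract_match_ids(ids))
```

Skipping ids that do not match is a design choice; if an unexpected id should be treated as a bug instead, raising a `ValueError` in the `else` case would be the stricter alternative.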

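User

Answer 2 · edited 2024-09-30 05:27:55

Your XPath expression should not be '//td[@class="nw"]/text()', because that returns the text inside the tags whose class attribute is "nw". Instead, given the HTML you provided and the output you want, read the name attribute of the input tags and parse that value: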

from lxml import html
import re

h = html.fromstring('''<table><tr><td class="kicktipp-tippabgabe ">
  <input name="spieltippForms[401969217].tippAbgegeben" id="spieltippForms_401969217_tippAbgegeben" value="true" type="hidden"/>
  <input id="spieltippForms_401969217_heimTipp" name="spieltippForms[401969217].heimTipp" type="tel" value="2" size="2" maxlength="3"/>:
  <input id="spieltippForms_401969217_gastTipp" name="spieltippForms[401969217].gastTipp" type="tel" value="2" size="2" maxlength="3"/>
</td>
</tr>
<tr>
  <td class="nw kicktipp-time">26.08.17 15:30</td>
  <td class="nw">Bayer 04 Leverkusen</td>
  <td class="nw">1899 Hoffenheim</td>
  <td class="nw">6 - 3 - 10</td>
  <td class="kicktipp-tippabgabe ">
    <input name="spieltippForms[401969218].tippAbgegeben" id="spieltippForms_401969218_tippAbgegeben" value="true" type="hidden"/>
    <input id="spieltippForms_401969218_heimTipp" name="spieltippForms[401969218].heimTipp" type="tel" value="2" size="2" maxlength="3"/>:
    <input id="spieltippForms_401969218_gastTipp" name="spieltippForms[401969218].gastTipp" type="tel" value="2" size="2" maxlength="3"/>
  </td>
</tr>
</table>''')

numbers = [int(x) for e in h.xpath('//input[@type="hidden"]') 
              for x in re.findall(r'\[(\d+)\]', e.get('name'))]

print(numbers)
# prints:
# [401969217, 401969218]
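
To make the difference between the two expressions concrete, the sketch below runs both against a single row of the sample HTML. The variable names `row`, `cell_texts` and `match_ids` are chosen for illustration only, and the row is wrapped in `<table>` so that lxml parses it as a fragment, the same way the code above does.

```python
from lxml import html
import re

# One row from the question's HTML, trimmed to the hidden input only.
row = html.fromstring('''<table><tr>
  <td class="nw kicktipp-time">26.08.17 15:30</td>
  <td class="nw">Bayer 04 Leverkusen</td>
  <td class="nw">1899 Hoffenheim</td>
  <td class="nw">6 - 3 - 10</td>
  <td class="kicktipp-tippabgabe ">
    <input name="spieltippForms[401969218].tippAbgegeben" id="spieltippForms_401969218_tippAbgegeben" value="true" type="hidden"/>
  </td>
</tr></table>''')

# Text of the cells whose class attribute is exactly "nw":
# team names and odds, but no match id. The time cell is not
# included because its class is "nw kicktipp-time".
cell_texts = row.xpath('//td[@class="nw"]/text()')

# The name attribute of the hidden input carries the number in brackets.
names = row.xpath('//input[@type="hidden"]/@name')
match_ids = [int(n) for name in names for n in re.findall(r'\[(\d+)\]', name)]

print(cell_texts)
print(match_ids)
```

The first expression only ever sees cell text, so the number has to come from an attribute of the input tags, whether that is `name` as here or `id` as in the first answer.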
