How can I use XPath to get the content of the same field from every item?

Posted 2024-05-07 21:39:48


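I am a beginner with XPath and just cannot get it to match the content correctly. My question is:

How can I use XPath to get the dates "2010.09.07" (the value after 申请日:, the application date) and "2009.09.03"? There are actually 10 identical items (g_item) under the g_list hierarchy; I have listed only two of them here. I tried copying the XPath from Chrome, but it did not work.

I also tried a regular expression, as shown below, but it only matches the first item. Is there a way to return the dates for all items?

Thanks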

s.find(string=('申请日:')).find_next().text.replace('\n', '').strip()

<div class="g_list"> <div class="g_item"> <div class="g_tit"> <ul> <li class="g_li0"> <input id="CN201010274593.21" name="recordno" type="checkbox" value="CN201010274593.2" pnm="CN102403785B" sysid="B58C6C20BB7D5998B03811E0866F5981" appid="201010274593.2" sectionName="FMSQ" onclick="checkall()"/></li> <input id="tifPath1" name="tifPath" type="hidden" tifvalue="BOOKS/SD/2014/20140716/201010274593.2,12,CN201010274593.2" xmlvalue="FMSQ,CN201010274593.2,2014.07.16" pdfvalue="Granted_patent_for_invention/2014/20140716/CN102403785B/PDF_PID/CN102010000274593CN00001024037850BPDFZH20140716CN008.PDF,CN201010274593.2" pdfvalue2="CN102403785B,2014.07.16"/> <li class="g_li" onclick="viewDetail(0)" style="cursor:pointer" name='patti' title="电源管理装置及其电源管理方法"> 1.电源管理装置及其电源管理方法</li> <li class="g_li1">发明授权 </li> <li class="g_li2 cor3">无效</li> <li class="g_li3"><a href="javascript:noAction();" onclick="downloadRecord('B58C6C20BB7D5998B03811E0866F5981','FMSQ');">下载</a></li> </ul> <div class="clear"></div> </div> <div class="g_cont"> <div class="g_cont_left"> <table cellpadding="0" cellspacing="0" border="0"> <tr> <td><span>申请号:</span> <a href="javascript:viewDetail(0);">CN201010274593.2</a> </td> <td><span>申请日:</span> 2010.09.07 </td> </tr> <tr> <td><span>公开(公告)号:</span> CN102403785B </td> <td><span>公开(公告)日:</span> 2014.07.16 </td> </tr> <tr> <td><span>同日申请: </td> <td><span>分案原申请号: </td> </tr> <tr> <td colspan="2" style="width:610px;word-break:break-all;"><span>申请(专利权)人:</span> 鸿富锦精密工业(深圳)有限公司;鸿海精密工业股份有限公司 </td> </tr> <tr> <td colspan="2" style="width:610px;word-break:break-all;"><span>分类号:</span> H02J13/00(2006.01) </td> </tr> <tr> <td colspan="2" style="width:610px;word-break:break-all;"><span>优先权:</span></td> </tr> <tr> <td colspan="2"><span>摘要:</span><span name="patab" style="font-weight:normal"></span> <a name="abmtlink" href="javascript:return false;" style="color:blue">机器翻译</a></td> </tr> </table> </div> <div class="g_cont_rig" id="pic1"> <a 
href="http://pic.cnipr.com/XmlData/SQ\20140716\201010274593.2/201010274593.gif"><img name="tifpath" src="http://pic.cnipr.com/XmlData/SQ\20140716\201010274593.2/201010274593.gif" class="imgstyle"/></a> </div> <div class="clear"></div> </div> </div> <div class="g_item"> <div class="g_tit"> <ul> <li class="g_li0"> <input id="CN200910171675.12" name="recordno" type="checkbox" value="CN200910171675.1" pnm="CN102006581B" sysid="E7025BBD105585DF6CE4193E52ECC322" appid="200910171675.1" sectionName="FMSQ" onclick="checkall()"/></li> <input id="tifPath2" name="tifPath" type="hidden" tifvalue="BOOKS/SD/2013/20130911/200910171675.1,21,CN200910171675.1" xmlvalue="FMSQ,CN200910171675.1,2013.09.11" pdfvalue="Granted_patent_for_invention/2013/20130911/CN102006581B/PDF_PID/CN102009000171675CN00001020065810BPDFZH20130911CN008.PDF,CN200910171675.1" pdfvalue2="CN102006581B,2013.09.11"/> <li class="g_li" onclick="viewDetail(1)" style="cursor:pointer" name='patti' title="IP地址强制续约的方法及装置"> 2.IP地址强制续约的方法及装置</li> <li class="g_li1">发明授权 </li> <li class="g_li2 cor3">无效</li> <li class="g_li3"><a href="javascript:noAction();" onclick="downloadRecord('E7025BBD105585DF6CE4193E52ECC322','FMSQ');">下载</a></li> </ul> <div class="clear"></div> </div> <div class="g_cont"> <div class="g_cont_left"> <table cellpadding="0" cellspacing="0" border="0"> <tr> <td><span>申请号:</span> <a href="javascript:viewDetail(1);">CN200910171675.1</a> </td> <td><span>申请日:</span> 2009.09.03 </td> </tr> <tr> <td><span>公开(公告)号:</span> CN102006581B </td> <td><span>公开(公告)日:</span> 2013.09.11 </td> </tr> <tr> <td><span>同日申请: </td> <td><span>分案原申请号: </td> </tr> <tr> <td colspan="2" style="width:610px;word-break:break-all;"><span>申请(专利权)人:</span> 中兴通讯股份有限公司 </td> </tr> <tr> <td colspan="2" style="width:610px;word-break:break-all;"><span>分类号:</span> H04W8/08(2009.01);H04W36/14(2009.01);H04W84/12(2009.01);H04L29/12(2006.01) </td> </tr> <tr> <td colspan="2" style="width:610px;word-break:break-all;"><span>优先权:</span></td> </tr> <tr> 
<td colspan="2"><span>摘要:</span><span name="patab" style="font-weight:normal"></span> <a name="abmtlink" href="javascript:return false;" style="color:blue">机器翻译</a></td> </tr> </table> </div> <div class="g_cont_rig" id="pic2"> <a href="http://pic.cnipr.com/XmlData/SQ/20130911/200910171675.1/200910171675.gif"><img name="tifpath" src="http://pic.cnipr.com/XmlData/SQ/20130911/200910171675.1/200910171675.gif" class="imgstyle"/></a> </div> <div class="clear"></div> </div> </div> </div> </div>

1 answer
Answer 1 (community member) · Posted 2024-05-07 21:39:48

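Your HTML has several markup validation errors; you can check for them with the W3C validator. Once the following errors are fixed, however, the string can be parsed with lxml: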

Unclosed element span. From line 43, column 37; to line 43, column 42
Unclosed element span. From line 46, column 37; to line 46, column 42
Unclosed element span. From line 119, column 37; to line 119, column 42
Unclosed element span. From line 122, column 37; to line 122, column 42
Stray end tag div. From line 159, column 5; to line 159, column 10
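To illustrate why these errors matter, here is a minimal sketch using the standard library's strict XML parser as a stand-in for `etree.fromstring` from lxml, which is similarly strict. The two fragments are hypothetical, modelled on the unclosed `<span>` cells in the question rather than copied from the full page.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments modelled on the page:
# one with the unclosed <span>, one with the closing tag restored.
broken = "<tr><td><span>同日申请: </td></tr>"
fixed = "<tr><td><span>同日申请:</span> </td></tr>"


def parses_as_xml(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(parses_as_xml(broken))  # False
print(parses_as_xml(fixed))   # True
```

A strict parser rejects the whole document on the first mismatched tag, which is why all five reported errors need to be repaired before the XML-based approach works.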

from lxml import etree

pageHTML = """
<div class="g_list">
        <div class="g_item">
        ...
        ...
"""

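# etree.fromstring is a strict XML parser, so the markup errors listed above must be fixed first
root = etree.fromstring(pageHTML)

# In each g_cont_left table, the second cell of the first row holds the text that follows the "申请日:" label
dateList = root.xpath("//*[@class='g_cont_left']/table/tr[1]/td[2]/text()")
print(dateList)

# Output: [' 2010.09.07 ', ' 2009.09.03 ']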
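If repairing the markup by hand is not practical, one alternative (not part of the original answer) is lxml's forgiving HTML parser, `lxml.html.fromstring`. The sketch below also anchors the XPath on the label text instead of on row and column positions, and strips the surrounding whitespace. The sample markup is a trimmed, hypothetical stand-in for the real page, with the unclosed `<span>` elements left in on purpose.

```python
from lxml import html  # third-party package: pip install lxml

# Trimmed, hypothetical sample in the shape of the page from the question.
# The unclosed <span> elements are kept deliberately.
page_html = """
<div class="g_list">
  <div class="g_item"><div class="g_cont_left"><table>
    <tr><td><span>申请号:</span> <a href="#">CN201010274593.2</a> </td>
        <td><span>申请日:</span> 2010.09.07 </td></tr>
    <tr><td><span>同日申请: </td><td><span>分案原申请号: </td></tr>
  </table></div></div>
  <div class="g_item"><div class="g_cont_left"><table>
    <tr><td><span>申请号:</span> <a href="#">CN200910171675.1</a> </td>
        <td><span>申请日:</span> 2009.09.03 </td></tr>
    <tr><td><span>同日申请: </td><td><span>分案原申请号: </td></tr>
  </table></div></div>
</div>
"""

# The HTML parser recovers from the markup errors instead of raising.
root = html.fromstring(page_html)

# For every <span> whose text contains the label, take the first text node after it.
raw_dates = root.xpath(
    "//td/span[contains(text(), '申请日')]/following-sibling::text()[1]"
)
dates = [d.strip() for d in raw_dates]
print(dates)  # ['2010.09.07', '2009.09.03']
```

Matching on the label keeps working if the site reorders the rows or cells, whereas `tr[1]/td[2]` silently returns the wrong cell in that case.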

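If you still want to use a regular expression (I advise against it: for HTML, XML and similar formats, a parser written for that syntax is the better tool), you can define a capture group, ([\d\.]+), that accepts only digits and dots, and surround it with the exact text you expect to find on either side.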

import re

pageHTML = """
<div class="g_list">
        <div class="g_item">
        ...
        ...
"""
# Raw string, so that \s and \d reach the regex engine instead of being treated as string escapes
date_Regex = re.findall(r"申请日:\s*</span>\s*([\d\.]+)\s*</td>", pageHTML)
print(date_Regex)

# ['2010.09.07', '2009.09.03']
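As a self-contained variant of the regex approach, the sketch below runs against a trimmed, hypothetical sample instead of the elided page. It assumes every application date has the form YYYY.MM.DD, so it uses a stricter capture group than `[\d\.]+`, and it converts the matches to `date` objects so that malformed values fail loudly.

```python
import re
from datetime import datetime

# Trimmed, hypothetical sample containing only the date cells of two items.
page_html = """
<td><span>申请日:</span> 2010.09.07 </td>
<td><span>公开(公告)日:</span> 2014.07.16 </td>
<td><span>申请日:</span> 2009.09.03 </td>
<td><span>公开(公告)日:</span> 2013.09.11 </td>
"""

# Assumption: dates are always written as YYYY.MM.DD.
DATE_RE = re.compile(r"申请日:\s*</span>\s*(\d{4}\.\d{2}\.\d{2})\s*</td>")

# findall returns the captured group for every match, not just the first one.
date_strings = DATE_RE.findall(page_html)

# strptime raises ValueError if a captured value is not a real calendar date.
dates = [datetime.strptime(s, "%Y.%m.%d").date() for s in date_strings]

print(date_strings)  # ['2010.09.07', '2009.09.03']
print(dates)
```

The publication dates (公开(公告)日) are skipped because the pattern requires the literal label 申请日: immediately before the closing `</span>`.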
