Accessing table values with a relative XPath in Python

Published 2024-09-28 19:04:41

I'm trying to write a relative XPath (not an absolute one) that lets me extract data from this URL: https://www.sec.gov/Archives/edgar/data/1000228/000100022810000006/the10k_2009.htm

My code is below. SalesB returns a value ('233715'), but SalesA comes back empty. What am I doing wrong?

from lxml import html
import requests

SEC_pageA = requests.get('https://www.sec.gov/Archives/edgar/data/1000228/000100022810000006/the10k_2009.htm')
SEC_treeA = html.fromstring(SEC_pageA.content)
SalesA = SEC_treeA.xpath('(//p[contains(., "CONSOLIDATED STATEMENTS OF INCOME")]/following::td[contains(.,"Net sales")]/following-sibling::td[@align="right"]//text())[1]')

SEC_pageB = requests.get('https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/d17062d10k.htm')
SEC_treeB = html.fromstring(SEC_pageB.content)
SalesB = SEC_treeB.xpath('(//p[contains(., "CONSOLIDATED STATEMENTS OF OPERATIONS")]/following::td[contains(.,"Net sales")]/following-sibling::td[@align="right"]//text())[1]')

print(SalesA)
print(SalesB)

SalesB returns the expected "Net sales" value from the page loaded into SEC_pageB (see https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/d17062d10k.htm).

I would like SalesA to return the "Net sales" figure (i.e. 6,538,336) from this page: https://www.sec.gov/Archives/edgar/data/1000228/000100022810000006/the10k_2009.htm

1 answer

Answer 1 · posted 2024-09-28 19:04:41

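This happens because some of the text in that page's source is wrapped across lines. A literal such as "Net sales" with a single space does not match text that contains a line break at that point, so the XPath never finds what you are actually looking for. The expression below keeps the same line breaks inside the string literals: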

from lxml import html
import requests

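# The line breaks and indentation inside the quoted literals are deliberate:
# they mirror how the text is wrapped in the page source.
xpath_a = """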
//*[text()[contains(., "CONSOLIDATED
      STATEMENTS OF INCOME")]]/following::td[contains(., "Net
      sales")][1]/following-sibling::td[@valign="bottom"][3]//text()
      """

SEC_pageA = requests.get('https://www.sec.gov/Archives/edgar/data/1000228/000100022810000006/the10k_2009.htm')
SEC_treeA = html.fromstring(SEC_pageA.content)
SalesA = SEC_treeA.xpath(xpath_a)

print(SalesA)

This prints:

['6,538,336']
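
If copying the exact line breaks into the XPath feels fragile, one possible alternative is XPath's normalize-space() function, which collapses runs of whitespace (including line breaks) into single spaces before the comparison. The sketch below is only an illustration: SAMPLE is a made-up fragment that imitates the wrapped text, not the real filing, and the align="right" cell and its position are assumptions, so the column predicate would likely need adjusting for the actual page.

from lxml import html

# Hypothetical fragment imitating a page whose source wraps text across lines.
SAMPLE = """
<html><body>
<p>CONSOLIDATED
      STATEMENTS OF INCOME</p>
<table>
  <tr>
    <td valign="bottom">Net
      sales</td>
    <td valign="bottom">$</td>
    <td valign="bottom" align="right">6,538,336</td>
  </tr>
</table>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# A single-space literal does not match text that contains a line break.
naive = tree.xpath('//td[contains(., "Net sales")]')

# normalize-space() collapses whitespace runs to one space before comparing.
robust = tree.xpath(
    '//p[contains(normalize-space(.), "CONSOLIDATED STATEMENTS OF INCOME")]'
    '/following::td[contains(normalize-space(.), "Net sales")][1]'
    '/following-sibling::td[@align="right"][1]//text()'
)

print(naive)
print(robust)

On this sample, naive is an empty list while robust contains the sales figure, which is the same behaviour difference described in the question.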
