Selenium/XPath: getting the HTML between two tags

Published 2024-09-27 07:27:31

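How can I extract

  • the message name (e.g. Message 1)
  • the received timestamp (e.g. Received: 214-2342-234)
  • and, hardest of all, the message text (e.g. This is message nr. 1 it contains different stuff like bold text, etc.)

from this HTML using Selenium 4 (preferably with XPath)? I am using Python.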

<body>
<p class="pclass">
<a name="msg1"></a>
Message 1:
<a href="..."> Link1</a>
<span> Received: 214-2342-234</span>
</p>
<br>This is message nr. 1 it contains different stuff like <b>bold text</b>, etc.<br><br>
<p class="pclass">
<a name="msg2"></a>
Message 2:
<a href="..."> Link1</a>
<span> Received: 214-46546-23532</span>
</p>
<br>Message nr. 2 may contain other stuff (maybe even a table...)<br><br>
<p class="pclass">
<a name="msg3"></a>
Message 3:
<a href="..."> Link1</a>
<a href="..."> Link2</a>
<span> Received: 214-7876967666</span>
</p>
<br>This message contained 2 hyperlinks before the received-timestamp.<br><br>
<a href="close.php">Close Messages</a>
</body>

Querying data inside a node is straightforward, but how do I query text that sits somewhere between tags? So far I can only get "msg1", "msg2", and so on:

msgs = driver.find_elements(By.XPATH, "//a[starts-with(@name, 'msg')]")
print(msgs[0].get_attribute('name')) # prints 'msg1'

3 answers

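Yes. Based on your HTML, you can do this with string manipulation plus a few other tools, such as splitlines() and the JavaScript executor.

Locate each p tag, read the message name from its childNodes, then read the span text for the timestamp.

To get the message text line by line, locate the body tag and call splitlines() on its text.

Code: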

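from selenium.webdriver.common.by import By

# Visible text of <body>, split into lines (read once, outside the loop)
lines = driver.find_element(By.TAG_NAME, "body").text.splitlines()
i = 2  # index of the first message text; the original steps 4 lines per message
for item in driver.find_elements(By.CSS_SELECTOR, "p.pclass"):
    # childNodes[2] is the text node that holds "Message N:"
    print(driver.execute_script("return arguments[0].childNodes[2].textContent;", item).strip())
    print(item.find_element(By.XPATH, "./span").text)
    print(lines[i])
    i += 4
    print("#########################################")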

Console output:

Message 1:
Received: 214-2342-234
This is message nr. 1 it contains different stuff like bold text, etc.
#########################################
Message 2:
Received: 214-46546-23532
Message nr. 2 may contain other stuff (maybe even a table...)
#########################################
Message 3:
Received: 214-7876967666
This message contained 2 hyperlinks before the received-timestamp.
#########################################
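
Selenium locators return elements only, so an XPath that ends in text() cannot be passed to find_element directly, which is why the code above falls back to the JavaScript executor. The fixed step of 4 lines also depends on how the page happens to render. As a minimal sketch of a less layout-dependent variant, the script below walks the siblings that follow a p.pclass header and collects their text until the next header. The names COLLECT_FOLLOWING_TEXT_JS and message_text_after are hypothetical, and stopping at the next p.pclass is an assumption based on the sample HTML.

```python
# JavaScript run in the browser: starting from the header element passed as
# arguments[0], gather the text of every following sibling node and stop when
# the next <p class="pclass"> header is reached (assumption from the sample HTML).
COLLECT_FOLLOWING_TEXT_JS = """
const parts = [];
for (let node = arguments[0].nextSibling; node; node = node.nextSibling) {
    if (node.nodeType === Node.ELEMENT_NODE && node.matches('p.pclass')) {
        break;
    }
    parts.push(node.textContent);
}
return parts.join('');
"""


def message_text_after(driver, header):
    """Return the whitespace-normalized text that follows `header`.

    `driver` is any object with Selenium's execute_script(script, *args)
    method; `header` is the WebElement for one <p class="pclass">.
    """
    raw = driver.execute_script(COLLECT_FOLLOWING_TEXT_JS, header)
    # Collapse newlines and repeated spaces left over from the markup.
    return " ".join(raw.split())
```

With a live session this could be called as message_text_after(driver, header) for each header returned by driver.find_elements(By.CSS_SELECTOR, "p.pclass"). For the last message the result would also include the trailing "Close Messages" link text, because no further header follows it.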

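To extract text from a text node, wait for the element with WebDriverWait and visibility_of_element_located(), then use either of the following approaches:

  • Using XPATH and splitlines() (the line indexes depend on how the page source is broken into lines)

    • To extract Message 1: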

      print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//body/p[@class='pclass']"))).get_attribute("innerHTML").splitlines()[1])
      
    • To extract Received: 214-2342-234:

      print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//body/p[@class='pclass']//span"))).text)
      
    • To extract This is message nr. 1 it contains...:

      print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//body"))).get_attribute("innerHTML").splitlines()[7])
      
  • Using XPATH and childNodes

    • To extract Message 1:

      print(driver.execute_script('return arguments[0].childNodes[2].textContent;', WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//body/p[@class='pclass']")))).strip())
      
    • To extract Received: 214-2342-234:

      print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//body/p[@class='pclass']//span"))).text)
      
    • To extract This is message nr. 1 it contains...:

      print(driver.execute_script('return arguments[0].childNodes[3].textContent;', WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//body")))).strip())
      
  • Note: you have to add the following imports:

      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support import expected_conditions as EC
    


  1. Message name

    from selenium.webdriver.common.by import By

    # p[2] selects the second message; use p[1] for the first
    text = driver.find_element(By.XPATH, '/html/body/p[2]').get_attribute('innerText')
    name = text.split(':')[0]
    print(name)

  2. Received timestamp

    timestamp = driver.find_element(By.XPATH, '/html/body/p[1]/span').get_attribute('innerText')
    print(timestamp)

  3. Message text

    import re

    message_text = driver.find_element(By.TAG_NAME, 'body').get_attribute('innerText')
    # Capture the text between a run of three newlines and a run of four
    print(re.findall(re.escape('\n\n\n') + "(.*)" + re.escape('\n\n\n\n'), message_text)[0])
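
The three extractions above rely on positional XPaths and on the exact newline runs in innerText. As a cross-check, here is a minimal sketch that parses the page source with the standard library's html.parser instead of querying the live DOM. This is a swapped-in technique, not something the answers use. MessageParser and parse_messages are hypothetical names, and treating a double <br> as the end of a message is an assumption based on the sample HTML.

```python
from html.parser import HTMLParser


class MessageParser(HTMLParser):
    """Collect name, timestamp and text for each <p class="pclass"> block."""

    def __init__(self):
        super().__init__()
        self.messages = []
        self.state = "idle"  # "idle", "header" (inside the <p>) or "body" (after </p>)
        self.inner = None    # "a" or "span" while inside one of them in the header
        self.br_run = 0      # consecutive <br> tags seen in the body

    def handle_starttag(self, tag, attrs):
        if tag == "p" and dict(attrs).get("class") == "pclass":
            self.messages.append({"name": "", "received": "", "text": ""})
            self.state, self.inner = "header", None
        elif self.state == "header" and tag in ("a", "span"):
            self.inner = tag
        elif self.state == "body":
            if tag == "br":
                self.br_run += 1
                # Assumption: a double <br> after some text ends the message.
                if self.br_run >= 2 and self.messages[-1]["text"].strip():
                    self.state = "idle"
            else:
                self.br_run = 0

    def handle_endtag(self, tag):
        if self.state == "header":
            if tag == "p":
                self.state, self.br_run = "body", 0
            elif tag == self.inner:
                self.inner = None

    def handle_data(self, data):
        if self.state == "idle" or not self.messages:
            return
        msg = self.messages[-1]
        if self.state == "header":
            if self.inner is None:
                msg["name"] += data       # bare text node, e.g. "Message 1:"
            elif self.inner == "span":
                msg["received"] += data   # text inside <span>; link text is skipped
        else:
            msg["text"] += data
            if data.strip():
                self.br_run = 0


def parse_messages(html):
    """Return a list of dicts with whitespace-normalized name/received/text."""
    parser = MessageParser()
    parser.feed(html)
    parser.close()
    return [
        {
            "name": " ".join(m["name"].split()).rstrip(":"),
            "received": " ".join(m["received"].split()),
            "text": " ".join(m["text"].split()),
        }
        for m in parser.messages
    ]
```

With a live session, the input would be the driver's page_source attribute, for example parse_messages(driver.page_source).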
