How can I use Scrapy to get the text from an unknown number N of child <p> tags?

Posted 2024-06-28 19:14:39


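I am trying to get the description of each event. The problem is that every event's description sits in an arbitrary <p> tag. So how can we reach that <p> tag to get its text?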

<div id='main'>
   <div class='templatecontent'>
       <h3>Evening Tide Talk-POSTPONED<img alt="" src="https://assets.speakcdn.com/assets/2204/hj_scope-2020022008493216.jpg" style="margin: 4px 14px; float: right; width: 300px; height: 463px;" /></h3>

       <p><strong>March 25th | 5:45 p.m. </strong></p>

       <p><strong>Dr. Heather Judkins</strong></p>

       <p><strong>University of South Florida St. Petersburg, Department of Biological Sciences</strong></p>

       <p><strong><em>Lessons Learned from Exploring the Deep</em></strong></p>
       <!-- I want to get this Paragraph -->

       <p>In her talk, Heather will share lessons learned and some unexpected finds from her journeys. Join us as she discusses unique cephalopod adaptations and memorable moments, while also sharing some “giant” findings from her most recent Gulf of Mexico cruise that led to breaking news in June 2019’s New York Times!</p>

       <p><a class="button-primary" href="/eveningtidetalks">Learn More</a></p>

       <p> </p>

       <p> </p>

       <p> </p>

       <hr />
       <h3>Washed Ashore - Art To Save The Sea <img alt="" src="https://assets.speakcdn.com/assets/2204/tfa_washed_ashore_exhibit_priscilla2.png" style="margin: 3px 13px; float: right; width: 300px; height: 300px;" /></h3>

       <p><strong><strong>February 29th - August 31st</strong></strong></p>

       <!-- I want to get this Paragraph -->
       <p>In honor of the Aquarium's 25th Anniversary celebration, we are proud to host Washed Ashore - Art To Save The Sea from now until the end of August! The nationally acclaimed exhibit artistically showcases the impacts of plastic pollution on oceans, waterways and wildlife. Washed Ashore sculptures have traveled around the country and The Florida Aquarium is showcasing 18 larger than life sculptures of marine life. </p>

       <p><a class="button" href="/washed-ashore">Learn More</a></p>

       <p> </p>

       <hr />
   </div>
</div>

As you can see here


1 Answer
User
#1 · Posted 2024-06-28 19:14:39

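You'll want to use the following-sibling:: axis to select the <p> tags that are at the same level as the h3, and then restrict the matching p tags to those that have text() as an immediate child. However, if you use just p[text()], it will also return the (more or less) undesirable empty <p> </p> tags. So add a further restriction with string-length() so that only the "interesting" looking ones are returned, yielding: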

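def parse(self, response):
    main_div = response.css('#main')
    for h3 in main_div.xpath('.//h3'):
        # The direct text of the <h3> is the event title
        talk_title = h3.xpath('text()').get()
        # First following <p> sibling whose own text is longer than 2 characters
        talk_summary = h3.xpath('./following-sibling::p[string-length(text()) > 2]/text()').get()
        # Emit one item per event
        yield {'talk_title': talk_title, 'talk_summary': talk_summary}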

This yields:

[
  {
    "talk_title": "Evening Tide Talk-POSTPONED",
    "talk_summary": "In her talk, Heather will share lessons learned and some unexpected finds from her journeys. Join us as she discusses unique cephalopod adaptations and memorable moments, while also sharing some “giant” findings from her most recent Gulf of Mexico cruise that led to breaking news in June 2019’s New York Times!"
  },
  {
    "talk_title": "Washed Ashore - Art To Save The Sea ",
    "talk_summary": "In honor of the Aquarium's 25th Anniversary celebration, we are proud to host Washed Ashore - Art To Save The Sea from now until the end of August! The nationally acclaimed exhibit artistically showcases the impacts of plastic pollution on oceans, waterways and wildlife. Washed Ashore sculptures have traveled around the country and The Florida Aquarium is showcasing 18 larger than life sculptures of marine life. "
  }
]

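The following-sibling::p axis matches every <p> element at the same level of the DOM as the element the XPath is evaluated from (here, the <h3>). In the sample above that is 13 <p> tags for the first <h3>: the 9 belonging to its own event plus the 4 after the <hr />, because the axis does not stop at the next <h3>. The p[] XPath syntax further restricts the matched p tags to those satisfying a predicate, and string-length(text()) > 2 means that the immediate text child node must have a string length greater than 2. Finally, of the text child nodes of the <p> tags that match, .get() returns the first one in document order, which in this sample is the description that follows each <h3>.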
