Finding the "date" in arbitrary web pages with Python

Published 2024-06-18 13:02:49


I want to scrape the exact publication time of news articles posted online.

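Some pages have well-formatted HTML headers from which I can extract "last-modified" or "published date"; the header information is messy but usable. (By the way, metadata_parser helps a lot!)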
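As a minimal sketch of the header-based approach, the snippet below collects date-like `<meta>` tags using only the standard library's `html.parser`. The sample HTML and the set of key names (`article:published_time`, `last-modified`, and so on) are assumptions for illustration; real pages vary, and this is not how `metadata_parser` itself works.

```python
from html.parser import HTMLParser

# Hypothetical set of meta keys that often carry a publication date.
DATE_KEYS = {
    "article:published_time",
    "article:modified_time",
    "og:updated_time",
    "last-modified",
    "date",
    "pubdate",
}


class MetaDateParser(HTMLParser):
    """Collect <meta> tags whose property/name/http-equiv looks date-related."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property") or attr.get("name") or attr.get("http-equiv")
        content = attr.get("content")
        if key and content and key.lower() in DATE_KEYS:
            self.found[key.lower()] = content


def extract_meta_dates(html):
    parser = MetaDateParser()
    parser.feed(html)
    return parser.found


# Made-up page head, for illustration only.
sample = """
<html><head>
<meta property="article:published_time" content="2015-11-16T07:18:58Z">
<meta name="description" content="not a date">
<meta http-equiv="last-modified" content="Mon, 16 Nov 2015 09:00:00 GMT">
</head><body></body></html>
"""

meta_dates = extract_meta_dates(sample)
print(meta_dates)
```

The returned values are still raw strings in whatever format the site chose, so they need a separate parsing step.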

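But large news organizations such as the BBC and CNN do not put date and time information in the HTML header, so I am trying to get the publication date and time from the HTML markup itself.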

For the BBC, the date and time are embedded like this:

<div data-timestamp-inserted="true" class="date date--v2" data-seconds="1447658338" data-datetime="16 November 2015">16 November 2015</div>
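In markup like this, the `data-seconds` attribute appears to be a Unix timestamp (an assumption based on its value, which falls on 16 November 2015 in UTC). A minimal standard-library sketch that reads it; the class name is hypothetical:

```python
from datetime import datetime, timezone
from html.parser import HTMLParser


class DataSecondsParser(HTMLParser):
    """Record the first data-seconds attribute found on any tag."""

    def __init__(self):
        super().__init__()
        self.seconds = None

    def handle_starttag(self, tag, attrs):
        value = dict(attrs).get("data-seconds")
        if self.seconds is None and value and value.isdigit():
            self.seconds = int(value)


bbc_html = (
    '<div data-timestamp-inserted="true" class="date date--v2" '
    'data-seconds="1447658338" data-datetime="16 November 2015">'
    "16 November 2015</div>"
)

parser = DataSecondsParser()
parser.feed(bbc_html)

# Assumption: the value counts seconds since the Unix epoch, in UTC.
published = datetime.fromtimestamp(parser.seconds, tz=timezone.utc)
print(published.isoformat())  # 2015-11-16T07:18:58+00:00
```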

For CNN, it looks like this:

[CNN markup snippet missing from the source]

For The New York Times:

<p class="byline-dateline"><span class="byline" itemprop="author creator" itemscope="" itemtype="http://schema.org/Person">By <span class="byline-author" data-byline-name="AURELIEN BREEDEN" itemprop="name">AURELIEN BREEDEN</span>, </span><span class="byline" itemprop="author creator" itemscope="" itemtype="http://schema.org/Person"><span class="byline-author" data-byline-name="KIMIKO DE FREYTAS-TAMURA" itemprop="name">KIMIKO DE FREYTAS-TAMURA</span> and </span><span class="byline" itemprop="author creator" itemscope="" itemtype="http://schema.org/Person" itemid="http://topics.nytimes.com/top/reference/timestopics/people/b/katrin_bennhold/index.html"><a href="http://topics.nytimes.com/top/reference/timestopics/people/b/katrin_bennhold/index.html" rel="author" title="More Articles by KATRIN BENNHOLD"><span class="byline-author" data-byline-name="KATRIN BENNHOLD" itemprop="name">KATRIN BENNHOLD</span></a></span><time class="dateline" datetime="2015-11-16" itemprop="datePublished" content="2015-11-16">NOV. 16, 2015</time></p>

As you can see, almost every news outlet has its own way of putting the date and time on the page.

My question: is it possible to use some kind of fuzzy search in BeautifulSoup to extract the date and time, so that I do not have to write rules for each website?

Thanks!
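One rule-light heuristic, sketched here with the standard library only, is to scan every tag for attributes that commonly carry dates instead of targeting site-specific classes. The attribute list, the two regular expressions, and the `find_date_candidates` helper are all hypothetical, and this is not a built-in BeautifulSoup feature; the same idea could be expressed with BeautifulSoup's `find_all`.

```python
import re
from html.parser import HTMLParser

# Assumed attribute names that tend to hold machine-readable dates.
DATE_ATTRS = {"datetime", "data-datetime", "data-seconds"}
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}")  # e.g. 2015-11-16
EPOCH = re.compile(r"^\d{10}$")  # e.g. 1447658338


class DateCandidateParser(HTMLParser):
    """Collect (tag, attribute, value) triples that might be dates."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            looks_like_date = ISO_DATE.match(value) or EPOCH.match(value)
            if name in DATE_ATTRS or looks_like_date:
                self.candidates.append((tag, name, value))


def find_date_candidates(html):
    parser = DateCandidateParser()
    parser.feed(html)
    return parser.candidates


# Shortened versions of the BBC and New York Times snippets above.
page = (
    '<div class="date date--v2" data-seconds="1447658338" '
    'data-datetime="16 November 2015">16 November 2015</div>'
    '<p class="byline-dateline"><time class="dateline" datetime="2015-11-16" '
    'itemprop="datePublished" content="2015-11-16">NOV. 16, 2015</time></p>'
)

candidates = find_date_candidates(page)
for candidate in candidates:
    print(candidate)
```

A heuristic like this will produce false positives (any ten-digit attribute value is treated as a possible timestamp), so the candidates still need ranking or validation before use.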


1 answer
User
#1 · posted 2024-06-18 13:02:49

In my experience and humble opinion, the best way to extract generic information like this is a NER (Named-Entity Recognition) system.

I suggest Scrapinghub's webstruct library:

Webstruct is a library for creating statistical NER systems that work on HTML data, i.e. a library for building tools that extract named entities (addresses, organization names, open hours, etc) from webpages.

Unlike most NER systems, webstruct works on HTML data, not only on text data. This allows to define features that use HTML structure, and also to embed annotation results back into HTML.

GitHub repository: https://github.com/scrapinghub/webstruct

Documentation: http://webstruct.readthedocs.org/en/latest/

Update:

When you need to extract dates, you can also use dateparser:

dateparser provides modules to easily parse localized dates in almost any string formats commonly found on web pages.

GitHub repository: https://github.com/scrapinghub/dateparser

Documentation: https://dateparser.readthedocs.org/en/latest/
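A minimal sketch of turning extracted strings into `datetime` objects. It calls `dateparser.parse` when the package is installed; otherwise it falls back to a small list of `strptime` formats, which is a stand-in added here for illustration and covers only the formats seen in the question. The `parse_date` helper is hypothetical.

```python
from datetime import datetime

try:
    import dateparser  # third-party: pip install dateparser
except ImportError:
    dateparser = None

# Fallback formats, chosen only to cover the examples in the question.
FALLBACK_FORMATS = ("%d %B %Y", "%Y-%m-%d")


def parse_date(text):
    """Return a datetime for a date string, or None if it cannot be parsed."""
    if dateparser is not None:
        return dateparser.parse(text)
    for fmt in FALLBACK_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


for raw in ("16 November 2015", "2015-11-16"):
    print(raw, "->", parse_date(raw))
```

The fallback only recognizes English month names in those two layouts; handling localized and free-form dates is the part dateparser is meant to cover.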
