Parsing elements with duplicate names from an RSS feed

Posted 2024-10-04 01:31:11


I am parsing this RSS feed: https://gh.bmj.com/rss/recent.xml. Each <item> block has two elements named <dc:identifier>:

<item rdf:about="http://gh.bmj.com/cgi/content/short/4/4/e001065?rss=1">
<title>
<![CDATA[
Use of routinely collected electronic healthcare data for postlicensure vaccine safety signal detection: a systematic review
]]>
</title>
<link>
http://gh.bmj.com/cgi/content/short/4/4/e001065?rss=1
</link>
<description>
<![CDATA[
<sec><st>Background</st> <p>Concerns regarding adverse events following vaccination (AEFIs) are a key challenge for public confidence in vaccination. Robust postlicensure vaccine safety monitoring remains critical to detect adverse events, including those not identified in prelicensure studies, and to ensure public safety and public confidence in vaccination. We summarise the literature examined AEFI signal detection using electronic healthcare data, regarding data sources, methodological approach and statistical analysis techniques used.</p> </sec> <sec><st>Methods</st> <p>We performed a systematic review using the Preferred Reporting Items for Systematic Reviews and Meta-analyses guidelines. Five databases (PubMed/Medline, EMBASE, CINAHL, the Cochrane Library and Web of Science) were searched for studies on AEFIs monitoring published up to 25 September 2017. Studies were appraised for methodological quality, and results were synthesised narratively.</p> </sec> <sec><st>Result</st> <p>We included 47 articles describing AEFI signal detection using electronic healthcare data. All studies involved linked diagnostic healthcare data, from the emergency department, inpatient and outpatient setting and immunisation records. Statistical analysis methodologies used included non-sequential analysis in 33 studies, group sequential analysis in two studies and 12 studies used continuous sequential analysis. Partially elapsed risk window and data accrual lags were the most cited barriers to monitor AEFIs in near real-time.</p> </sec> <sec><st>Conclusion</st> <p>Routinely collected electronic healthcare data are increasingly used to detect AEFI signals in near real-time. Further research is required to check the utility of non-coded complaints and encounters, such as telephone medical helpline calls, to enhance AEFI signal detection.</p> </sec> <sec><st>Trial registration number</st> <p>CRD42017072741</p> </sec>
]]>
</description>
<dc:creator>
<![CDATA[ Mesfin, Y. M., Cheng, A., Lawrie, J., Buttery, J. ]]>
</dc:creator>
<dc:date>2019-07-08T21:52:19-07:00</dc:date>
<dc:identifier>info:doi/10.1136/bmjgh-2018-001065</dc:identifier>
<dc:identifier>hwp:master-id:bmjgh;bmjgh-2018-001065</dc:identifier>
<dc:publisher>BMJ Publishing Group Ltd</dc:publisher>
<dc:subject>
<![CDATA[ Open access ]]>
</dc:subject>
<dc:title>
<![CDATA[
Use of routinely collected electronic healthcare data for postlicensure vaccine safety signal detection: a systematic review
]]>
</dc:title>
<prism:publicationDate>2019-07-08</prism:publicationDate>
<prism:section>Research</prism:section>
<prism:volume>4</prism:volume>
<prism:number>4</prism:number>
<prism:startingPage>e001065</prism:startingPage>
<prism:endingPage>e001065</prism:endingPage>
</item>

Of these two elements:

<dc:identifier>info:doi/10.1136/bmjgh-2018-001065</dc:identifier>
<dc:identifier>hwp:master-id:bmjgh;bmjgh-2018-001065</dc:identifier>

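I want the one containing the DOI, info:doi/10.1136/bmjgh-2018-001065, but when I use Python feedparser (https://pythonhosted.org/feedparser/) I only get the second one. My assumption is that it reads the value of the first element and then overwrites it when it encounters the second element with the same name. Is there a way to prevent or work around this?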


2 Answers

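You can download the RSS file from the URL with urllib.request.urlretrieve, then use minidom to remove the unwanted dc:identifier elements first. After that, you can use feedparser to access the value you want.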

from urllib import request
from xml.dom import minidom

import feedparser

# Download the feed to a local file and parse it into a DOM tree.
request.urlretrieve("https://gh.bmj.com/rss/recent.xml", "recent.xml")
xmldoc = minidom.parse("recent.xml")

# Remove every dc:identifier whose text starts with "hwp:", so that only
# the DOI identifier remains in each item.
for item in xmldoc.getElementsByTagName("dc:identifier"):
    if item.firstChild is not None and item.firstChild.nodeValue.startswith("hwp:"):
        item.parentNode.removeChild(item)

# Write the modified document to a new file for feedparser to read.
with open("recent_modified.xml", "w", encoding="utf-8") as file_handle:
    xmldoc.writexml(file_handle)

d = feedparser.parse("recent_modified.xml")

for item in d.entries:
    print(item.dc_identifier)
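
As a variation on the same idea, here is a minimal sketch that skips the intermediate file and feedparser entirely: it parses the XML with xml.etree.ElementTree and collects every dc:identifier per item, so nothing gets overwritten. The SAMPLE string and the helper name doi_identifiers are hypothetical, and the namespace URIs are the standard RSS 1.0 and Dublin Core ones, assumed (not verified) to match what the feed declares on its root element.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs (standard RSS 1.0 / Dublin Core); check them
# against the declarations on the real feed's root element.
NS = {
    "rss": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, trimmed-down stand-in for the real feed.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="http://gh.bmj.com/cgi/content/short/4/4/e001065?rss=1">
    <dc:identifier>info:doi/10.1136/bmjgh-2018-001065</dc:identifier>
    <dc:identifier>hwp:master-id:bmjgh;bmjgh-2018-001065</dc:identifier>
  </item>
</rdf:RDF>"""


def doi_identifiers(xml_text):
    """Return one list of 'info:doi' identifiers per <item> (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    result = []
    for item in root.iterfind(".//rss:item", NS):
        # findall returns every matching child, so duplicates are all kept.
        ids = [el.text.strip() for el in item.findall("dc:identifier", NS) if el.text]
        result.append([i for i in ids if i.startswith("info:doi")])
    return result


print(doi_identifiers(SAMPLE))
```

To run this against the live feed, the downloaded file's contents could be passed in place of SAMPLE, provided the namespace assumptions hold.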

In this case, a simple regular expression does the job well.

In [1]: text = '''<item rdf:about="http://gh.bmj.com/cgi/content/short/4/4/e001065?rss=1"> 
   ...: <title> 
   ...: <![CDATA[ 
   ...: Use of routinely collected electronic healthcare data for postlicensure vaccine safety signal det
   ...: ection: a systematic review 
   ...: ]]> 
   ...: </title> 
   ...: <link>...'''

In [2]: import re

In [3]: re.findall('<dc:identifier>(info:doi.*?)</dc:identifier>', text)
Out[3]: ['info:doi/10.1136/bmjgh-2018-001065']

If the text contains line breaks inside the tags, you can remove them first:

text = text.replace('\n', '')

But in this case that does not seem necessary.
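
For reference, a self-contained sketch of the regex approach that tolerates line breaks inside the tag without stripping them from the whole text: re.DOTALL lets . match newlines, and \s* absorbs whitespace around the value. The sample text and the name DOI_PATTERN are hypothetical, made up for illustration.

```python
import re

# Hypothetical snippet with a line break inside the first dc:identifier.
text = """<dc:date>2019-07-08T21:52:19-07:00</dc:date>
<dc:identifier>
info:doi/10.1136/bmjgh-2018-001065
</dc:identifier>
<dc:identifier>hwp:master-id:bmjgh;bmjgh-2018-001065</dc:identifier>"""

# Capture only identifiers that start with "info:doi"; the lazy .*? stops
# at the first closing tag, and \s* trims surrounding whitespace.
DOI_PATTERN = re.compile(
    r"<dc:identifier>\s*(info:doi.*?)\s*</dc:identifier>", re.DOTALL
)

dois = DOI_PATTERN.findall(text)
print(dois)
```

A regex like this is fine for a quick extraction from a known feed, but it does not understand XML (CDATA, entities, namespace prefixes that differ), so a real parser is the safer choice for anything long-lived.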
