Extracting untagged text with BeautifulSoup

Posted 2024-10-01 00:21:51


I want to extract the untagged text from an XML file.

<body>
        <p>The prognosis of patients with rectal cancer has improved since the introduction of total mesorectal excision (TME) surgery [
            <xref ref-type="bibr" rid="CR1">1</xref>&#x02013;
            <xref ref-type="bibr" rid="CR3">3</xref>]. Using this surgical technique the mesorectal compartment including the rectum and perirectal fat is completely excised by sharp dissection along the mesorectal fascia (MRF) [
            <xref ref-type="bibr" rid="CR1">1</xref>]. Additionally, large randomized trials have shown that neo-adjuvant therapy improves local tumor control even further, regardless of optimized surgical techniques [
            <xref ref-type="bibr" rid="CR3">3</xref>, 
            <xref ref-type="bibr" rid="CR4">4</xref>]. The advances in rectal cancer treatment have provoked differentiated neo-adjuvant treatment strategies based on anatomical preoperative identifiable risk factors for local tumor recurrence as can be visualized with magnetic resonance imaging (MRI) [
            <xref ref-type="bibr" rid="CR5">5</xref>]. One of the most important risk factors is the tumor relationship to the MRF, which actually defines the surgical circumferential resection margin (CRM) in TME surgery [
            <xref ref-type="bibr" rid="CR6">6</xref>, 
            <xref ref-type="bibr" rid="CR7">7</xref>]. Long courses of neo-adjuvant chemoradiation have emerged as the preferential treatment of patients with anticipated tumor invasion of the MRF on MRI in order to downstage/downsize the tumor and to obtain tumor free resection margins [
            <xref ref-type="bibr" rid="CR5">5</xref>].
            </p>

</body>

The body may contain multiple <p> tags. I am looking to extract the text

"]. Using this surgical technique the mesorectal compartment including the rectum and perirectal fat is completely excised by sharp dissection along the mesorectal fascia (MRF) ["

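, which sits between CR3 and {}, and so on (that is, between consecutive xref tags). I also need to add this text to a dictionary that maps each rid to a list of such texts following these {}. How can I do this with BeautifulSoup and/or regular expressions?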


3 Answers

How about this?

html = """
<body>
        <p>The prognosis of patients with rectal cancer has improved since the introduction of total mesorectal excision (TME) surgery [
            <xref ref-type="bibr" rid="CR1">1</xref>&#x02013;
            <xref ref-type="bibr" rid="CR3">3</xref>]. Using this surgical technique the mesorectal compartment including the rectum and perirectal fat is completely excised by sharp dissection along the mesorectal fascia (MRF) [
            <xref ref-type="bibr" rid="CR1">1</xref>]. Additionally, large randomized trials have shown that neo-adjuvant therapy improves local tumor control even further, regardless of optimized surgical techniques [
            <xref ref-type="bibr" rid="CR3">3</xref>, 
            <xref ref-type="bibr" rid="CR4">4</xref>]. The advances in rectal cancer treatment have provoked differentiated neo-adjuvant treatment strategies based on anatomical preoperative identifiable risk factors for local tumor recurrence as can be visualized with magnetic resonance imaging (MRI) [
            <xref ref-type="bibr" rid="CR5">5</xref>]. One of the most important risk factors is the tumor relationship to the MRF, which actually defines the surgical circumferential resection margin (CRM) in TME surgery [
            <xref ref-type="bibr" rid="CR6">6</xref>, 
            <xref ref-type="bibr" rid="CR7">7</xref>]. Long courses of neo-adjuvant chemoradiation have emerged as the preferential treatment of patients with anticipated tumor invasion of the MRF on MRI in order to downstage/downsize the tumor and to obtain tumor free resection margins [
            <xref ref-type="bibr" rid="CR5">5</xref>].
            </p>

</body>
"""

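import re
# The string above is named html. By default `.` does not match newlines,
# so (.*) captures only the rest of the line after the first CR3 xref.
re.search(r'<xref ref-type="bibr" rid="CR3">3</xref>(.*)', html).group(1)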

The output is:

^{pr2}$
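Note that re.search returns only the first match, so the call above yields a single fragment. A minimal regex-only sketch of the full mapping the question asks for (rid to a list of texts) might look like the following. The XML here is a shortened, hypothetical stand-in for the question's document, and the names pattern and texts_by_rid are illustrative, not from the original answer.

```python
import re
from collections import defaultdict

# Shortened, hypothetical stand-in for the XML in the question.
xml = """<body>
<p>Intro [
    <xref ref-type="bibr" rid="CR1">1</xref>&#x02013;
    <xref ref-type="bibr" rid="CR3">3</xref>]. First sentence [
    <xref ref-type="bibr" rid="CR1">1</xref>]. Second sentence [
    <xref ref-type="bibr" rid="CR3">3</xref>,
    <xref ref-type="bibr" rid="CR1">1</xref>]. Third sentence [
    <xref ref-type="bibr" rid="CR5">5</xref>].
</p>
</body>"""

# Group 1: the rid attribute. Group 2: everything after </xref>,
# taken lazily up to the next <xref ...> or the closing </p>.
pattern = re.compile(
    r'<xref\b[^>]*\brid="([^"]+)"[^>]*>.*?</xref>(.*?)(?=<xref\b|</p>)',
    re.DOTALL,  # let `.` cross line breaks
)

texts_by_rid = defaultdict(list)
for rid, text in pattern.findall(xml):
    text = text.strip()
    # Keep only fragments shaped like "]. ... [", i.e. real text between citations.
    if text.startswith(']') and text.endswith('['):
        texts_by_rid[rid].append(text)

print(dict(texts_by_rid))
```

Regular expressions are fragile on markup in general; this only holds up as long as the xref tags are as regular as in the sample.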

Check this out (assuming all rid values start with CR):

>>> from bs4 import BeautifulSoup as bs
>>> soup = bs(xml, 'html.parser')  # xml is your XML string
>>> xml_dict = {'CR' + x.next_element: x.next_sibling.strip() for x in soup.find_all('xref')}
>>> print(xml_dict)

{u'CR3': u',', 
 u'CR1': u']. Additionally, large randomized trials have shown that neo-adjuvant therapy improves local tumor control even further, regardless of optimized surgical techniques [', 
 u'CR6': u',', 
 u'CR7': u']. Long courses of neo-adjuvant chemoradiation have emerged as the preferential treatment of patients with anticipated tumor invasion of the MRF on MRI in order to downstage/downsize the tumor and to obtain tumor free resection margins [', 
 u'CR4': u']. The advances in rectal cancer treatment have provoked differentiated neo-adjuvant treatment strategies based on anatomical preoperative identifiable risk factors for local tumor recurrence as can be visualized with magnetic resonance imaging (MRI) [', 
 u'CR5': u'].'}
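One thing to watch in the result above: a plain dict keeps a single value per key, so for a rid that appears more than once (CR1, CR3 and CR5 here) only the last text survives. A minimal sketch that collects every text into a list is shown below. It swaps in the standard library's xml.etree.ElementTree instead of BeautifulSoup, which assumes the input is well-formed XML; the sample document is a shortened, hypothetical stand-in, and tails_by_rid is an illustrative name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Shortened, hypothetical stand-in for the XML in the question.
xml = """<body>
<p>Intro [
    <xref ref-type="bibr" rid="CR1">1</xref>&#x02013;
    <xref ref-type="bibr" rid="CR3">3</xref>]. First sentence [
    <xref ref-type="bibr" rid="CR1">1</xref>]. Second sentence [
    <xref ref-type="bibr" rid="CR3">3</xref>,
    <xref ref-type="bibr" rid="CR1">1</xref>]. Third sentence [
    <xref ref-type="bibr" rid="CR5">5</xref>].
</p>
</body>"""

root = ET.fromstring(xml)

tails_by_rid = defaultdict(list)
for xref in root.iter('xref'):
    # In ElementTree, .tail is the text between this element's closing tag
    # and the next tag; it is None when there is no such text.
    tail = (xref.tail or '').strip()
    tails_by_rid[xref.get('rid')].append(tail)

for rid, tails in tails_by_rid.items():
    print(rid, tails)
```

The same idea should carry over to the BeautifulSoup version by reading the key from x['rid'] and appending x.next_sibling.strip() to a defaultdict(list) instead of building the dict in one comprehension.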

The code below works for me; it creates a dictionary (a mapping)!

from bs4 import BeautifulSoup
from collections import defaultdict
import re

d = defaultdict(str)

html ='''
<body>
        <p>The prognosis of patients with rectal cancer has improved since the introduction of total mesorectal excision (TME) surgery [
            <xref ref-type="bibr" rid="CR1">1</xref>&#x02013;
            <xref ref-type="bibr" rid="CR3">3</xref>]. Using this surgical technique the mesorectal compartment including the rectum and perirectal fat is completely excised by sharp dissection along the mesorectal fascia (MRF) [
            <xref ref-type="bibr" rid="CR1">1</xref>]. Additionally, large randomized trials have shown that neo-adjuvant therapy improves local tumor control even further, regardless of optimized surgical techniques [
            <xref ref-type="bibr" rid="CR3">3</xref>, 
            <xref ref-type="bibr" rid="CR4">4</xref>]. The advances in rectal cancer treatment have provoked differentiated neo-adjuvant treatment strategies based on anatomical preoperative identifiable risk factors for local tumor recurrence as can be visualized with magnetic resonance imaging (MRI) [
            <xref ref-type="bibr" rid="CR5">5</xref>]. One of the most important risk factors is the tumor relationship to the MRF, which actually defines the surgical circumferential resection margin (CRM) in TME surgery [
            <xref ref-type="bibr" rid="CR6">6</xref>, 
            <xref ref-type="bibr" rid="CR7">7</xref>]. Long courses of neo-adjuvant chemoradiation have emerged as the preferential treatment of patients with anticipated tumor invasion of the MRF on MRI in order to downstage/downsize the tumor and to obtain tumor free resection margins [
            <xref ref-type="bibr" rid="CR5">5</xref>].
            </p>

</body>

'''

soup = BeautifulSoup(html, 'html.parser')
for xref in soup.find_all('xref'):
    inner = xref.next_element      # the text inside the <xref> tag, e.g. "1"
    txt = str(inner.next_element)  # the text that follows the closing </xref>
    # Keep only fragments of the form "]. ... [" that sit between two citations.
    if re.match(r'\].+\[', txt) is not None:
        d[xref.attrs['rid'].strip()] = txt.strip()
for k, v in d.items():
    print("The value of {0} is>>>>> {1} ".format(k, v))

It prints:

^{pr2}$
