BeautifulSoup: extracting figures and captions from JATS XML

Published 2024-07-07 08:47:45


I want to get the images and their descriptions from a JATS XML file. In my example I am using http://journal.frontiersin.org/article/10.3389/fpls.2011.00008/xml/nlm

The figures are marked up like this:

<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p><bold>Pathways of DSB misrepair...</p></caption>
<graphic xlink:href="fpls-02-00008-g001.tif"/>
</fig>

For every figure I want the contents of <caption>...</caption> and of <graphic xlink:href="..."/>.

So my idea was to use BeautifulSoup's CSS selectors and strip the markup when printing:

#!/usr/bin/python

from bs4 import BeautifulSoup
import urllib.request

content = urllib.request.urlopen('file:///tmp/fpls-02-00008.xml').read()
soup = BeautifulSoup(content, 'xml')

##<fig><caption>XXX</caption></fig>
caption = soup.select("fig caption")

##<fig><graphic xlink:href="YYY"/></fig>
graphic = soup.select("fig graphic")

for a in caption:
    print(a.get_text().strip())

#print(b.get_text()) doesn't work
for b in graphic:
    print(b)

#separator = "|"
#print(separator.join([caption, graphic]))

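Getting only the captions or only the graphics works, but because the source is inconsistent I need to get both at the same time. The result should not be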

  • Caption A
  • Caption B
  • Graphic A
  • Graphic B

but rather

  • Caption A, Graphic A
  • Caption B, Graphic B

How can I achieve this? Thanks in advance!


2 Answers

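You can select the fig elements first, then select caption and graphic inside the same loop. That keeps each caption next to the graphic from the same figure.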

fig = soup.select("fig")
for e in fig:
    print(e.select('caption')[0].get_text().strip())
    print(e.select('graphic')[0]['xlink:href'])

Output:

Pathways of DSB misrepair via single-strand annealing(SSA) or via synthesis-dependent strand annealing (SDSA). (A) Deletion via exonucleolytic 5′-end resection, SSA at complementary overhang sequences, resection of the non-aligned ends, and ligation of break-ends. (B) Insertion into a DSB by break-end invasion and elongationalong an ectopic and partially homologous (vertical bars) template.(C) Re-synthesis of break-ends after invasion into a homologous template double-strand without (gene conversion) or with exchange of flanking regions due to appropriate resolution of Holiday junctions (greenarrow heads).
fpls-02-00008-g001.tif
Schematic models of replication and chromosome labeling patterns after BIR at proximal DSB ends in S and G2. (A) BIR through conservative replication of a one ended DSB during S phase. The DSB appears when the replication fork arrives at a single-strand break (arrow head). Conservative replication occurs via recurrent strand invasion (or via unidirectional fork migration) without resolution of the Holiday junction(s) using the parental double strand as a template. The result after EdU incorporation is an asymmetrically unlabeled terminal chromatid region. (B) BIR during G2 phase, through conservative replication at the proximal end of a DSB (arrow head) via recurrent strand invasion and/or via unidirectional fork migration without resolution of the Holiday junction(s) using the undamaged sister double helix as a template. The result after EdU incorporation is an asymmetrically labeled terminal chromatid region. (C) BIR during G2 phase through semiconservative replication achieved by resolution of the Holiday junction (green arrow head) after invasion of the elongating break-end into the template double strand. The result after EdU incorporation is a symmetrically labeled distal chromatid region. Full lines unlabeled; broken lines labeled by EdU. The distal fragment of the broken double helix in (B,C) gets lost.
fpls-02-00008-g002.tif
Metaphase chromosomes of the field bean. (A) Chromatid-type aberrations after bleomycin treatment. Left cell: isochromatid break (arrow head), the centric, and the acentric chromatid fragments are surrounded by black dots, the homologous undamaged chromosome is surrounded by white dots. Middle cell: symmetric reciprocal chromatid translocation (arrow) and two terminal chromatid breaks (arrow heads). The latter with the broken fragment either switched to the opposite site of the undamaged sister chromatid (left) or being at least 90° apart from the other break-end as in case of the broken secondary constriction (right). Right cell: interstitial deletion (arrow), the deleted fragment remains attached to the undamaged sister chromatid, the chromosome involved is surrounded by black dots. (B) Interstitial asymmetric chromatid labeling (arrows) after bleomycin treatment in the presence of EdU during S phase. (C) Interstitial asymmetric chromatid labeling (arrows) after bleomycin treatment in the presence of EdU during G2. The asymmetric signals appear on chromosomes II, IV, V, and VI, respectively, at interstitial heterochromatic regions composed of homologous tandem repeats (Fuchs et al., 1994).
fpls-02-00008-g003.tif
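
Since the question mentions an inconsistent source, the per-figure loop can be made tolerant of a missing caption or graphic. The following is only a sketch under stated assumptions: the SAMPLE document and the extract_figures helper are hypothetical and not part of the original answer, and the 'xml' parser requires lxml to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data; the second <fig> deliberately has no <caption>.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p><bold>Caption A.</bold> Some text.</p></caption>
<graphic xlink:href="a.tif"/>
</fig>
<fig id="F2" position="float">
<label>Figure 2</label>
<graphic xlink:href="b.tif"/>
</fig>
</article>"""


def extract_figures(xml_text):
    """Return one (caption_text, href) tuple per <fig>.

    A missing <caption> or <graphic> yields None instead of raising.
    """
    soup = BeautifulSoup(xml_text, "xml")
    pairs = []
    for fig in soup.find_all("fig"):
        caption = fig.find("caption")
        graphic = fig.find("graphic")
        # Join the text nodes with a space and trim surrounding whitespace.
        text = caption.get_text(" ", strip=True) if caption else None
        href = graphic.get("xlink:href") if graphic else None
        pairs.append((text, href))
    return pairs


for text, href in extract_figures(SAMPLE):
    print(f"{text} | {href}")
```

Unlike indexing with [0], find() returns None when the element is absent, so a figure without a caption still produces a row and the remaining figures stay aligned.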

You can use zip to loop over both lists at the same time:

>>> A = [1,2,3,4,5]
>>> B = ['A','B','C','D','E']
>>> for number,letter in zip(A,B):
...     print(number, letter)
... 
1 A
2 B
3 C
4 D
5 E
>>> 
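
Applied to the question, zip could pair the two selector results by position. This is a sketch, assuming every fig contains exactly one caption and one graphic; the SAMPLE document and the pair_by_position helper are hypothetical, and the 'xml' parser requires lxml. Because zip stops silently at the shorter list, the sketch checks the lengths first.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data in which every <fig> is complete.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
<fig id="F1"><caption><p>Caption A</p></caption><graphic xlink:href="a.tif"/></fig>
<fig id="F2"><caption><p>Caption B</p></caption><graphic xlink:href="b.tif"/></fig>
</article>"""


def pair_by_position(xml_text, separator=" | "):
    """Pair the n-th caption with the n-th graphic using zip."""
    soup = BeautifulSoup(xml_text, "xml")
    captions = soup.select("fig caption")
    graphics = soup.select("fig graphic")
    # zip would silently drop the surplus items, so fail loudly instead.
    if len(captions) != len(graphics):
        raise ValueError("caption and graphic counts differ")
    return [
        separator.join([c.get_text(" ", strip=True), g["xlink:href"]])
        for c, g in zip(captions, graphics)
    ]


for line in pair_by_position(SAMPLE):
    print(line)
```

If a figure lacks a caption or a graphic, positional pairing would shift every later pair, so looping over each fig element is the safer choice for inconsistent input.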
