<p>I have a manually entered input file made up of citances, each in the following format:</p>
<blockquote>
<p>< S sid ="2" ssid = "2">It differs from previous machine
learning-based NERs in that it uses information from the whole
document to classify each word, with just one classifier.< /S>< S sid
="3" ssid = "3">Previous work that involves the gathering of information from the whole document often uses a secondary classifier,
which corrects the mistakes of a primary sentence- based classifier.<
/S></p>
</blockquote>
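<p>Here is my current approach in Python. I set out to use the re module, but the code below only uses the string methods <code>find</code> and <code>rfind</code>:</p>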
<pre><code>citance = citance[citance.find(">")+1:citance.rfind("<")]
fd.write(citance+"\n")
</code></pre>
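<p>I am trying to extract everything from the first closing angle bracket (">") to the last opening angle bracket ("<"). However, this fails when the input contains more than one citance, because the tags in the middle end up in the output as well:</p>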
<blockquote>
<p>It differs from previous machine learning-based NERs in that it uses
information from the whole document to classify each word, with just
one classifier.< /S>< S sid ="3" ssid = "3">Previous work that involves
the gathering of information from the whole document often uses a
secondary classifier, which corrects the mistakes of a primary
sentence- based classifier.</p>
</blockquote>
<p>My desired output:</p>
<blockquote>
<p>It differs from previous machine learning-based NERs in that it uses
information from the whole document to classify each word, with just
one classifier. Previous work that involves
the gathering of information from the whole document often uses a
secondary classifier, which corrects the mistakes of a primary
sentence- based classifier.</p>
</blockquote>
<p>How can I implement this correctly?</p>
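<p>One possible direction is sketched below. It assumes the only markup in the file is opening and closing <code>S</code> tags, possibly with stray spaces or line breaks inside them as in the sample above. The names <code>TAG_RE</code> and <code>strip_s_tags</code> are hypothetical and used only for illustration. Instead of slicing between the first ">" and the last "<", it replaces every tag with a space using <code>re.sub</code> and then collapses the whitespace:</p>

```python
import re

# Matches an opening tag such as `< S sid ="2" ssid = "2">` or a closing
# tag such as `< /S>`, tolerating whitespace (including line breaks)
# inside the tag. Assumption: only S tags need to be removed.
TAG_RE = re.compile(r"<\s*/?\s*S\b[^>]*>")


def strip_s_tags(citance):
    """Remove all S tags from a citance and normalize whitespace."""
    # Replace each tag with a space so adjacent sentences do not run together.
    text = TAG_RE.sub(" ", citance)
    # Collapse runs of whitespace (including newlines) and trim the ends.
    return " ".join(text.split())


sample = ('< S sid ="2" ssid = "2">First sentence.< /S>< S sid\n'
          '="3" ssid = "3">Second sentence.<\n/S>')
print(strip_s_tags(sample))  # First sentence. Second sentence.
```

<p>With the original snippet, the last line would then become <code>fd.write(strip_s_tags(citance) + "\n")</code>. If the file can contain other kinds of tags, the pattern would need to be widened accordingly.</p>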