<p>I recently came across the <a href="https://en.wikipedia.org/wiki/Standard_Generalized_Markup_Language" rel="nofollow noreferrer">Standard Generalized Markup Language</a> (SGML). I obtained a corpus in SGML format from the <a href="http://metashare.elda.org/repository/browse/the-emilleciil-corpus/abdd35c8de6f11e2b1e400259011f6ea6bce74d38dbb42d881da76c64a6adb20/" rel="nofollow noreferrer">EMILLE/CIIL Corpus</a>. Here is the documentation for the corpus:</p>
<p><a href="http://www.lancaster.ac.uk/fass/projects/corpus/emille/MANUAL.htm" rel="nofollow noreferrer">EMILLE Corpus Documentation</a></p>
<p>I only want to extract the text from the files. The documentation describes the corpus's encoding and markup as follows:</p>
<blockquote>
<p>The text is encoded as two-byte Unicode text. For more information on Unicode.
The texts are marked up in SGML using level 1 CES-compliant markup. Each file also includes a full header, which specifies the provenance of the text.</p>
</blockquote>
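<p>I'm having trouble stripping out the tags. I tried regular expressions and BeautifulSoup, but neither worked. Below is a sample text file; the language I want to keep is Punjabi.</p>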
<p><a href="https://i.stack.imgur.com/VwBdl.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/VwBdl.png" alt="Sample text file"/></a></p>
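<p>To make the goal concrete, here is a minimal sketch of the kind of tag stripping I mean. It rests on assumptions I have not verified against the corpus: that "two-byte Unicode" means UTF-16 with a byte-order mark, and that the header is wrapped in a <code>cesHeader</code> element. The function names <code>read_sgml</code> and <code>extract_text</code> are just illustrative.</p>

```python
import re

# Matches any single tag such as <p>, </body> or <text lang="pa">.
TAG_RE = re.compile(r"<[^>]+>")

# Assumption: the provenance header is wrapped in <cesHeader>...</cesHeader>.
HEADER_RE = re.compile(r"<cesHeader\b.*?</cesHeader>", re.DOTALL | re.IGNORECASE)


def read_sgml(path):
    """Read a corpus file as text.

    Assumption: "two-byte Unicode" means UTF-16 with a byte-order mark.
    If a file has no BOM, "utf-16-le" or "utf-16-be" may be needed instead.
    """
    with open(path, encoding="utf-16") as f:
        return f.read()


def extract_text(sgml):
    """Drop the header block and every tag, returning only the text."""
    body = HEADER_RE.sub("", sgml)   # remove the whole header first
    text = TAG_RE.sub(" ", body)     # replace each remaining tag with a space
    return " ".join(text.split())    # collapse runs of whitespace
```

<p>Usage would be something like <code>extract_text(read_sgml("some_file.txt"))</code>, where the file name is a placeholder. This does not decode SGML entity references, so any of those would still need separate handling.</p>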