How do I extract plain text from MediaWiki?

Posted 2024-09-28 20:45:11


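I exported some categories from https://awoiaf.westeros.org/index.php/Special:Export and they are in XML format. I want the plain text of the "Synopsis" sections. You can download the whole thing here (54 KB compressed).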

A typical Synopsis section looks like this:

==Synopsis== [[Catelyn Tully|Catelyn]] listens to the continuous pounding noise of the drums the musicians in the hall are playing. She is seated between [[Ryman Frey]] and [[Roose Bolton]] during the wedding feast. She remarks to herself how joyless the wedding is, and watches as [[Robb Stark|Robb]] dances with several of the Frey maids and [[Edmure Tully|Edmure]] dotes on his soon to be wife, [[Roslin Frey|Roslin]]. Catelyn becomes more wary when she learns that [[Olyvar Frey|Olyvar]], [[Perwyn Frey|Perwyn]], and [[Alesander Frey]] are all not in attendance at the wedding. She notices [[Merrett Frey]] trying to drink the [[Greatjon Umber|Greatjon]] under the table, and finally Lord [[Walder Frey]] calls for the bedding. Robb does not participate as the Greatjon carries a weeping Roslin to the bed chamber.

How can I extract the plain text from all of the Synopsis sections?


1 answer
Anonymous user
#1 · Posted 2024-09-28 20:45:11

First, you need to parse it as XML. I recommend using lxml and XPath:

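from lxml import etree

# Parse the exported XML dump
tree = etree.parse('file.xml')
# Select the wikitext of every page revision
expression = '/m:mediawiki/m:page/m:revision/m:text/text()'
# The namespace URI must match the xmlns declared on the dump's root element
namespaces = {"m": "http://www.mediawiki.org/xml/export-0.10/"}
# texts is a list of strings holding the raw wikitext
texts = tree.xpath(expression, namespaces=namespaces)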

Once you have all the text sections, parse them one by one with regular expressions, or write your own parser.
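The regular-expression step could look roughly like the sketch below. It is only an illustration, not a complete wikitext parser: the names SECTION_RE, LINK_RE and synopsis_plain_text are made up for this example, and it assumes the section is introduced by a level-2 "==Synopsis==" heading and that the only markup worth handling is [[target|label]] links and bold/italic quote marks. Templates, references and nested markup are not handled.

```python
import re

# Hypothetical helper names; nothing here comes from the original answer.
# Capture everything after a "==Synopsis==" heading up to the next
# level-2 heading ("==Something==") or the end of the text.
SECTION_RE = re.compile(
    r"^==\s*Synopsis\s*==[ \t]*\n?(.*?)(?=^==[^=]|\Z)",
    re.DOTALL | re.MULTILINE,
)

# [[Target|Label]] -> Label, [[Target]] -> Target
LINK_RE = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def synopsis_plain_text(wikitext):
    """Return the Synopsis section as plain text, or None if it is absent."""
    match = SECTION_RE.search(wikitext)
    if match is None:
        return None
    body = LINK_RE.sub(r"\1", match.group(1))
    # Drop bold (''') and italic ('') markers.
    body = body.replace("'''", "").replace("''", "")
    return body.strip()


if __name__ == "__main__":
    sample = (
        "==Synopsis== [[Catelyn Tully|Catelyn]] is seated between "
        "[[Ryman Frey]] and [[Roose Bolton]].\n"
        "==References==\n"
    )
    print(synopsis_plain_text(sample))
```

With the texts list from the lxml snippet, something like [synopsis_plain_text(t) for t in texts] would give one result per page, with None for pages that have no Synopsis section. If the markup turns out to be more complicated than this, a dedicated wikitext parser such as mwparserfromhell is probably a better choice than growing the regular expressions.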
