How to extract plain text from MediaWiki? - Q&A - Python中文网

Posted 2024-09-28 20:45:11

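I exported some categories from https://awoiaf.westeros.org/index.php/Special:Export. They are in XML format. I want the plain text of the "Synopsis" sections. You can download the whole thing here (54 KB compressed).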

A typical Synopsis section looks like this:

==Synopsis== [[Catelyn Tully|Catelyn]] listens to the continuous pounding noise of the drums the musicians in the hall are playing. She is seated between [[Ryman Frey]] and [[Roose Bolton]] during the wedding feast. She remarks to herself how joyless the wedding is, and watches as [[Robb Stark|Robb]] dances with several of the Frey maids and [[Edmure Tully|Edmure]] dotes on his soon to be wife, [[Roslin Frey|Roslin]]. Catelyn becomes more wary when she learns that [[Olyvar Frey|Olyvar]], [[Perwyn Frey|Perwyn]], and [[Alesander Frey]] are all not in attendance at the wedding. She notices [[Merrett Frey]] trying to drink the [[Greatjon Umber|Greatjon]] under the table, and finally Lord [[Walder Frey]] calls for the bedding. Robb does not participate as the Greatjon carries a weeping Roslin to the bed chamber.

How can I extract the plain text from all of the Synopsis sections?

1 answer

Anonymous user

#1 · Posted 2024-09-28 20:45:11

First, you need to parse the export as XML. I suggest using lxml and XPath:

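from lxml import etree

# Parse the exported XML file
tree = etree.parse('file.xml')
# Select the wikitext of every page revision
expression = '/m:mediawiki/m:page/m:revision/m:text/text()'
# This URI must match the xmlns declared on the <mediawiki> root of your export
namespaces = {"m": "http://www.mediawiki.org/xml/export-0.10/"}
texts = tree.xpath(expression, namespaces=namespaces)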
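The namespace URI in the snippet above is tied to one export schema version, so it may not match every export file. As a rough alternative sketch, the following uses the standard library's xml.etree.ElementTree instead of lxml, with the `{*}` namespace wildcard (available from Python 3.8). The function name `page_texts` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def page_texts(xml_source):
    """Return the raw wikitext of each page revision in a MediaWiki export string."""
    root = ET.fromstring(xml_source)
    # "{*}" matches an element in any namespace, so the export schema
    # version does not have to be hard-coded in the path.
    path = "{*}page/{*}revision/{*}text"
    # An empty <text/> element has text None; normalise it to "".
    return [element.text or "" for element in root.iterfind(path)]
```

For a file on disk, `ET.parse('file.xml').getroot().iterfind(...)` with the same path should behave the same way.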

Once you have all the text parts, process them one by one with regular expressions, or write your own parser.
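A minimal sketch of the regular-expression step, assuming each Synopsis section starts with a `==Synopsis==` heading at the start of a line and runs to the next level-2 heading or the end of the page. The names `SECTION_RE`, `LINK_RE` and `synopsis_plain_text` are made up for illustration. Only `[[target|label]]` links and `''`/`'''` emphasis markers are handled; templates, `<ref>` tags and other markup are left as they are.

```python
import re

# Capture everything after a "==Synopsis==" heading up to the next
# level-2 heading ("==Something==") or the end of the text.
SECTION_RE = re.compile(
    r"^==\s*Synopsis\s*==\s*(.*?)(?=^==[^=]|\Z)",
    re.DOTALL | re.MULTILINE,
)

# [[Target|Label]] -> Label, [[Target]] -> Target
LINK_RE = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def synopsis_plain_text(wikitext):
    """Return the Synopsis section as plain text, or None if there is none."""
    match = SECTION_RE.search(wikitext)
    if match is None:
        return None
    body = LINK_RE.sub(r"\1", match.group(1))
    # Drop wikitext emphasis markers ('' for italic, ''' for bold).
    body = re.sub(r"'{2,}", "", body)
    return body.strip()
```

Applied to each entry of `texts`, this gives one plain-text synopsis per page, for example `[synopsis_plain_text(t) for t in texts]`. For markup that regular expressions handle poorly, such as nested templates, a dedicated wikitext parser like mwparserfromhell is probably a better fit.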
