A Python package for processing MediaWiki XML content dumps



mediawiki-dump


pip install mediawiki_dump

Python 3 package for processing MediaWiki XML content dumps.

Supports Wikipedia (bz2-compressed) and Wikia (7zip) content dumps.
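
To give a feel for what reading a bz2-compressed dump involves, here is a minimal standard-library sketch. It is not this package's implementation: the sample XML, the `local_name` helper and the `iter_titles` function are made up for illustration, and the export namespace version in the sample is an assumption.

```python
import bz2
import io
import xml.etree.ElementTree as ET

# Hypothetical two-page sample standing in for a real dump file.
SAMPLE_XML = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Forsíða</title><ns>0</ns><revision><text>Hello</text></revision></page>
  <page><title>Kjak:Forsíða</title><ns>1</ns><revision><text>Talk</text></revision></page>
</mediawiki>""".encode("utf-8")

# Compress in memory so the example needs no files or network access.
compressed = bz2.compress(SAMPLE_XML)


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_titles(stream):
    """Yield page titles one by one without loading the whole dump."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) == "page":
            for child in elem:
                if local_name(child.tag) == "title":
                    yield child.text
            elem.clear()  # free the memory used by the finished page


with bz2.open(io.BytesIO(compressed), "rb") as stream:
    titles = list(iter_titles(stream))

print(titles)
```

Streaming with `iterparse` matters for real dumps, which can be far larger than available memory once decompressed.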

Dependencies

In order to read 7zip archives (used by Wikia's XML dumps) you need to install libarchive:

sudo apt install libarchive-dev

API

Tokenizer

Allows you to clean up wikitext:

from mediawiki_dump.tokenizer import clean

clean('[[Foo|bar]] is a link')
# 'bar is a link'
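
For intuition about one piece of such cleaning, the sketch below replaces internal links with their visible label using a regular expression. It is only an illustration and makes no claim about how `clean` works internally; `LINK_RE` and `strip_links` are hypothetical names, and real wikitext (templates, nested links, files) needs far more than this.

```python
import re

# [[Target|label]] becomes "label"; [[Target]] becomes "Target".
LINK_RE = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]")


def strip_links(wikitext):
    """Replace simple internal links with the text a reader would see."""
    return LINK_RE.sub(r"\1", wikitext)


print(strip_links("[[Foo|bar]] is a link"))
print(strip_links("See [[Føroyar]]"))
```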

And then tokenize the text:

from mediawiki_dump.tokenizer import tokenize

tokenize('11. juni 2007 varð kunngjørt, at Svínoyar kommuna verður løgd saman við Klaksvíkar kommunu eftir komandi bygdaráðsval.')
# ['juni', 'varð', 'kunngjørt', 'at', 'Svínoyar', 'kommuna', 'verður', 'løgd', 'saman', 'við', 'Klaksvíkar', 'kommunu', 'eftir', 'komandi', 'bygdaráðsval']
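
The output above drops numbers and punctuation and keeps non-ASCII letters. A rough standard-library approximation of that behaviour, assuming a token is simply a run of Unicode letters, could look like the sketch below. `simple_tokenize` is a hypothetical helper, not the package's `tokenize`, and the two may differ on other inputs (hyphenated words, apostrophes).

```python
import re

# A run of word characters that are neither digits nor underscores,
# i.e. Unicode letters only.
WORD_RE = re.compile(r"[^\W\d_]+")


def simple_tokenize(text):
    """Return the letter-only tokens of the text, in order."""
    return WORD_RE.findall(text)


tokens = simple_tokenize(
    "11. juni 2007 varð kunngjørt, at Svínoyar kommuna verður løgd saman "
    "við Klaksvíkar kommunu eftir komandi bygdaráðsval."
)
print(tokens)
```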

Dump reader

Fetch and parse dumps (using a local file cache):

from mediawiki_dump.dumps import WikipediaDump
from mediawiki_dump.reader import DumpReader

dump = WikipediaDump('fo')
pages = DumpReader().read(dump)

[page.title for page in pages][:10]
# ['Main Page', 'Brúkari:Jon Harald Søby', 'Forsíða', 'Ormurin Langi', 'Regin smiður', 'Fyrimynd:InterLingvLigoj', 'Heimsyvirlýsingin um mannarættindi', 'Bólkur:Kvæði', 'Bólkur:Yrking', 'Kjak:Forsíða']

The read method yields a DumpEntry object for each revision.

By using the DumpReaderArticles class you can read article pages only:

import logging

from mediawiki_dump.dumps import WikipediaDump
from mediawiki_dump.reader import DumpReaderArticles

logging.basicConfig(level=logging.INFO)

dump = WikipediaDump('fo')
reader = DumpReaderArticles()
pages = reader.read(dump)

print([page.title for page in pages][:25])

print(reader.get_dump_language())  # fo

Will give you:

INFO:DumpReaderArticles:Parsing XML dump...
INFO:WikipediaDump:Checking /tmp/wikicorpus_62da4928a0a307185acaaa94f537d090.bz2 cache file...
INFO:WikipediaDump:Fetching fo dump from <https://dumps.wikimedia.org/fowiki/latest/fowiki-latest-pages-meta-current.xml.bz2>...
INFO:WikipediaDump:HTTP 200 (14105 kB will be fetched)
INFO:WikipediaDump:Cache set
...
['WIKIng', 'Føroyar', 'Borðoy', 'Eysturoy', 'Fugloy', 'Forsíða', 'Løgmenn í Føroyum', 'GNU Free Documentation License', 'GFDL', 'Opið innihald', 'Wikipedia', 'Alfrøði', '2004', '20. juni', 'WikiWiki', 'Wiki', 'Danmark', '21. juni', '22. juni', '23. juni', 'Lívfrøði', '24. juni', '25. juni', '26. juni', '27. juni']
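
In MediaWiki XML exports each page carries an `<ns>` element, and the main (article) namespace is 0. The sketch below shows how filtering on that value separates articles from user, category and talk pages. It is an assumption that an articles-only reader works along these lines; the sample data, `NS` mapping and `article_titles` function are illustrative, not part of this package.

```python
import xml.etree.ElementTree as ET

# Namespace prefix mapping for the export schema (version assumed here).
NS = {"mw": "http://www.mediawiki.org/xml/export-0.10/"}

# Hypothetical sample: one article plus a user page and a category page.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Føroyar</title><ns>0</ns></page>
  <page><title>Brúkari:Jon Harald Søby</title><ns>2</ns></page>
  <page><title>Bólkur:Kvæði</title><ns>14</ns></page>
</mediawiki>"""


def article_titles(root):
    """Return titles of pages in the main namespace (ns == 0)."""
    return [
        page.findtext("mw:title", namespaces=NS)
        for page in root.findall("mw:page", NS)
        if page.findtext("mw:ns", namespaces=NS) == "0"
    ]


articles = article_titles(ET.fromstring(SAMPLE))
print(articles)
```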

Reading Wikia's dumps

import logging

from mediawiki_dump.dumps import WikiaDump
from mediawiki_dump.reader import DumpReaderArticles

logging.basicConfig(level=logging.INFO)

dump = WikiaDump('plnordycka')
pages = DumpReaderArticles().read(dump)

print([page.title for page in pages][:25])

Will give you:

INFO:DumpReaderArticles:Parsing XML dump...
INFO:WikiaDump:Checking /tmp/wikicorpus_f7dd3b75c5965ee10ae5fe4643fb806b.7z cache file...
INFO:WikiaDump:Fetching plnordycka dump from <https://s3.amazonaws.com/wikia_xml_dumps/p/pl/plnordycka_pages_current.xml.7z>...
INFO:WikiaDump:HTTP 200 (129 kB will be fetched)
INFO:WikiaDump:Cache set
INFO:WikiaDump:Reading wikicorpus_f7dd3b75c5965ee10ae5fe4643fb806b file from dump
...
INFO:DumpReaderArticles:Parsing completed, entries found: 615
['Nordycka Wiki', 'Strona główna', '1968', '1948', 'Ormurin Langi', 'Mykines', 'Trollsjön', 'Wyspy Owcze', 'Nólsoy', 'Sandoy', 'Vágar', 'Mørk', 'Eysturoy', 'Rakfisk', 'Hákarl', '1298', 'Sztokfisz', '1978', '1920', 'Najbardziej na północ', 'Svalbard', 'Hamferð', 'Rok w Skandynawii', 'Islandia', 'Rissajaure']
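
The log lines above mention cache files named `wikicorpus_` plus a 32-character hexadecimal string in the temporary directory. One plausible way to produce such names is to hash a key identifying the dump, as sketched below. What the package actually hashes is not stated here, so the key (the dump URL), the MD5 choice and the `cache_path` helper are all assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def cache_path(key, extension, prefix="wikicorpus_"):
    """Build a stable cache file path from an arbitrary string key."""
    # Hashing gives a fixed-length, filesystem-safe name for any key.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return os.path.join(tempfile.gettempdir(), f"{prefix}{digest}.{extension}")


url = "https://s3.amazonaws.com/wikia_xml_dumps/p/pl/plnordycka_pages_current.xml.7z"
path = cache_path(url, "7z")
print(path)

# The same key always maps to the same file, so a second run can reuse it.
print(os.path.exists(path))
```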

Fetching full history

Pass full_history=True to the BaseDump constructor (here via WikiaDump) to fetch the XML content dump with full history:

import logging

from mediawiki_dump.dumps import WikiaDump
from mediawiki_dump.reader import DumpReaderArticles

logging.basicConfig(level=logging.INFO)

dump = WikiaDump('macbre', full_history=True)  # fetch full history, including old revisions
pages = DumpReaderArticles().read(dump)

print('\n'.join([repr(page) for page in pages]))

Will give you:

INFO:DumpReaderArticles:Parsing completed, entries found: 384
<DumpEntry "Macbre Wiki" by Default at 2016-10-12T19:51:06+00:00>
<DumpEntry "Macbre Wiki" by Wikia at 2016-10-12T19:51:05+00:00>
<DumpEntry "Macbre Wiki" by Macbre at 2016-11-04T10:33:20+00:00>
<DumpEntry "Macbre Wiki" by FandomBot at 2016-11-04T10:37:17+00:00>
<DumpEntry "Macbre Wiki" by FandomBot at 2017-01-25T14:47:37+00:00>
<DumpEntry "Macbre Wiki" by Ryba777 at 2017-04-10T11:20:25+00:00>
<DumpEntry "Macbre Wiki" by Ryba777 at 2017-04-10T11:21:20+00:00>
<DumpEntry "Macbre Wiki" by Macbre at 2018-03-07T12:51:12+00:00>
<DumpEntry "Main Page" by Wikia at 2016-10-12T19:51:05+00:00>
<DumpEntry "FooBar" by Anonymous at 2016-11-08T10:15:33+00:00>
<DumpEntry "FooBar" by Anonymous at 2016-11-08T10:15:49+00:00>
...
<DumpEntry "YouTube tag" by FANDOMbot at 2018-06-05T11:45:44+00:00>
<DumpEntry "Maps" by Macbre at 2018-06-06T08:51:24+00:00>
<DumpEntry "Maps" by Macbre at 2018-06-07T08:17:13+00:00>
<DumpEntry "Maps" by Macbre at 2018-06-07T08:17:36+00:00>
<DumpEntry "Scary transclusion" by Macbre at 2018-07-24T14:52:20+00:00>
<DumpEntry "Lua" by Macbre at 2018-09-11T14:04:15+00:00>
<DumpEntry "Lua" by Macbre at 2018-09-11T14:14:24+00:00>
<DumpEntry "Lua" by Macbre at 2018-09-11T14:14:37+00:00>
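
With full history each page shows up once per revision, so a common follow-up step is to keep only the newest revision of every title. The sketch below does this on a few rows copied from the output above. `Revision` is a hypothetical stand-in type, not the package's DumpEntry, whose attribute names are not assumed here.

```python
from datetime import datetime
from typing import NamedTuple


class Revision(NamedTuple):
    """Hypothetical stand-in for one revision entry."""
    title: str
    contributor: str
    timestamp: str  # ISO 8601, as printed in the output above


revisions = [
    Revision("Maps", "Macbre", "2018-06-06T08:51:24+00:00"),
    Revision("Maps", "Macbre", "2018-06-07T08:17:13+00:00"),
    Revision("Maps", "Macbre", "2018-06-07T08:17:36+00:00"),
    Revision("Lua", "Macbre", "2018-09-11T14:04:15+00:00"),
    Revision("Lua", "Macbre", "2018-09-11T14:14:24+00:00"),
    Revision("Lua", "Macbre", "2018-09-11T14:14:37+00:00"),
]


def latest_per_title(revs):
    """Map each title to its most recent revision."""
    latest = {}
    for rev in revs:
        current = latest.get(rev.title)
        if current is None or (
            datetime.fromisoformat(rev.timestamp)
            > datetime.fromisoformat(current.timestamp)
        ):
            latest[rev.title] = rev
    return latest


latest = latest_per_title(revisions)
for title, rev in latest.items():
    print(title, rev.timestamp)
```

Parsing the timestamps rather than comparing strings keeps the comparison correct even if entries carry different UTC offsets.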

Reading dumps of selected articles

You can use the mwclient Python library to fetch a "live" dump of selected articles from any MediaWiki-powered site.

import mwclient

from mediawiki_dump.dumps import MediaWikiClientDump
from mediawiki_dump.reader import DumpReaderArticles

site = mwclient.Site('vim.fandom.com', path='/')

dump = MediaWikiClientDump(site, ['Vim documentation', 'Tutorial'])
pages = DumpReaderArticles().read(dump)

print('\n'.join([repr(page) for page in pages]))

Will give you:

<DumpEntry "Vim documentation" by Anonymous at 2019-07-05T09:39:47+00:00>
<DumpEntry "Tutorial" by Anonymous at 2019-07-05T09:41:19+00:00>
