Python package for processing MediaWiki XML content dumps
Detailed description of the mediawiki-dump Python project
mediawiki-dump
pip install mediawiki_dump
A Python 3 package for processing MediaWiki XML content dumps.
It supports Wikipedia (bz2-compressed) and Wikia (7zip) content dumps.
Dependencies
To read 7zip archives (used by Wikia's XML dumps), you need to install libarchive:
sudo apt install libarchive-dev
API
Tokenizer
Lets you clean up wikitext:

    from mediawiki_dump.tokenizer import clean

    clean('[[Foo|bar]] is a link')
    # 'bar is a link'

And then tokenize the text:

    from mediawiki_dump.tokenizer import tokenize

    tokenize('11. juni 2007 varð kunngjørt, at Svínoyar kommuna verður løgd saman við Klaksvíkar kommunu eftir komandi bygdaráðsval.')
    # ['juni', 'varð', 'kunngjørt', 'at', 'Svínoyar', 'kommuna', 'verður', 'løgd', 'saman', 'við', 'Klaksvíkar', 'kommunu', 'eftir', 'komandi', 'bygdaráðsval']
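To make the two steps above concrete, here is a minimal standalone sketch of what link cleaning and tokenizing might look like. It is not the package's implementation: the helper names clean_links and split_words and both regular expressions are assumptions made for illustration, and the real tokenizer probably handles far more wikitext syntax.

```python
import re
from typing import List

# Matches [[Target]] and [[Target|label]]; group 1 is the visible text.
_LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]+)\]\]")
# One or more Unicode letters (word characters minus digits and underscore).
_WORD = re.compile(r"[^\W\d_]+")


def clean_links(wikitext: str) -> str:
    """Replace internal wiki links with their visible text (hypothetical helper)."""
    return _LINK.sub(r"\1", wikitext)


def split_words(text: str) -> List[str]:
    """Return runs of letters, dropping digits and punctuation (hypothetical helper)."""
    return _WORD.findall(text)


print(clean_links("[[Foo|bar]] is a link"))
print(split_words("11. juni 2007 varð kunngjørt, at Svínoyar kommuna"))
```

The letter-only pattern is why date fragments such as "11" and "2007" would disappear from the token list, which matches the sample output shown above.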
Dump reader
Fetch and parse dumps (using a local file cache):

    from mediawiki_dump.dumps import WikipediaDump
    from mediawiki_dump.reader import DumpReader

    dump = WikipediaDump('fo')
    pages = DumpReader().read(dump)

    [page.title for page in pages][:10]
    # ['Main Page', 'Brúkari:Jon Harald Søby', 'Forsíða', 'Ormurin Langi', 'Regin smiður', 'Fyrimynd:InterLingvLigoj', 'Heimsyvirlýsingin um mannarættindi', 'Bólkur:Kvæði', 'Bólkur:Yrking', 'Kjak:Forsíða']

The read method yields a DumpEntry object for each revision.
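If read really is a lazy generator, as "yields" suggests, then slicing a list comprehension parses the whole dump before the slice is taken. The sketch below shows how itertools.islice could stop early instead. The fake_read function is a hypothetical stand-in for DumpReader().read(dump), used so the example runs without downloading anything.

```python
from itertools import islice
from typing import Iterator, List

parsed: List[int] = []  # records how many entries the stand-in reader produced


def fake_read() -> Iterator[str]:
    """Hypothetical stand-in for DumpReader().read(dump): yields titles lazily."""
    for number in range(1000):
        parsed.append(number)
        yield f"Page {number}"


# Take only the first ten entries; the generator is never advanced further.
first_ten = list(islice(fake_read(), 10))

print(first_ten[:3])
print(len(parsed))
```

With a real dump the same pattern would be list(islice(pages, 10)), assuming pages is the iterator returned by read.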
By using the DumpReaderArticles class you can read article pages only:

    import logging; logging.basicConfig(level=logging.INFO)

    from mediawiki_dump.dumps import WikipediaDump
    from mediawiki_dump.reader import DumpReaderArticles

    dump = WikipediaDump('fo')
    reader = DumpReaderArticles()
    pages = reader.read(dump)

    print([page.title for page in pages][:25])

    print(reader.get_dump_language())  # fo

Will give you:
INFO:DumpReaderArticles:Parsing XML dump...
INFO:WikipediaDump:Checking /tmp/wikicorpus_62da4928a0a307185acaaa94f537d090.bz2 cache file...
INFO:WikipediaDump:Fetching fo dump from <https://dumps.wikimedia.org/fowiki/latest/fowiki-latest-pages-meta-current.xml.bz2>...
INFO:WikipediaDump:HTTP 200 (14105 kB will be fetched)
INFO:WikipediaDump:Cache set
...
['WIKIng', 'Føroyar', 'Borðoy', 'Eysturoy', 'Fugloy', 'Forsíða', 'Løgmenn í Føroyum', 'GNU Free Documentation License', 'GFDL', 'Opið innihald', 'Wikipedia', 'Alfrøði', '2004', '20. juni', 'WikiWiki', 'Wiki', 'Danmark', '21. juni', '22. juni', '23. juni', 'Lívfrøði', '24. juni', '25. juni', '26. juni', '27. juni']
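The article-only list above no longer contains prefixed titles such as the user, template, category and talk pages seen earlier. One plausible way to get that effect is to keep only pages whose ns element is 0 (the main namespace in MediaWiki's XML export format). The sketch below illustrates the idea on a tiny hand-written sample. It is an assumption about the approach, not the package's actual code, and the namespace numbers in the sample are illustrative.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written export; real dumps also declare an XML namespace on <mediawiki>.
SAMPLE = """
<mediawiki>
  <page><title>Føroyar</title><ns>0</ns></page>
  <page><title>Kjak:Forsíða</title><ns>1</ns></page>
  <page><title>Fyrimynd:InterLingvLigoj</title><ns>10</ns></page>
  <page><title>Borðoy</title><ns>0</ns></page>
</mediawiki>
"""


def article_titles(xml_text):
    """Return titles of pages in the main namespace (ns == 0)."""
    root = ET.fromstring(xml_text.strip())
    return [
        page.findtext("title")
        for page in root.iter("page")
        if page.findtext("ns") == "0"
    ]


print(article_titles(SAMPLE))
```

For a real dump, the element names would need the namespace prefix declared on the root element, and a streaming parser would be preferable to loading the whole file.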
Reading Wikia's dumps

    import logging; logging.basicConfig(level=logging.INFO)

    from mediawiki_dump.dumps import WikiaDump
    from mediawiki_dump.reader import DumpReaderArticles

    dump = WikiaDump('plnordycka')
    pages = DumpReaderArticles().read(dump)

    print([page.title for page in pages][:25])

Will give you:
INFO:DumpReaderArticles:Parsing XML dump...
INFO:WikiaDump:Checking /tmp/wikicorpus_f7dd3b75c5965ee10ae5fe4643fb806b.7z cache file...
INFO:WikiaDump:Fetching plnordycka dump from <https://s3.amazonaws.com/wikia_xml_dumps/p/pl/plnordycka_pages_current.xml.7z>...
INFO:WikiaDump:HTTP 200 (129 kB will be fetched)
INFO:WikiaDump:Cache set
INFO:WikiaDump:Reading wikicorpus_f7dd3b75c5965ee10ae5fe4643fb806b file from dump
...
INFO:DumpReaderArticles:Parsing completed, entries found: 615
['Nordycka Wiki', 'Strona główna', '1968', '1948', 'Ormurin Langi', 'Mykines', 'Trollsjön', 'Wyspy Owcze', 'Nólsoy', 'Sandoy', 'Vágar', 'Mørk', 'Eysturoy', 'Rakfisk', 'Hákarl', '1298', 'Sztokfisz', '1978', '1920', 'Najbardziej na północ', 'Svalbard', 'Hamferð', 'Rok w Skandynawii', 'Islandia', 'Rissajaure']
Fetching full history
Pass full_history to the BaseDump constructor to fetch an XML content dump with the full revision history:

    import logging; logging.basicConfig(level=logging.INFO)

    from mediawiki_dump.dumps import WikiaDump
    from mediawiki_dump.reader import DumpReaderArticles

    dump = WikiaDump('macbre', full_history=True)  # fetch full history, including old revisions
    pages = DumpReaderArticles().read(dump)

    print('\n'.join([repr(page) for page in pages]))

Will give you:
INFO:DumpReaderArticles:Parsing completed, entries found: 384
<DumpEntry "Macbre Wiki" by Default at 2016-10-12T19:51:06+00:00>
<DumpEntry "Macbre Wiki" by Wikia at 2016-10-12T19:51:05+00:00>
<DumpEntry "Macbre Wiki" by Macbre at 2016-11-04T10:33:20+00:00>
<DumpEntry "Macbre Wiki" by FandomBot at 2016-11-04T10:37:17+00:00>
<DumpEntry "Macbre Wiki" by FandomBot at 2017-01-25T14:47:37+00:00>
<DumpEntry "Macbre Wiki" by Ryba777 at 2017-04-10T11:20:25+00:00>
<DumpEntry "Macbre Wiki" by Ryba777 at 2017-04-10T11:21:20+00:00>
<DumpEntry "Macbre Wiki" by Macbre at 2018-03-07T12:51:12+00:00>
<DumpEntry "Main Page" by Wikia at 2016-10-12T19:51:05+00:00>
<DumpEntry "FooBar" by Anonymous at 2016-11-08T10:15:33+00:00>
<DumpEntry "FooBar" by Anonymous at 2016-11-08T10:15:49+00:00>
...
<DumpEntry "YouTube tag" by FANDOMbot at 2018-06-05T11:45:44+00:00>
<DumpEntry "Maps" by Macbre at 2018-06-06T08:51:24+00:00>
<DumpEntry "Maps" by Macbre at 2018-06-07T08:17:13+00:00>
<DumpEntry "Maps" by Macbre at 2018-06-07T08:17:36+00:00>
<DumpEntry "Scary transclusion" by Macbre at 2018-07-24T14:52:20+00:00>
<DumpEntry "Lua" by Macbre at 2018-09-11T14:04:15+00:00>
<DumpEntry "Lua" by Macbre at 2018-09-11T14:14:24+00:00>
<DumpEntry "Lua" by Macbre at 2018-09-11T14:14:37+00:00>
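With full history, the same title appears once per revision, so a common next step is to count revisions or keep only the newest one. The sketch below shows one way to do that. Entry is a hypothetical stand-in for DumpEntry, and its field names (title, contributor, timestamp) are assumptions based on the repr output above, so check the real attribute names before adapting it.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for DumpEntry; field names are assumptions based on the repr above.
Entry = namedtuple("Entry", ["title", "contributor", "timestamp"])

entries = [
    Entry("Macbre Wiki", "Wikia", "2016-10-12T19:51:05+00:00"),
    Entry("Macbre Wiki", "Macbre", "2018-03-07T12:51:12+00:00"),
    Entry("Main Page", "Wikia", "2016-10-12T19:51:05+00:00"),
    Entry("Lua", "Macbre", "2018-09-11T14:04:15+00:00"),
    Entry("Lua", "Macbre", "2018-09-11T14:14:37+00:00"),
]

# Number of revisions seen for each title.
revisions_per_title = Counter(entry.title for entry in entries)

# Newest revision per title.
latest = {}
for entry in entries:
    # ISO 8601 timestamps with the same UTC offset compare correctly as strings.
    if entry.title not in latest or entry.timestamp > latest[entry.title].timestamp:
        latest[entry.title] = entry

print(revisions_per_title)
print(latest["Lua"].timestamp)
```

Comparing timestamps as strings only works while every entry uses the same offset, as in the sample output. Parsing them with datetime.fromisoformat would be the safer choice otherwise.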
Reading dumps of selected articles
You can use the mwclient library to read selected articles from a MediaWiki site:

    import mwclient
    site = mwclient.Site('vim.fandom.com', path='/')

    from mediawiki_dump.dumps import MediaWikiClientDump
    from mediawiki_dump.reader import DumpReaderArticles

    dump = MediaWikiClientDump(site, ['Vim documentation', 'Tutorial'])
    pages = DumpReaderArticles().read(dump)

    print('\n'.join([repr(page) for page in pages]))

Will give you:
<DumpEntry "Vim documentation" by Anonymous at 2019-07-05T09:39:47+00:00>
<DumpEntry "Tutorial" by Anonymous at 2019-07-05T09:41:19+00:00>