How do I get the complete Wikipedia revision history list for an article?

2 answers

User

#1 · edited 2024-09-28 18:58:12

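If you need more than 500 revision entries, you have to use the MediaWiki API with action=query, prop=revisions, and the rvcontinue parameter. The value of rvcontinue is taken from the previous response, so you cannot get the whole list with a single request: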

https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Coffee&rvcontinue=...

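To get more specific information of your choice, you also have to set the rvprop parameter, a pipe-separated list of the properties to return for each revision.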
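As a rough sketch of how such a request URL could be assembled in Python: the helper name build_revisions_url is hypothetical, and the rvprop values shown (ids, timestamp, user, comment) are just one possible selection, not necessarily the one the answer originally listed.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"

def build_revisions_url(title, rvprop=("ids", "timestamp", "user", "comment"), rvcontinue=None):
    """Build a prop=revisions query URL (hypothetical helper, for illustration only)."""
    params = {
        "action": "query",
        "format": "xml",
        "prop": "revisions",
        "rvlimit": "500",
        "rvprop": "|".join(rvprop),  # pipe-separated list of properties
        "titles": title,
    }
    if rvcontinue is not None:
        # Continuation value copied from the previous response
        params["rvcontinue"] = rvcontinue
    # urlencode percent-encodes the "|" separators and any special characters
    return API_URL + "?" + urlencode(params)

print(build_revisions_url("Coffee"))
```

Passing the rvcontinue value from a previous response as the third argument yields the URL for the next batch.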

A summary of all available parameters can be found here.

Here is how to get the complete revision history of a Wikipedia page in C#:

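using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Xml.Linq;

// Place this method inside a class.
private static List<XElement> GetRevisions(string pageTitle)
{
    // Escape the title so spaces and special characters survive in the query string
    var url = "https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&rvlimit=500&titles=" + Uri.EscapeDataString(pageTitle);
    var revisions = new List<XElement>();   // all accumulated revisions
    var next = string.Empty;                // continuation suffix for the next request
    while (true)
    {
        using (var webResponse = (HttpWebResponse)WebRequest.Create(url + next).GetResponse())
        {
            using (var reader = new StreamReader(webResponse.GetResponseStream()))
            {
                var xElement = XElement.Parse(reader.ReadToEnd());
                revisions.AddRange(xElement.Descendants("rev"));

                // No 'continue' element means this was the last batch
                var cont = xElement.Element("continue");
                if (cont == null) break;

                next = "&rvcontinue=" + Uri.EscapeDataString(cont.Attribute("rvcontinue").Value);
            }
        }
    }

    return revisions;
}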

Currently, for "Coffee", this returns 10414 revisions.

Edit: here is a Python version:

import re
import urllib.parse
import urllib.request

def GetRevisions(pageTitle):
    # Python 3 version: urllib2 exists only in Python 2
    url = "https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&rvlimit=500&titles=" + urllib.parse.quote(pageTitle)
    revisions = []                                        # list of all accumulated revisions
    next = ''                                             # information for the next request
    while True:
        response = urllib.request.urlopen(url + next).read().decode('utf-8')  # web request, decoded to text
        revisions += re.findall('<rev [^>]*>', response)  # adds all revisions from the current request to the list

        cont = re.search('<continue rvcontinue="([^"]+)"', response)
        if not cont:                                      # break the loop if the 'continue' element is missing
            break

        # continuation value from which to start the next request ("|" must be percent-encoded)
        next = "&rvcontinue=" + urllib.parse.quote(cont.group(1))

    return revisions

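As you can see, the logic is exactly the same. The difference from C# is that in C# I parse the XML response, whereas here I use a regex to match all rev and continue elements in it.

So the idea is to make a main request, from which I collect all revisions (500 at most) into the revisions array. I also check for the continue XML element to find out whether there are more revisions, take the value of its rvcontinue attribute, and store it in the next variable (for the first request in this example it is 20150127211200|644458070) in order to make another request that fetches the next 500 revisions. I repeat this for as long as the continue element is present. If it is missing, there are no more revisions after the last one in the response's revision list, so I exit the loop.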

revisions = GetRevisions("Coffee")

print(len(revisions))
#10418
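The continuation loop described above can also be sketched against the JSON output format, which avoids regex matching on XML. This is only an illustration under a few assumptions: the names fetch_json and iter_revisions are hypothetical, the User-Agent string is a placeholder, and the fetch function is passed in as a parameter so that the loop can be exercised without network access.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_json(params):
    """Send one API request and return the decoded JSON response."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Placeholder User-Agent (an assumption); replace it with your own details
    request = urllib.request.Request(url, headers={"User-Agent": "revision-history-example/0.1"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

def iter_revisions(title, fetch=fetch_json):
    """Yield every revision of `title`, requesting batches until no 'continue' is returned."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "rvlimit": "500",
        "titles": title,
    }
    while True:
        data = fetch(params)
        # In this JSON format, 'pages' is a dict keyed by page ID
        for page in data.get("query", {}).get("pages", {}).values():
            for rev in page.get("revisions", []):
                yield rev
        if "continue" not in data:
            break  # no more batches
        # Carry every continuation value (including rvcontinue) into the next request
        params = {**params, **data["continue"]}
```

Calling list(iter_revisions("Coffee")) would then collect the whole history, one dict per revision.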

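Below are the last 10 revisions of the "Coffee" article (the API returns them in reverse order, newest first). Don't forget that if you need more specific revision information, you can use the rvprop parameter in the request.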

for i in revisions[0:10]:
    print(i)

#<rev revid="698019402" parentid="698018324" user="Termininja" timestamp="2016-01-03T13:51:27Z" comment="short link" />
#<rev revid="698018324" parentid="697691358" user="AXRL" timestamp="2016-01-03T13:39:14Z" comment="/* See also */" />
#<rev revid="697691358" parentid="697690475" user="Zekenyan" timestamp="2016-01-01T05:31:33Z" comment="first coffee trade" />
#<rev revid="697690475" parentid="697272803" user="Zekenyan" timestamp="2016-01-01T05:18:11Z" comment="since country of origin is not first sighting of someone drinking coffee I have removed the origin section completely" />
#<rev revid="697272803" parentid="697272470" minor="" user="Materialscientist" timestamp="2015-12-29T11:13:18Z" comment="Reverted edits by [[Special:Contribs/Media3dd|Media3dd]] ([[User talk:Media3dd|talk]]) to last version by Materialscientist" />
#<rev revid="697272470" parentid="697270507" user="Media3dd" timestamp="2015-12-29T11:09:14Z" comment="/* External links */" />
#<rev revid="697270507" parentid="697270388" minor="" user="Materialscientist" timestamp="2015-12-29T10:45:46Z" comment="Reverted edits by [[Special:Contribs/89.197.43.130|89.197.43.130]] ([[User talk:89.197.43.130|talk]]) to last version by Mahdijiba" />
#<rev revid="697270388" parentid="697265765" user="89.197.43.130" anon="" timestamp="2015-12-29T10:44:02Z" comment="/* See also */" />
#<rev revid="697265765" parentid="697175433" user="Mahdijiba" timestamp="2015-12-29T09:45:03Z" comment="" />
#<rev revid="697175433" parentid="697167005" user="EvergreenFir" timestamp="2015-12-28T19:51:25Z" comment="Reverted 1 pending edit by [[Special:Contributions/2.24.63.78|2.24.63.78]] to revision 696892548 by Zefr: [[WP:CENTURY]]" />
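Since each entry in the list is just the raw text of a rev tag, it may be handy to turn it into a dictionary of attributes. A minimal sketch follows; the helper name parse_rev is hypothetical, and it assumes self-closing tags like the ones printed above (it would not work on tags that carry revision content).

```python
import xml.etree.ElementTree as ET

def parse_rev(rev_tag):
    """Turn one '<rev ... />' string into a dict of its attributes."""
    return dict(ET.fromstring(rev_tag).attrib)

# Sample taken from the listing above
sample = ('<rev revid="698019402" parentid="698018324" user="Termininja" '
          'timestamp="2016-01-03T13:51:27Z" comment="short link" />')
rev = parse_rev(sample)
print(rev["revid"], rev["user"], rev["timestamp"])
```

Flags such as minor or anon appear as keys with an empty string value when present.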

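User

#2 · edited 2024-09-28 18:58:12

If you use pywikibot, you can get a generator that walks through the complete revision history for you. For example, to get a generator that steps through all revisions (including their content) of the page "pagename" on the English Wikipedia, use: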

import pywikibot

site = pywikibot.Site("en", "wikipedia")
page = pywikibot.Page(site, "pagename")
revs = page.revisions(content=True)

There are many more parameters you can apply to the query. You can find the API documentation here.

Of particular note:

revisions(reverse=False, total=None, content=False, rollback=False, starttime=None, endtime=None)
Generator which loads the version history as Revision instances.

pywikibot seems to be the tool many Wikipedia editors use to automate their editing.
