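<p>If you need more than 500 revision entries, you have to use the <a href="https://en.wikipedia.org/w/api.php?action=help&modules=query%2Brevisions" rel="noreferrer">MediaWiki API</a> with action <strong>query</strong>, property <strong>revisions</strong>, and the parameter <strong>rvcontinue</strong>. Its value is taken from the previous response, so you cannot get the whole list with a single request:</p>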
<pre><code>https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Coffee&rvcontinue=...
</code></pre>
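<p>To get more specific information of your choice, you also have to use the <strong>rvprop</strong> parameter, which takes a pipe-separated list of the revision properties to return.</p>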
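<p>As a rough illustration of how these parameters fit together, here is a minimal sketch that builds such a request URL. The helper name <code>build_revisions_url</code> and the default property list are assumptions made for this example, not part of the API.</p>

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"

def build_revisions_url(title, props=("ids", "timestamp", "user", "comment"),
                        limit=500, rvcontinue=None):
    """Hypothetical helper: assemble a prop=revisions query URL."""
    params = {
        "action": "query",
        "format": "xml",
        "prop": "revisions",
        "titles": title,
        "rvlimit": limit,
        "rvprop": "|".join(props),  # rvprop values are separated by '|'
    }
    if rvcontinue:
        # Continuation value taken from the previous response
        params["rvcontinue"] = rvcontinue
    # urlencode percent-encodes '|' as %7C and escapes the title
    return API_URL + "?" + urlencode(params)

first_url = build_revisions_url("Coffee")
next_url = build_revisions_url("Coffee", rvcontinue="20150127211200|644458070")
print(first_url)
print(next_url)
```

<p>Letting <code>urlencode</code> assemble the query string avoids hand-escaping the <code>|</code> separators and titles that contain spaces.</p>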
<p>A summary of all available parameters can be found <a href="https://www.mediawiki.org/wiki/API:Revisions" rel="noreferrer">here</a>.</p>
<p>Here is how to get the complete revision history of a Wikipedia page in C#:</p>
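<pre><code>// Needs: System, System.Collections.Generic, System.IO, System.Net, System.Xml.Linq
private static List<XElement> GetRevisions(string pageTitle)
{
    // Base request: up to 500 revisions per response, returned as XML
    var url = "https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&rvlimit=500&titles="
        + Uri.EscapeDataString(pageTitle);
    var revisions = new List<XElement>();
    var next = string.Empty;
    while (true)
    {
        var request = (HttpWebRequest)WebRequest.Create(url + next);
        // Wikimedia's User-Agent policy asks for a descriptive value; this one is a placeholder
        request.UserAgent = "RevisionHistoryExample/1.0";
        using (var webResponse = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(webResponse.GetResponseStream()))
        {
            var xElement = XElement.Parse(reader.ReadToEnd());
            revisions.AddRange(xElement.Descendants("rev"));

            // No 'continue' element means this was the last batch
            var cont = xElement.Element("continue");
            if (cont == null) break;
            next = "&rvcontinue=" + Uri.EscapeDataString(cont.Attribute("rvcontinue").Value);
        }
    }
    return revisions;
}
</code></pre>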
<p>Currently, for <em>“Coffee”</em>, this returns <strong>10414</strong> revisions.</p>
<hr/>
<p><strong>Edit:</strong> Here is the Python version:</p>
<pre><code>import re
import urllib.parse
import urllib.request

def GetRevisions(pageTitle):
    url = "https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&rvlimit=500&titles=" + urllib.parse.quote(pageTitle)
    # Wikimedia's User-Agent policy asks for a descriptive value; this one is a placeholder
    headers = {"User-Agent": "RevisionHistoryExample/1.0"}
    revisions = []  # list of all accumulated revisions
    next = ''       # continuation parameter for the next request
    while True:
        request = urllib.request.Request(url + next, headers=headers)
        response = urllib.request.urlopen(request).read().decode("utf-8")  # web request
        revisions += re.findall('<rev [^>]*>', response)  # add all revisions from the current response to the list
        cont = re.search('<continue rvcontinue="([^"]+)"', response)
        if not cont:  # break the loop if the 'continue' element is missing
            break
        next = "&rvcontinue=" + urllib.parse.quote(cont.group(1))  # where the next request should start
    return revisions
</code></pre>
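<p>The logic is exactly the same. The only difference from C# is that in C# I parse the XML response, whereas here I use a regex to match all <code>rev</code> and <code>continue</code> elements in it.</p>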
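<p>If you would rather parse the XML in Python as well, something along these lines should work with the standard library. This is only a sketch: the <code>SAMPLE</code> string is a hand-trimmed stand-in for a real response (not actual API output), and <code>parse_response</code> is a name made up for this example.</p>

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like an API response; not real API output
SAMPLE = """<api>
  <continue rvcontinue="20150127211200|644458070" continue="||" />
  <query><pages><page title="Coffee"><revisions>
    <rev revid="698019402" parentid="698018324" user="Termininja" timestamp="2016-01-03T13:51:27Z" comment="short link" />
    <rev revid="698018324" parentid="697691358" user="AXRL" timestamp="2016-01-03T13:39:14Z" comment="/* See also */" />
  </revisions></page></pages></query>
</api>"""

def parse_response(xml_text):
    """Return (list of revision attribute dicts, rvcontinue value or None)."""
    root = ET.fromstring(xml_text)
    # Every 'rev' element carries the revision data as attributes
    revisions = [dict(rev.attrib) for rev in root.iter("rev")]
    # 'continue' is a direct child of the root and absent on the last batch
    cont = root.find("continue")
    token = cont.get("rvcontinue") if cont is not None else None
    return revisions, token

revs, token = parse_response(SAMPLE)
print(len(revs), token)
```

<p>Compared with the regex, this gives you each revision as a dictionary and also handles escaped characters in comments.</p>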
<p>So the idea is to make a <a href="https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&rvlimit=500&titles=Coffee" rel="noreferrer">main request</a>, from which I get all revisions (the maximum is 500) into the <code>revisions</code> array. I also check for the <code>continue</code> XML element to find out whether there are more revisions, take the value of its <code>rvcontinue</code> attribute, and use it in the <code>next</code> variable (for the first request in this example it is <code>20150127211200|644458070</code>) to make <a href="https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&rvlimit=500&titles=Coffee&rvcontinue=20150127211200%7C644458070" rel="noreferrer">another request</a> that fetches the next 500 revisions. I repeat this as long as the <code>continue</code> element is present. If it is missing, there are no more revisions after the last one in the response, so I exit the loop.</p>
<pre><code>revisions = GetRevisions("Coffee")
print(len(revisions))
#10418
</code></pre>
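<p>To try the continuation loop without hitting the network, the HTTP call can be passed in as a function. The sketch below assumes that design; <code>get_all_revisions</code>, the <code>fetch</code> hook, and the canned responses in <code>fake_fetch</code> are all invented for illustration.</p>

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode, urlparse, parse_qs

API_URL = "https://en.wikipedia.org/w/api.php"

def get_all_revisions(title, fetch, limit=500):
    """Follow rvcontinue until no 'continue' element is returned.

    fetch is a hypothetical hook: a callable taking a URL and
    returning the response body as text.
    """
    params = {"action": "query", "format": "xml", "prop": "revisions",
              "titles": title, "rvlimit": limit}
    revisions = []
    while True:
        root = ET.fromstring(fetch(API_URL + "?" + urlencode(params)))
        revisions.extend(dict(rev.attrib) for rev in root.iter("rev"))
        cont = root.find("continue")
        if cont is None:  # last batch reached
            break
        params["rvcontinue"] = cont.get("rvcontinue")
    return revisions

def fake_fetch(url):
    """Stand-in for a real HTTP call: serves two canned batches."""
    query = parse_qs(urlparse(url).query)
    if "rvcontinue" not in query:
        return ('<api><continue rvcontinue="t|2" continue="||" />'
                '<query><pages><page><revisions>'
                '<rev revid="3" /><rev revid="2" />'
                '</revisions></page></pages></query></api>')
    return ('<api><query><pages><page><revisions>'
            '<rev revid="1" />'
            '</revisions></page></pages></query></api>')

all_revs = get_all_revisions("Coffee", fake_fetch)
print([r["revid"] for r in all_revs])
```

<p>For real use, <code>fetch</code> could be a small wrapper around <code>urllib.request.urlopen</code> that decodes the response body to text.</p>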
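<p>Here are the 10 most recent revisions of the <em>“Coffee”</em> article (the API returns them newest first). Don't forget that if you need more specific revision information, you can use the <code>rvprop</code> parameter in the request.</p>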
<pre><code>for i in revisions[0:10]:
    print(i)
#<rev revid="698019402" parentid="698018324" user="Termininja" timestamp="2016-01-03T13:51:27Z" comment="short link" />
#<rev revid="698018324" parentid="697691358" user="AXRL" timestamp="2016-01-03T13:39:14Z" comment="/* See also */" />
#<rev revid="697691358" parentid="697690475" user="Zekenyan" timestamp="2016-01-01T05:31:33Z" comment="first coffee trade" />
#<rev revid="697690475" parentid="697272803" user="Zekenyan" timestamp="2016-01-01T05:18:11Z" comment="since country of origin is not first sighting of someone drinking coffee I have removed the origin section completely" />
#<rev revid="697272803" parentid="697272470" minor="" user="Materialscientist" timestamp="2015-12-29T11:13:18Z" comment="Reverted edits by [[Special:Contribs/Media3dd|Media3dd]] ([[User talk:Media3dd|talk]]) to last version by Materialscientist" />
#<rev revid="697272470" parentid="697270507" user="Media3dd" timestamp="2015-12-29T11:09:14Z" comment="/* External links */" />
#<rev revid="697270507" parentid="697270388" minor="" user="Materialscientist" timestamp="2015-12-29T10:45:46Z" comment="Reverted edits by [[Special:Contribs/89.197.43.130|89.197.43.130]] ([[User talk:89.197.43.130|talk]]) to last version by Mahdijiba" />
#<rev revid="697270388" parentid="697265765" user="89.197.43.130" anon="" timestamp="2015-12-29T10:44:02Z" comment="/* See also */" />
#<rev revid="697265765" parentid="697175433" user="Mahdijiba" timestamp="2015-12-29T09:45:03Z" comment="" />
#<rev revid="697175433" parentid="697167005" user="EvergreenFir" timestamp="2015-12-28T19:51:25Z" comment="Reverted 1 pending edit by [[Special:Contributions/2.24.63.78|2.24.63.78]] to revision 696892548 by Zefr: [[WP:CENTURY]]" />
</code></pre>
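<p>Since the regex version returns each revision as a raw tag string, you may want to turn those strings into records and reorder them. Here is one possible sketch, assuming the tags are self-closing as in the output above (which would not hold if the revision content itself were requested); <code>to_record</code> is a hypothetical helper. The API also has an <code>rvdir</code> parameter for requesting oldest-first order directly.</p>

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Two of the tag strings printed above, newest first as the API returns them
raw = [
    '<rev revid="698019402" parentid="698018324" user="Termininja" timestamp="2016-01-03T13:51:27Z" comment="short link" />',
    '<rev revid="698018324" parentid="697691358" user="AXRL" timestamp="2016-01-03T13:39:14Z" comment="/* See also */" />',
]

def to_record(tag):
    """Hypothetical helper: parse one self-closing 'rev' tag into a dict."""
    attrs = dict(ET.fromstring(tag).attrib)
    # Timestamps come as ISO 8601 in UTC, e.g. 2016-01-03T13:51:27Z
    attrs["timestamp"] = datetime.strptime(attrs["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    return attrs

records = [to_record(tag) for tag in raw]
oldest_first = sorted(records, key=lambda r: r["timestamp"])
for r in oldest_first:
    print(r["revid"], r["user"], r["timestamp"].isoformat())
```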