Fetching a large number of (but not all) Wikipedia pages

# Get 10000 random pages from Wikipedia.
import urllib2
import os
import shutil

# Make the directory to store the HTML pages.
print "Deleting the old randompages directory"
shutil.rmtree('randompages')
print "Created the directory for storing the pages"
os.mkdir('randompages')

num_page = raw_input('Number of pages to retrieve:: ')
for i in range(0, int(num_page)):
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    infile = opener.open('http://en.wikipedia.org/wiki/Special:Random')
    page = infile.read()

    # Write it to a file.
    # TODO: Strip HTML from page
    f = open('randompages/file' + str(i) + '.html', 'w')
    f.write(page)
    f.close()
    print "Retrieved and saved page", i + 1

3 Answers

User

#1 · edited 2024-10-01 09:29:33

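I would go the opposite way: start with the XML dump, then throw away what you don't want.

In your case, if you are looking to do natural language processing, I would assume you are interested in pages that have complete sentences, not lists of links. If you spider the links in the manner you describe, you will hit a lot of link pages.

And why avoid XML, when XML parsing tools would make your selection process easier?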
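To make the dump-first idea concrete, here is a minimal Python 3 sketch that streams pages out of a MediaWiki XML export and keeps only those that look like prose. The function names, the tiny `SAMPLE_DUMP`, and the `min_sentences` heuristic (counting occurrences of ". ") are assumptions made for illustration, not part of the answer above; a real NLP pipeline would want a better filter.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a real dump (hypothetical content, same element layout).
SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>Prose</title><ns>0</ns><revision><text>One. Two. Three. Four. </text></revision></page>
<page><title>List of links</title><ns>0</ns><revision><text>* [[A]]
* [[B]]</text></revision></page>
<page><title>Talk:Prose</title><ns>1</ns><revision><text>A. B. C. D. E. </text></revision></page>
</mediawiki>"""


def local_name(tag):
    """Strip the XML namespace prefix, e.g. '{uri}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(source, min_sentences=3):
    """Yield (title, wikitext) for main-namespace pages that look like prose.

    `source` is a filename or a binary file object containing the dump.
    `min_sentences` is a crude, assumed heuristic: it counts '. ' in the text.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = ns = text = None
        is_redirect = False
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "ns":
                ns = child.text
            elif name == "redirect":
                is_redirect = True
            elif name == "text":
                text = child.text or ""
        elem.clear()  # drop the page's children so memory use stays bounded
        if ns != "0" or is_redirect or text is None:
            continue
        if text.count(". ") >= min_sentences:
            yield title, text


titles = [title for title, _text in iter_articles(io.BytesIO(SAMPLE_DUMP))]
print(titles)
```

For a real dump, which is normally distributed as a `.xml.bz2` file, passing `bz2.open(path, "rb")` as `source` should avoid decompressing the whole file to disk first.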

User

#2 · edited 2024-10-01 09:29:33

for i = 1 to 10000
    get "http://en.wikipedia.org/wiki/Special:Random"

User

#3 · edited 2024-10-01 09:29:33

Wikipedia has an API. With it you can get random articles from a given namespace:

http://en.wikipedia.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=5
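As a rough Python 3 sketch of calling that endpoint: the helper names and the User-Agent string below are made up for illustration, `format=json` is added so the response is machine-readable, and `https` is used instead of `http`. The parsing assumes the usual JSON layout in which the titles sit under `query` then `random`.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def random_query_url(limit=5, namespace=0):
    """Build the list=random query URL shown above, asking for JSON."""
    params = {
        "action": "query",
        "list": "random",
        "rnnamespace": namespace,
        "rnlimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_random_titles(payload):
    """Extract page titles from a JSON response to the query above."""
    data = json.loads(payload)
    return [entry["title"] for entry in data["query"]["random"]]


def fetch_random_titles(limit=5):
    """Perform the request (needs network access)."""
    request = urllib.request.Request(
        random_query_url(limit),
        # Hypothetical identifier; use one that describes your own script.
        headers={"User-Agent": "random-pages-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as response:
        return parse_random_titles(response.read().decode("utf-8"))
```

Calling `fetch_random_titles(5)` in a loop would collect titles in batches, which is gentler on the servers than requesting `Special:Random` once per page.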

For each article returned, you can also request its wikitext through the same API.
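One way to request wikitext, sketched in Python 3 on the assumption that the `prop=revisions` query with `rvprop=content` is used: the helper names are hypothetical, and `rvslots=main` and `formatversion=2` are choices made here to get a simpler JSON layout, not something the answer specifies.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def wikitext_query_url(titles):
    """Build a prop=revisions query for the current wikitext of `titles`."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": "|".join(titles),  # the API accepts several titles per call
        "format": "json",
        "formatversion": "2",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_wikitext(payload):
    """Map each page title to its wikitext; pages without revisions are skipped."""
    data = json.loads(payload)
    result = {}
    for page in data["query"]["pages"]:
        revisions = page.get("revisions")
        if revisions:
            result[page["title"]] = revisions[0]["slots"]["main"]["content"]
    return result


def fetch_wikitext(titles):
    """Perform the request (needs network access)."""
    request = urllib.request.Request(
        wikitext_query_url(titles),
        # Hypothetical identifier; use one that describes your own script.
        headers={"User-Agent": "random-pages-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as response:
        return parse_wikitext(response.read().decode("utf-8"))
```

The titles returned by the random-article query can be passed straight to `fetch_wikitext`, in batches small enough to stay within the API's per-request title limit.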
