Create import CSVs for a Neo4j graph of Wikipedia pages
Detailed description of the wiki2neo Python project
wiki2neo
Generates Neo4j import CSVs from Wikipedia database dumps, to build a graph of the links between Wikipedia pages.
Installation
$ pip install wiki2neo
Usage
Usage: wiki2neo [OPTIONS] [WIKI_XML_INFILE]
Parse Wikipedia pages-articles-multistream.xml dump into two Neo4j import
CSV files:
Node (Page) import, headers=["title:ID", "id"]
Relationships (Links) import, headers=[":START_ID", ":END_ID"]
Reads from stdin by default, pass [WIKI_XML_INFILE] to read from file.
Options:
-p, --pages-outfile FILENAME Node (Pages) CSV output file [default: pages.csv]
-l, --links-outfile FILENAME Relationships (Links) CSV output file [default: links.csv]
--help Show this message and exit.
Import resulting CSVs into Neo4j:
$ neo4j-admin import --nodes:Page pages.csv \
--relationships:LINKS_TO links.csv \
--ignore-duplicate-nodes --ignore-missing-nodes --multiline-fields
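To make the two file layouts concrete, here is a minimal sketch that writes CSVs with the header rows listed in the usage text. It is not taken from wiki2neo itself: the function name `write_import_csvs`, the sample titles, and the sample ids are hypothetical. It also assumes that, because `title:ID` marks the title as the node identifier, the `:START_ID` and `:END_ID` columns hold page titles.

```python
import csv
import io

# Header rows as given in the usage text.
PAGE_HEADERS = ["title:ID", "id"]
LINK_HEADERS = [":START_ID", ":END_ID"]


def write_import_csvs(pages, links, pages_file, links_file):
    """Write node and relationship CSVs (hypothetical helper, not wiki2neo's API).

    pages: iterable of (title, page_id) tuples
    links: iterable of (source_title, target_title) tuples
    """
    page_writer = csv.writer(pages_file)
    page_writer.writerow(PAGE_HEADERS)
    page_writer.writerows(pages)

    link_writer = csv.writer(links_file)
    link_writer.writerow(LINK_HEADERS)
    link_writer.writerows(links)


# Illustrative data only; real titles and ids come from the dump.
pages_buf, links_buf = io.StringIO(), io.StringIO()
write_import_csvs(
    pages=[("Graph theory", 12345), ("Leonhard Euler", 678)],
    links=[("Graph theory", "Leonhard Euler")],
    pages_file=pages_buf,
    links_file=links_buf,
)
print(pages_buf.getvalue())
print(links_buf.getvalue())
```

With this layout, a link row whose target title has no matching row in the pages file points at a missing node, which is presumably why the import command passes --ignore-missing-nodes.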
Wikipedia dumps are downloaded in compressed xml.bz2 format. The simplest usage is to pipe the decompressed output directly into wiki2neo:
$ bzcat pages-articles-multistream.xml.bz2 | wiki2neo
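For readers curious what such a pipeline involves, the sketch below streams a bz2-compressed dump in Python and pulls out each page's title, id, and outgoing wikilinks. It is an illustration under stated assumptions, not wiki2neo's actual implementation: `iter_pages`, `local_name`, and the `WIKILINK` pattern are hypothetical names, the link regex is deliberately simplistic, and the sample XML is invented for the demo.

```python
import bz2
import io
import re
import xml.etree.ElementTree as ET

# Matches [[Target]], [[Target|label]] and [[Target#Section]];
# captures only the target title. A simplification of real wikitext.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)")


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_stream):
    """Yield (title, page_id, link_targets) for each <page> element."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = page_id = None
        text = ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "id" and page_id is None:
                page_id = child.text  # the first <id> belongs to the page
            elif name == "text":
                text = child.text or ""
        links = [target.strip() for target in WIKILINK.findall(text)]
        yield title, page_id, links
        elem.clear()  # drop the parsed subtree to limit memory use


# Invented sample standing in for a real dump.
sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Graph theory</title>
    <id>1</id>
    <revision><id>99</id><text>Founded by [[Leonhard Euler|Euler]]; see [[Graph (mathematics)#Definition]].</text></revision>
  </page>
</mediawiki>"""

with bz2.open(io.BytesIO(bz2.compress(sample))) as stream:
    pages = list(iter_pages(stream))

print(pages)
```

To run it against a real download, `bz2.open` can be given the path of the .xml.bz2 file instead of the in-memory buffer. Parsing incrementally with `iterparse` avoids loading the whole dump into memory, which matters for files of this size.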