A multiprocessing web-scraping application that crawls wiki pages and finds the minimum number of links between two given wiki pages.
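wikilink is a multiprocessing web-scraping application that crawls wiki pages, extracts URLs, and finds the minimum number of links between two given wiki pages.
I briefly discuss the motivation for the project and give an overview of it in my blog.
The project is currently at version v0.3.0.post1. For details of the release history, see the change log.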
Usage
Install with pip
$ pip install wikilink
Database support
wikilink currently supports MySQL and PostgreSQL.
API
setup_db(db, username, password, ip="127.0.0.1", port=3306): set up the database connection
Args:
db(str): database engine, currently supports "mysql" and "postgresql"
username(str): database username
password(str): database password
ip(str): IP address of the database (default="127.0.0.1")
port(str): port that the database is running on (default=3306)
Returns:
None
min_link(source, destination, limit=6, multiprocessing=False): find the minimum number of links from the source URL to the destination URL within the limit
Args:
source(str): source wiki URL, e.g. "https://en.wikipedia.org/wiki/Cristiano_Ronaldo"
destination(str): destination wiki URL, e.g. "https://en.wikipedia.org/wiki/Lionel_Messi"
limit(int): maximum number of links from the source that will be considered (default=6)
multiprocessing(boolean): enable/disable multiprocessing mode (default=False)
Returns:
(int) minimum number of links separating the source and destination URLs
Returns None and prints a message if the limit is exceeded or no path is found
Raises:
DisconnectionError: error connecting to the database
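The behaviour described for min_link (fewest links within a limit, None when nothing is found) matches a breadth-first search over the link graph. The sketch below is an illustration only and not wikilink's actual implementation: min_link_sketch and the in-memory graph dict are hypothetical stand-ins for crawled pages, and returning 0 when source equals destination is an assumption.

```python
from collections import deque


def min_link_sketch(graph, source, destination, limit=6):
    """Return the fewest link hops from source to destination, or None.

    `graph` is a hypothetical dict mapping a page to the pages it links to.
    The real project crawls pages and stores links in a database instead.
    """
    if source == destination:
        return 0  # assumption: a page is zero links away from itself
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == limit:
            continue  # do not follow links beyond the limit
        for linked in graph.get(page, ()):
            if linked == destination:
                return depth + 1
            if linked not in visited:
                visited.add(linked)
                queue.append((linked, depth + 1))
    return None  # no path within the limit
```

Because breadth-first search visits pages in order of increasing distance, the first time the destination appears is along a shortest path, which is why the function can return immediately.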
Example
>>> from wikilink import WikiLink
>>> app = WikiLink()
>>> app.setup_db("mysql", "root", "12345", "127.0.0.1", "3306")
>>> source = "https://en.wikipedia.org/wiki/Cristiano_Ronaldo"
>>> destination = "https://en.wikipedia.org/wiki/Lionel_Messi"
>>> app.min_link(source, destination, 6)
1
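Each step of such a search needs the outgoing wiki links of a page. As a rough illustration of the URL extraction step mentioned in the project description, the sketch below uses only the Python standard library. WikiLinkParser and extract_wiki_links are hypothetical names, and the filtering rule (keep "/wiki/" paths without a namespace colon) is an assumption rather than wikilink's own logic.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class WikiLinkParser(HTMLParser):
    """Collect absolute URLs of article links found in anchor tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: article links start with "/wiki/" and namespaced
        # pages such as "File:" or "Help:" contain a colon.
        if href and href.startswith("/wiki/") and ":" not in href:
            url = urljoin(self.base_url, href)
            if url not in self.links:  # skip duplicates, keep page order
                self.links.append(url)


def extract_wiki_links(html, base_url="https://en.wikipedia.org"):
    """Return the unique article URLs linked from an HTML string."""
    parser = WikiLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

The function takes HTML as a string, so fetching the page (and any rate limiting) is left to the caller.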
Contributing ![Open Source Helpers](https://warehouse-camo.cmh1.psfhosted.org/08d6da5972b1ca05bfd45148badbed8e5250a05e/68747470733a2f2f7777772e636f64657472696167652e636f6d2f7472616e6c7976752f77696b692d6c696e6b2f6261646765732f75736572732e737667)
How to contribute
Please follow the contribution conventions described in our contribution instructions and code of conduct.
To set up a development environment, simply run:
$ pip install -r requirements.txt
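Please check the issue file for a list of issues that need help.
Appreciation
Feel free to add your name to the list of contributors. You will automatically be added to the Hall of Fame as a token of my appreciation for your contribution.
Hall of Fame
License
See the LICENSE file for license rights and limitations (Apache License 2.0).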