DataHub
Detailed description of the datahub Python project
- datahub is a tool that lets you download/crawl, parse, load, and visualize data faster. It does this by letting you split each step into its own working folder. In each working folder you get an example file in which you can start coding.
- datahub is for people who have found some interesting data source and want to download it, parse it, load it into a database, provide some documentation, and visualize it. datahub speeds up the process by creating a folder for each of those operations. You create all your programs from our basic default template and can move straight on to analyzing the data.
6 Data Privacy Rules
- Sensitive, and possibly inaccurate, information may not be used against people in financial, political, employment, and health-care settings.
- The use of information should not force anybody to hide or protect themselves against improper information use in a way that significantly limits their ability to exercise the right to freedom of association.
- Implement a basic form of information accountability by tracking identifying information: information that identifies a person or corporation and could be used to hold that person or corporation accountable for compliance.
- There should be no restriction on the use of data unless specified by law or by these privacy rules.
- Privacy is protected not by limiting the collection of data, but by placing strict rules on how the data may be used. Data that can be used in financial, political, employment, and health-care settings cannot be used for marketing or other profiling. Strict penalties should be imposed for breaches of these use limitations. Decisions in financial, political, employment, and health-care settings must be justified with reference to the specific data on which the decision was based. If a person or corporation discovers that the data is inaccurate, they may demand that it be corrected. Stiff financial penalties should be imposed on any agency that does not make the appropriate corrections.
- Achieve greater information accountability by making better use of the information that is collected and by retaining the data necessary to hold data users responsible for policy compliance. Build systems that encourage compliance and maximize the possibility of holding violators accountable. Technology should support the rules: users comply because they are aware of what the rules are and because they know there will be consequences, after the fact.
Installing datahub
The best way to get started with datahub is to install it as follows. Install virtualenv to keep the installation in its own directory:
virtualenv --no-site-packages datahubENV
New python executable in datahubENV/bin/python
Installing setuptools............done.
source datahubENV/bin/activate
Creating a datahub-based project
datahub is a Paste template, so you run it as follows:
paster create --list-templates
paster create -t datahub
You should see something like this:
paster create -t datahub
Selected and implied templates:
PasteScript#basic_package  A basic setuptools-enabled package
datahub#datahub            DataHub is a tool to help you datamine (crawl, parse, and load) any data.

Enter project name: myproject
Variables:
  egg:      myproject
  package:  myproject
  project:  myproject
Enter version (Version (like 0.1)) ['']: 0.1
Enter description (One-line description of the package) ['']: my project
Enter long_description (Multi-line description (in reST)) ['']: this is long description
Enter keywords (Space-separated keywords/tags) ['']: datahub dataprocess gov
Enter author (Author name) ['']: myname
Enter author_email (Author email) ['']: myemail
Enter url (URL of homepage) ['']: mywebsite
Enter license_name (License name) ['']: gpl
Enter zip_safe (True/False: if the package can be distributed as a .zip file) [False]:
Creating template basic_package
Creating directory ./myproject
  Recursing into +package+
    Creating ./myproject/myproject/
    Copying __init__.py to ./myproject/myproject/__init__.py
  Copying setup.cfg to ./myproject/setup.cfg
  Copying setup.py_tmpl to ./myproject/setup.py
Creating template datahub
  Recursing into +package+
    Copying README.txt_tmpl to ./myproject/myproject/README.txt
    Recursing into crawl
      Creating ./myproject/myproject/crawl/
      Copying Readme.txt_tmpl to ./myproject/myproject/crawl/Readme.txt
      Copying __init__.py to ./myproject/myproject/crawl/__init__.py
      Copying crawl.sh to ./myproject/myproject/crawl/crawl.sh
      Copying download.sh to ./myproject/myproject/crawl/download.sh
      Copying download_list.txt_tmpl to ./myproject/myproject/crawl/download_list.txt
      Copying harvestman-+package+.xml to ./myproject/myproject/crawl/harvestman-myproject.xml
    Recursing into hdf5
      Creating ./myproject/myproject/hdf5/
      Copying READEM_hdf5.txt_tmpl to ./myproject/myproject/hdf5/READEM_hdf5.txt
      Copying __init__.py to ./myproject/myproject/hdf5/__init__.py
    Recursing into load
      Creating ./myproject/myproject/load/
      Copying __init__.py to ./myproject/myproject/load/__init__.py
      Copying load.py to ./myproject/myproject/load/load.py
      Copying load.sh to ./myproject/myproject/load/load.sh
      Copying model.py to ./myproject/myproject/load/model.py
    Recursing into parse
      Creating ./myproject/myproject/parse/
      Copying __init__.py to ./myproject/myproject/parse/__init__.py
      Copying parse.sh_tmpl to ./myproject/myproject/parse/parse.sh
    Copying process.sh_tmpl to ./myproject/myproject/process.sh
    Recursing into wiki
      Creating ./myproject/myproject/wiki/
      Copying REAME.wiki_tmpl to ./myproject/myproject/wiki/REAME.wiki
Running /home/lucas/tmp/lmENV/bin/python setup.py egg_info
Manually creating paster_plugins.txt (deprecated! pass a paster_plugins keyword to setup() instead)
Adding datahub to paster_plugins.txt
Go into the myproject folder and start coding. The folder structure looks like this:
myproject
|-- myproject
|   |-- README.txt
|   |-- __init__.py
|   |-- crawl
|   |   |-- Readme.txt
|   |   |-- __init__.py
|   |   |-- crawl.sh
|   |   |-- download.sh
|   |   |-- download_list.txt
|   |   `-- harvestman-myproject.xml
|   |-- hdf5
|   |   |-- READEM_hdf5.txt
|   |   `-- __init__.py
|   |-- load
|   |   |-- __init__.py
|   |   |-- load.py
|   |   |-- load.sh
|   |   `-- model.py
|   |-- parse
|   |   |-- __init__.py
|   |   `-- parse.sh
|   |-- process.sh
|   `-- wiki
|       `-- REAME.wiki
|-- myproject.egg-info
|   |-- PKG-INFO
|   |-- SOURCES.txt
|   |-- dependency_links.txt
|   |-- entry_points.txt
|   |-- not-zip-safe
|   |-- paster_plugins.txt
|   `-- top_level.txt
|-- setup.cfg
`-- setup.py
Focusing on your data project
Crawl
The crawl folder is where data is crawled. You have two options for downloading, and there are pre-built files for each, so proceed as follows:
wget
With wget you can download files as long as the list of files is not too large. There is a download_list.txt that holds the URLs you want to download. You can specify wildcards such as *.zip, *.pdf, *.txt, and so on. download.sh is a shell script that calls wget and downloads the files. By default it only downloads files that are newer than the ones you already have, and it only fetches the missing parts. This saves bandwidth, because the whole file is not re-downloaded every time.
You only need to edit download_list.txt:
cd crawl
# Edit download_list.txt and add the URLs of the files you want to download
vi download_list.txt
sh download.sh
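If wget is not available, the download step can be sketched in Python instead. This helper is hypothetical and not part of the template: it reads a download_list.txt-style file and skips files that already exist locally, as a rough stand-in for wget's only-newer behaviour.

```python
"""Hypothetical Python equivalent of crawl/download.sh (not shipped
with the template)."""
import os
import urllib.request
from urllib.parse import urlparse

def read_download_list(path):
    """Return the URLs listed in a download_list.txt-style file,
    ignoring blank lines and '#' comments."""
    urls = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                urls.append(line)
    return urls

def download_all(list_path, dest_dir="."):
    """Download every URL in the list into dest_dir, skipping files
    that are already present (a crude stand-in for wget -N)."""
    for url in read_download_list(list_path):
        name = os.path.basename(urlparse(url).path) or "index.html"
        target = os.path.join(dest_dir, name)
        if os.path.exists(target):
            continue  # already downloaded; do not fetch again
        urllib.request.urlretrieve(url, target)
```

Unlike wget, this sketch does not resume partial downloads or compare timestamps; it only checks whether the target file exists.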
Your second option is HarvestMan; see its documentation.
Parse
parse is where you parse the files. This is the gray area that you control: it can be as simple as unzipping files or writing a small script to replace some names, or as involved as writing an extensive parsing program. It all depends on the project's data. Add your code to parse.sh, or write your own parser and add the command that runs it to parse.sh, so that later you only have to run:
sh parse.sh
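What a parse step looks like depends entirely on your data, but as an illustration, here is a minimal parser in the spirit of the template. The pipe-delimited input format and the parse_file helper are invented for this sketch; the template itself only ships parse.sh.

```python
"""Hypothetical parse-step script: convert a raw pipe-delimited dump
into a clean CSV that the load step can pick up. The input format is
an assumption for illustration only."""
import csv

def parse_file(raw_path, csv_path):
    """Read 'name|value' lines from raw_path and write them as CSV
    rows with a header to csv_path. Malformed lines are skipped."""
    rows = []
    with open(raw_path) as fh:
        for line in fh:
            parts = [p.strip() for p in line.split("|")]
            if len(parts) == 2 and all(parts):
                rows.append(parts)
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["name", "value"])
        writer.writerows(rows)
```

A script like this would be called from parse.sh so the whole step stays a single `sh parse.sh` command.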
Load
load is where you load the data into the database. There is a load.py file with an example four-column database structure; you can use that file as a starting point. It has everything from defining new columns and setting up the database to reading the CSV files in the parse folder and uploading them to the database. Read through the load.sh and load.py files and adapt them to your project wherever they say [change]: those are the places where you change names, add columns, and tell it where the files are. When everything is done, run it:
sh load.sh
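The template's load.py targets its own database setup; as a rough sketch of the same idea, here is a loader using the stdlib sqlite3 module. The records table and its column names are hypothetical placeholders for the parts marked [change] in the template.

```python
"""Sketch of a load step using sqlite3 (the real load.py/model.py may
target a different database; table and column names are assumptions)."""
import csv
import sqlite3

def load_csv(csv_path, db_path="myproject.db"):
    """Create a simple four-column table and insert every row of the
    CSV produced by the parse step. Returns the number of rows loaded."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS records (
               id INTEGER PRIMARY KEY,
               name TEXT,
               value TEXT,
               source TEXT)"""
    )
    with open(csv_path, newline="") as fh:
        reader = csv.DictReader(fh)
        rows = [(r["name"], r["value"], csv_path) for r in reader]
    conn.executemany(
        "INSERT INTO records (name, value, source) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    conn.close()
    return len(rows)
```

As in the template, the command that runs this loader would live in load.sh so the step stays a single `sh load.sh`.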
process.sh
Above all the folders, the most important file is process.sh. This file has a built-in structure that goes into the crawl folder and starts crawl.sh, then goes into parse and runs parse.sh, and then goes into load and runs load.sh. With this one file you control the whole process. When everything is in place, a user can take your project, install any necessary programs, and simply run:
sh process.sh
This will crawl, parse, and load the data.
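The orchestration that process.sh performs can also be sketched in Python (a hypothetical equivalent, not shipped with the template): run each step's shell script inside its own folder, in order, stopping on the first failure.

```python
"""Hypothetical Python equivalent of process.sh."""
import subprocess

def run_pipeline(project_dir, steps=("crawl", "parse", "load")):
    """For each step folder, run its <step>.sh script with that
    folder as the working directory. check=True stops the pipeline
    on the first failing step."""
    for step in steps:
        subprocess.run(
            ["sh", step + ".sh"],
            cwd=f"{project_dir}/{step}",
            check=True,
        )
```

Keeping the orchestration in one place, whether as process.sh or a runner like this, is what lets a user reproduce the whole crawl/parse/load pipeline with a single command.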
Enjoy.