


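  • datahub is a tool that lets you download/crawl, parse, load, and visualize data more quickly. It does this by letting you split each step into its own working folder. Each working folder comes with a sample file in which you can start coding.
  • datahub is for people who have found an interesting data source and want to download it, parse it, load it into a database, document it, and visualize it. datahub speeds up the process by creating a folder for each of these operations. You create all your programs from the basic default template and can move straight on to analyzing the data.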


  1. Sensitive, and possibly inaccurate, information may not be used against people in financial, political, employment, and health-care settings.
  2. No one should be forced to hide or protect themselves against improper use of information that significantly limits a person's ability to exercise the right to freedom of association.
  3. Implement a basic form of information accountability by tracking the information that identifies a person or corporation and could be used to hold that person or corporation accountable for compliance.
  4. There should be no restriction on the use of data unless specified by law and these privacy rules.
  5. Privacy is protected not by limiting the collection of data, but by placing strict rules on how the data may be used. Data that can be used in financial, political, employment, and health-care settings cannot be used for marketing and other profiling. Strict penalties should be imposed for breaches of these use limitations. Decisions in financial, political, employment, and health-care settings must be justified with reference to the specific data on which the decision was based. If the person or corporation discovers that the data is inaccurate, they may demand that it be corrected. Stiff financial penalties should be imposed on any agency that does not make the appropriate corrections.
  6. Achieve greater information accountability only by making better use of the information that is collected, retaining the data that is necessary to hold data users responsible for policy compliance. Build a system that encourages compliance and maximizes the possibility of accountability for violations. Technology should supplant the rules because users are aware of what they are and because they know there will be consequences after the fact.
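
Rules 3, 5, and 6 can be pictured as a usage log that records every use of a dataset and flags the uses the rules forbid. The following is only an illustrative sketch, not part of datahub: the UsageLog class, its field names, and the purpose labels are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Purposes that rule 5 rules out for sensitive data (hypothetical labels).
FORBIDDEN_PURPOSES = {"marketing", "profiling"}


@dataclass
class UsageLog:
    """Append-only record of who used which dataset, and why."""

    entries: list = field(default_factory=list)

    def record(self, user, dataset, purpose, sensitive):
        """Log one use and return True if the rules above would allow it."""
        allowed = not (sensitive and purpose in FORBIDDEN_PURPOSES)
        self.entries.append(
            {
                "when": datetime.now(timezone.utc).isoformat(),
                "user": user,
                "dataset": dataset,
                "purpose": purpose,
                "allowed": allowed,
            }
        )
        return allowed

    def violations(self):
        """Return the logged uses that broke the use limitations."""
        return [entry for entry in self.entries if not entry["allowed"]]


log = UsageLog()
log.record("analyst", "hospital_visits", "research", sensitive=True)
log.record("vendor", "hospital_visits", "marketing", sensitive=True)
print([entry["user"] for entry in log.violations()])  # ['vendor']
```

The log keeps forbidden uses as well as allowed ones, since the rules rely on accountability after the fact rather than on blocking collection.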


The best way to get started with datahub is to install it as follows. Set up a virtualenv so that the installation stays in its own directory:

virtualenv --no-site-packages datahubENV
New python executable in datahubENV/bin/python
Installing setuptools............done.

source datahubENV/bin/activate



paster create --list-templates
paster create -t datahub


paster create -t datahub


 PasteScript#basic_package  A basic setuptools-enabled package
 datahub#datahub            DataHub is a tool to help you datamine(crawl, parse, and load) any data.

Enter project name: myproject
  egg:      myproject
  package:  myproject
  project:  myproject
Enter version (Version (like 0.1)) ['']: 0.1
Enter description (One-line description of the package) ['']: my project
Enter long_description (Multi-line description (in reST)) ['']: this is long description
Enter keywords (Space-separated keywords/tags) ['']: datahub dataprocess gov
Enter author (Author name) ['']: myname
Enter author_email (Author email) ['']: myemail
Enter url (URL of homepage) ['']: mywebsite
Enter license_name (License name) ['']: gpl
Enter zip_safe (True/False: if the package can be distributed as a .zip file) [False]:
Creating template basic_package
Creating directory ./myproject
  Recursing into +package+
    Creating ./myproject/myproject/
    Copying __init__.py to ./myproject/myproject/__init__.py
  Copying setup.cfg to ./myproject/setup.cfg
  Copying setup.py_tmpl to ./myproject/setup.py
Creating template datahub
  Recursing into +package+
    Copying README.txt_tmpl to ./myproject/myproject/README.txt
    Recursing into crawl
      Creating ./myproject/myproject/crawl/
      Copying Readme.txt_tmpl to ./myproject/myproject/crawl/Readme.txt
      Copying __init__.py to ./myproject/myproject/crawl/__init__.py
      Copying crawl.sh to ./myproject/myproject/crawl/crawl.sh
      Copying download.sh to ./myproject/myproject/crawl/download.sh
      Copying download_list.txt_tmpl to ./myproject/myproject/crawl/download_list.txt
      Copying harvestman-+package+.xml to ./myproject/myproject/crawl/harvestman-myproject.xml
    Recursing into hdf5
      Creating ./myproject/myproject/hdf5/
      Copying READEM_hdf5.txt_tmpl to ./myproject/myproject/hdf5/READEM_hdf5.txt
      Copying __init__.py to ./myproject/myproject/hdf5/__init__.py
    Recursing into load
      Creating ./myproject/myproject/load/
      Copying __init__.py to ./myproject/myproject/load/__init__.py
      Copying load.py to ./myproject/myproject/load/load.py
      Copying load.sh to ./myproject/myproject/load/load.sh
      Copying model.py to ./myproject/myproject/load/model.py
    Recursing into parse
      Creating ./myproject/myproject/parse/
      Copying __init__.py to ./myproject/myproject/parse/__init__.py
      Copying parse.sh_tmpl to ./myproject/myproject/parse/parse.sh
    Copying process.sh_tmpl to ./myproject/myproject/process.sh
    Recursing into wiki
      Creating ./myproject/myproject/wiki/
      Copying REAME.wiki_tmpl to ./myproject/myproject/wiki/REAME.wiki
Running /home/lucas/tmp/lmENV/bin/python setup.py egg_info
Manually creating paster_plugins.txt (deprecated! pass a paster_plugins keyword to setup() instead)
Adding datahub to paster_plugins.txt

Go into the myproject folder and start coding. The folder structure looks like this:

|-- myproject
|   |-- README.txt
|   |-- __init__.py
|   |-- crawl
|   |   |-- Readme.txt
|   |   |-- __init__.py
|   |   |-- crawl.sh
|   |   |-- download.sh
|   |   |-- download_list.txt
|   |   `-- harvestman-myproject.xml
|   |-- hdf5
|   |   |-- READEM_hdf5.txt
|   |   `-- __init__.py
|   |-- load
|   |   |-- __init__.py
|   |   |-- load.py
|   |   |-- load.sh
|   |   `-- model.py
|   |-- parse
|   |   |-- __init__.py
|   |   `-- parse.sh
|   |-- process.sh
|   `-- wiki
|       `-- REAME.wiki
|-- myproject.egg-info
|   |-- PKG-INFO
|   |-- SOURCES.txt
|   |-- dependency_links.txt
|   |-- entry_points.txt
|   |-- not-zip-safe
|   |-- paster_plugins.txt
|   `-- top_level.txt
|-- setup.cfg
`-- setup.py







cd crawl
# Edit download_list.txt and add the URLs of the files you want to download
vi download_list.txt
sh download.sh
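
For readers who want to see the download step in Python, it can be approximated as below. This is a minimal sketch, not the contents of download.sh, which are not shown here. It assumes download_list.txt holds one URL per line. The helper names (read_download_list, target_name, download_all), the handling of `#` comment lines, and the `downloads` directory are hypothetical.

```python
import os
import urllib.request
from urllib.parse import urlparse


def read_download_list(path):
    """Return the URLs in a list file, skipping blank lines and # comments."""
    urls = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith("#"):
                urls.append(line)
    return urls


def target_name(url):
    """Derive a local file name from the last path segment of a URL."""
    name = os.path.basename(urlparse(url).path)
    return name or "index.html"


def download_all(list_path, dest_dir="downloads"):
    """Fetch every listed URL into dest_dir and return the saved paths."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for url in read_download_list(list_path):
        target = os.path.join(dest_dir, target_name(url))
        urllib.request.urlretrieve(url, target)
        saved.append(target)
    return saved
```

Calling download_all("download_list.txt") from inside the crawl folder would then save each listed file under crawl/downloads.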




sh parse.sh



sh load.sh
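
The load folder ships load.py and model.py, but their template contents are not shown here. As a rough illustration of what a load step does, the sketch below inserts parsed CSV rows into a database using Python's standard sqlite3 module. The table name `records`, its `name` and `value` columns, and both helper functions are assumptions made for the example, not datahub's actual model.

```python
import csv
import io
import sqlite3


def create_schema(conn):
    """Create the hypothetical target table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (name TEXT NOT NULL, value REAL)"
    )


def load_rows(conn, csv_file):
    """Insert every CSV row (columns: name, value) and return the row count."""
    reader = csv.DictReader(csv_file)
    rows = [(row["name"], float(row["value"])) for row in reader]
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT INTO records (name, value) VALUES (?, ?)", rows)
    return len(rows)


# Usage with in-memory data; a real run would open the parsed file instead.
conn = sqlite3.connect(":memory:")
create_schema(conn)
sample = io.StringIO("name,value\nalpha,1.5\nbeta,2\n")
loaded = load_rows(conn, sample)
total = conn.execute("SELECT SUM(value) FROM records").fetchone()[0]
print(loaded, total)  # 2 3.5
```

Wrapping the insert in `with conn:` means a malformed row aborts the whole batch instead of leaving the table half loaded.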



sh process.sh
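
What process.sh does is not shown here. If the goal is to run the whole pipeline in one go, a driver along the following lines is one way to do it. This is a sketch under assumptions: the stage order, the run_stages helper, and the choice to stop at the first failing stage are hypothetical, and only the folder and script names come from the tree above.

```python
import os
import subprocess

# Hypothetical stage order: (folder, script) pairs taken from the tree above.
STAGES = [
    ("crawl", "download.sh"),
    ("parse", "parse.sh"),
    ("load", "load.sh"),
]


def run_stages(package_dir, stages=STAGES):
    """Run each stage script from inside its own folder, in order.

    Returns the folders whose script exited with status 0 and stops at the
    first failure, so later stages never see incomplete input.
    """
    completed = []
    for folder, script in stages:
        workdir = os.path.join(package_dir, folder)
        result = subprocess.run(["sh", script], cwd=workdir)
        if result.returncode != 0:
            print(f"{folder}/{script} failed with exit code {result.returncode}")
            break
        completed.append(folder)
    return completed
```

Each script runs with its own folder as the working directory, which matches the `cd crawl` followed by `sh download.sh` pattern used in the manual steps.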


