HDX Python Utilities
Detailed description of the hdx-python-utilities Python project
The HDX Python Utilities library provides a range of helpful utilities:
- Easy downloading of files with support for authentication, streaming and hashing
- Loading and saving JSON and YAML (inc. with OrderedDict)
- Database utilities (inc. connecting through SSH and SQLAlchemy helpers)
- Dictionary and list utilities
- HTML utilities (inc. BeautifulSoup helper)
- Compare files (eg. for testing)
- Simple emailing
- Easy logging setup
- Path utilities
- Text processing
- Py3-like raise from for Py2
- Check valid UUID
- Easy building and packaging
This library is part of the Humanitarian Data Exchange (HDX) project. If you have humanitarian-related data, please upload it to HDX.
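Usage
The library has detailed API documentation, which can be found here: http://ocha-dap.github.io/hdx-python-utilities/. The code for the library is here: https://github.com/ocha-dap/hdx-python-utilities.
Downloading files
Various utilities to help with downloading files. Retrying is included by default.
For example, given the YAML file extraparams.yml: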
mykey:
    basic_auth: "XXXXXXXX"
    locale: "en"
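We can create a downloader as shown below that will use the authentication defined in basic_auth and add the parameter locale=en to each request (e.g. for the GET request http://myurl/lala?param1=p1&locale=en):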
from collections import OrderedDict
from os.path import abspath
from hdx.utilities.downloader import Download

with Download(user_agent='test', extra_params_yaml='extraparams.yml', extra_params_lookup='mykey') as downloader:
    response = downloader.download(url)  # get requests library response
    json = response.json()
    # Download file to folder/filename
    f = downloader.download_file('http://myurl', post=False,
                                 parameters=OrderedDict([('b', '4'), ('d', '3')]),
                                 folder=tmpdir, filename=filename)
    filepath = abspath(f)
    # Read row by row from tabular file
    for row in downloader.get_tabular_rows('http://myurl/my.csv', dict_rows=True, headers=1):
        a = row['col']
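If we want a user agent that will be used in all relevant HDX Python Utilities methods (and in all HDX Python API methods too, if that library is included), it can be configured once and is then used automatically: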
UserAgent.set_global('test')
with Download() as downloader:
    response = downloader.download(url)  # get requests library response
Other useful functions:
# Build get url from url and dictionary of parameters
Download.get_url_for_get('http://www.lala.com/hdfa?a=3&b=4',
                         OrderedDict([('c', 'e'), ('d', 'f')]))
# == 'http://www.lala.com/hdfa?a=3&b=4&c=e&d=f'
# Extract url and dictionary of parameters from get url
Download.get_url_params_for_post('http://www.lala.com/hdfa?a=3&b=4',
                                 OrderedDict([('c', 'e'), ('d', 'f')]))
# == ('http://www.lala.com/hdfa',
#     OrderedDict([('a', '3'), ('b', '4'), ('c', 'e'), ('d', 'f')]))
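To make the behaviour of these two helpers concrete, the query-string handling they perform can be approximated with the standard library alone. This is a minimal sketch and not the library's implementation; the names build_get_url and split_url_for_post are hypothetical.

```python
from collections import OrderedDict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_get_url(url, parameters=None):
    """Hypothetical helper: append parameters to those already in the url."""
    parts = urlsplit(url)
    merged = OrderedDict(parse_qsl(parts.query))  # parameters already in the url
    merged.update(parameters or {})               # extra parameters go last
    return urlunsplit(parts._replace(query=urlencode(merged)))


def split_url_for_post(url, parameters=None):
    """Hypothetical helper: return (url without query string, all parameters)."""
    parts = urlsplit(url)
    merged = OrderedDict(parse_qsl(parts.query))
    merged.update(parameters or {})
    return urlunsplit(parts._replace(query='')), merged


extra = OrderedDict([('c', 'e'), ('d', 'f')])
get_url = build_get_url('http://www.lala.com/hdfa?a=3&b=4', extra)
post_url, post_params = split_url_for_post('http://www.lala.com/hdfa?a=3&b=4', extra)
```

Keeping the parameters in an OrderedDict means the resulting query string has a predictable order, which makes URLs easier to compare in tests.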
Loading and saving JSON and YAML
Examples:
# Load YAML
mydict = load_yaml('my_yaml.yml')
# Load 2 YAMLs and merge into dictionary
mydict = load_and_merge_yaml('my_yaml1.yml', 'my_yaml2.yml')
# Load YAML into existing dictionary
mydict = load_yaml_into_existing_dict(existing_dict, 'my_yaml.yml')
# Load JSON
mydict = load_json('my_json.json')
# Load 2 JSONs and merge into dictionary
mydict = load_and_merge_json('my_json1.json', 'my_json2.json')
# Load JSON into existing dictionary
mydict = load_json_into_existing_dict(existing_dict, 'my_json.json')
# Save dictionary to YAML file in pretty format
# preserving order if it is an OrderedDict
save_yaml(mydict, 'mypath.yml', pretty=True, sortkeys=False)
# Save dictionary to JSON file in compact form
# sorting the keys
save_json(mydict, 'mypath.json', pretty=False, sortkeys=True)
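The difference between pretty and compact output, and between sorted and insertion-ordered keys, can be illustrated with the standard json module. This is only a sketch of the idea under the assumption that the options behave like their json.dumps counterparts; dump_json_text and load_json_text are hypothetical names, not library functions.

```python
import json
from collections import OrderedDict


def dump_json_text(data, pretty=False, sortkeys=False):
    """Hypothetical helper: serialise to indented or compact JSON text."""
    if pretty:
        return json.dumps(data, indent=2, sort_keys=sortkeys)
    # Compact form: no whitespace after separators
    return json.dumps(data, separators=(',', ':'), sort_keys=sortkeys)


def load_json_text(text):
    """Hypothetical helper: parse JSON text, keeping key order in an OrderedDict."""
    return json.loads(text, object_pairs_hook=OrderedDict)


data = OrderedDict([('b', 1), ('a', [1, 2])])
compact_sorted = dump_json_text(data, pretty=False, sortkeys=True)
pretty_ordered = dump_json_text(data, pretty=True, sortkeys=False)
round_tripped = load_json_text(pretty_ordered)
```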
Database utilities
These are built on top of SQLAlchemy and simplify its setup.
Your SQLAlchemy database tables must inherit from Base in hdx.utilities.database, e.g.:
from sqlalchemy import Column, ForeignKey, Integer
from hdx.utilities.database import Base

class MyTable(Base):
    my_col = Column(Integer, ForeignKey(MyTable2.col2), primary_key=True)
Examples:
# Get SQLAlchemy session object given database parameters and
# if needed SSH parameters. If database is PostgreSQL, will poll
# till it is up.
with Database(database='db', host='1.2.3.4', username='user', password='pass',
              driver='driver', ssh_host='5.6.7.8', ssh_port=2222,
              ssh_username='sshuser', ssh_private_key='path_to_key') as session:
    session.query(...)
# Extract dictionary of parameters from SQLAlchemy url
result = Database.get_params_from_sqlalchemy_url(TestDatabase.sqlalchemy_url)
# Build SQLAlchemy url from dictionary of parameters
result = Database.get_sqlalchemy_url(**TestDatabase.params)
# Wait until PostgreSQL is up
Database.wait_for_postgres('mydatabase', 'myserver', 5432, 'myuser', 'mypass')
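SQLAlchemy database URLs follow the general pattern dialect+driver://username:password@host:port/database. The sketch below shows, with the standard library only, how such a URL can be assembled from parameters and split back into them. It is an illustration under stated assumptions: build_db_url and split_db_url are hypothetical helpers, and the dictionary keys they use are not necessarily the ones the library's own methods use.

```python
from urllib.parse import quote, urlsplit


def build_db_url(dialect, database, host=None, port=None,
                 username=None, password=None, driver=None):
    """Hypothetical helper: assemble a SQLAlchemy-style database URL."""
    scheme = dialect if driver is None else '%s+%s' % (dialect, driver)
    auth = ''
    if username:
        auth = quote(username, safe='')
        if password:
            auth += ':' + quote(password, safe='')
        auth += '@'
    netloc = auth + (host or '')
    if port:
        netloc += ':%d' % port
    return '%s://%s/%s' % (scheme, netloc, database)


def split_db_url(db_url):
    """Hypothetical helper: split a database URL into a dictionary.

    Percent-encoded usernames and passwords are returned still encoded.
    """
    parts = urlsplit(db_url)
    dialect, _, driver = parts.scheme.partition('+')
    return {'dialect': dialect, 'driver': driver or None,
            'username': parts.username, 'password': parts.password,
            'host': parts.hostname, 'port': parts.port,
            'database': parts.path.lstrip('/') or None}


db_url = build_db_url('postgresql', 'db', host='1.2.3.4', port=5432,
                      username='user', password='pass', driver='psycopg2')
db_params = split_db_url(db_url)
```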
Dictionary and list utilities
Examples:
# Merge dictionaries
d1 = {1: 1, 2: 2, 3: 3, 4: ['a', 'b', 'c']}
d2 = {2: 6, 5: 8, 6: 9, 4: ['d', 'e']}
result = merge_dictionaries([d1, d2])
assert result == {1: 1, 2: 6, 3: 3, 4: ['d', 'e'], 5: 8, 6: 9}
# Diff dictionaries
d1 = {1: 1, 2: 2, 3: 3, 4: {'a': 1, 'b': 'c'}}
d2 = {4: {'a': 1, 'b': 'c'}, 2: 2, 3: 3, 1: 1}
diff = dict_diff(d1, d2)
assert diff == {}
d2[3] = 4
diff = dict_diff(d1, d2)
assert diff == {3: (3, 4)}
# Add element to list in dict
d = dict()
dict_of_lists_add(d, 'a', 1)
assert d == {'a': [1]}
dict_of_lists_add(d, 2, 'b')
assert d == {'a': [1], 2: ['b']}
dict_of_lists_add(d, 'a', 2)
assert d == {'a': [1, 2], 2: ['b']}
# Spread items in list so similar items are further apart
input_list = [3, 1, 1, 1, 2, 2]
result = list_distribute_contents(input_list)
assert result == [1, 2, 1, 2, 1, 3]
# Get values for the same key in all dicts in list
input_list = [{'key': 'd', 1: 5}, {'key': 'd', 1: 1}, {'key': 'g', 1: 2},
              {'key': 'a', 1: 2}, {'key': 'a', 1: 3}, {'key': 'b', 1: 5}]
result = extract_list_from_list_of_dict(input_list, 'key')
assert result == ['d', 'd', 'g', 'a', 'a', 'b']
# Cast either keys or values or both in dictionary to type
d1 = {1: 2, 2: 2.0, 3: 5, 'la': 4}
assert key_value_convert(d1, keyfn=int) == {1: 2, 2: 2.0, 3: 5, 'la': 4}
assert key_value_convert(d1, keyfn=int, dropfailedkeys=True) == {1: 2, 2: 2.0, 3: 5}
d1 = {1: 2, 2: 2.0, 3: 5, 4: 'la'}
assert key_value_convert(d1, valuefn=int) == {1: 2, 2: 2.0, 3: 5, 4: 'la'}
assert key_value_convert(d1, valuefn=int, dropfailedvalues=True) == {1: 2, 2: 2.0, 3: 5}
# Cast keys in dictionary to integer
d1 = {1: 1, 2: 1.5, 3.5: 3, '4': 4}
assert integer_key_convert(d1) == {1: 1, 2: 1.5, 3: 3, 4: 4}
# Cast values in dictionary to integer
d1 = {1: 1, 2: 1.5, 3: '3', 4: 4}
assert integer_value_convert(d1) == {1: 1, 2: 1, 3: 3, 4: 4}
# Cast values in dictionary to float
d1 = {1: 1, 2: 1.5, 3: '3', 4: 4}
assert float_value_convert(d1) == {1: 1.0, 2: 1.5, 3: 3.0, 4: 4.0}
# Average values by key in two dictionaries
d1 = {1: 1, 2: 1.0, 3: 3, 4: 4}
d2 = {1: 2, 2: 2.0, 3: 5, 4: 4, 7: 3}
assert avg_dicts(d1, d2) == {1: 1.5, 2: 1.5, 3: 4, 4: 4}
# Read and write lists to csv
l = [[1, 2, 3, 'a'],
     [4, 5, 6, 'b'],
     [7, 8, 9, 'c']]
write_list_to_csv(l, filepath, headers=['h1', 'h2', 'h3', 'h4'])
newll = read_list_from_csv(filepath)
newld = read_list_from_csv(filepath, dict_form=True, headers=1)
assert newll == [['h1', 'h2', 'h3', 'h4'], ['1', '2', '3', 'a'], ['4', '5', '6', 'b'], ['7', '8', '9', 'c']]
assert newld == [{'h1': '1', 'h2': '2', 'h4': 'a', 'h3': '3'},
                 {'h1': '4', 'h2': '5', 'h4': 'b', 'h3': '6'},
                 {'h1': '7', 'h2': '8', 'h4': 'c', 'h3': '9'}]
# Convert command line arguments to dictionary
args = 'a=1,big=hello,1=3'
assert args_to_dict(args) == {'a': '1', 'big': 'hello', '1': '3'}
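For intuition about what a few of these helpers do, here are plain-Python approximations of three of them. They are sketches written for this explanation, not the library's code, and the function names are hypothetical. In particular, diff_shared_keys only compares keys present in both dictionaries, which may differ from how the library treats missing keys.

```python
def add_to_list_in_dict(dictionary, key, value):
    """Hypothetical helper: append value to the list stored under key."""
    dictionary.setdefault(key, []).append(value)


def diff_shared_keys(d1, d2):
    """Hypothetical helper: map each differing shared key to (old, new)."""
    return {k: (d1[k], d2[k]) for k in d1 if k in d2 and d1[k] != d2[k]}


def parse_key_value_args(args):
    """Hypothetical helper: turn 'a=1,b=2' into {'a': '1', 'b': '2'}."""
    return dict(pair.split('=', 1) for pair in args.split(','))


lists = {}
add_to_list_in_dict(lists, 'a', 1)
add_to_list_in_dict(lists, 'a', 2)
changed = diff_shared_keys({1: 1, 3: 3}, {1: 1, 3: 4})
parsed = parse_key_value_args('a=1,big=hello,1=3')
```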
HTML utilities
These are built on top of BeautifulSoup and simplify its setup.
Examples:
# Get soup for url with optional kwarg downloader=Download() object
soup = get_soup('http://myurl', user_agent='test')
# user agent can be set globally using:
# UserAgent.set_global('test')
tag = soup.find(id='mytag')
# Get text of tag stripped of leading and trailing whitespace
# and newlines and with non-breaking spaces (&nbsp;) replaced with space
result = get_text(tag)
# Extract HTML table as list of dictionaries
result = extract_table(tabletag)
Comparing files
Compare two files:
result = compare_files(testfile1, testfile2)
# Result is of form eg.:
# ["- coal ,3 ,7.4 ,'needed'\n",
# '? ^\n',
# "+ coal ,1 ,7.4 ,'notneeded'\n",
# '? ^ +++\n']
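The result format above, with its '- ', '+ ' and '? ' prefixes, matches the notation produced by the standard library's difflib.ndiff. As a rough illustration of how such a result can be obtained (an assumption about the approach, not a copy of the library's code), the hypothetical diff_lines helper below keeps only the lines that ndiff marks as different.

```python
import difflib


def diff_lines(lines1, lines2):
    """Hypothetical helper: return only differing lines in ndiff notation."""
    return [line for line in difflib.ndiff(lines1, lines2)
            if not line.startswith('  ')]  # two spaces mark unchanged lines


old_lines = ["coal,3,7.4\n", "gas,2,1.0\n"]
new_lines = ["coal,1,7.4\n", "gas,2,1.0\n"]
differences = diff_lines(old_lines, new_lines)
```

An empty list means the two inputs are identical, which is convenient for assertions in tests.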
Emailing
Example of setting up and sending an email:
smtp_initargs = {
    'host': 'localhost',
    'port': 123,
    'local_hostname': 'mycomputer.fqdn.com',
    'timeout': 3,
    'source_address': ('machine', 456),
}
username = 'user@user.com'
password = 'pass'
email_config_dict = {
    'connection_type': 'ssl',
    'username': username,
    'password': password
}
email_config_dict.update(smtp_initargs)
recipients = ['larry@gmail.com', 'moe@gmail.com', 'curly@gmail.com']
subject = 'hello'
text_body = 'hello there'
html_body = """\
<html>
<head></head>
<body>
<p>Hi!<br>
How are you?<br>
Here is the <a href="https://www.python.org">link</a> you wanted.
</p>
</body>
</html>
"""
sender = 'me@gmail.com'
with Email(email_config_dict=email_config_dict) as email:
    email.send(recipients, subject, text_body, sender=sender)
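A message with both a plain-text and an HTML body is normally sent as a multipart/alternative email. The sketch below builds such a message with the standard library's email package, without sending anything. It only illustrates the message structure; build_message is a hypothetical helper and says nothing about how the library constructs its messages internally.

```python
from email.message import EmailMessage


def build_message(sender, recipients, subject, text_body, html_body=None):
    """Hypothetical helper: build a text email with an optional HTML alternative."""
    msg = EmailMessage()
    msg['From'] = sender
    msg['To'] = ', '.join(recipients)
    msg['Subject'] = subject
    msg.set_content(text_body)  # plain-text part
    if html_body is not None:
        # Turns the message into multipart/alternative with an HTML part
        msg.add_alternative(html_body, subtype='html')
    return msg


message = build_message('me@gmail.com', ['larry@gmail.com', 'moe@gmail.com'],
                        'hello', 'hello there',
                        html_body='<html><body><p>Hi!</p></body></html>')
```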
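Configuring logging
The library provides coloured logs and a simple default setup, which should be adequate for most cases. If you wish to change the logging configuration from the defaults, you will need to call setup_logging with arguments.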
from hdx.utilities.easy_logging import setup_logging
...
logger = logging.getLogger(__name__)
setup_logging(KEYWORD ARGUMENTS)
KEYWORD ARGUMENTS can be:
Choose | Argument | Type | Value | Default |
---|---|---|---|---|
One of: | logging_config_dict | dict | Logging configuration dictionary | |
or | logging_config_json | str | Path to JSON Logging configuration | |
or | logging_config_yaml | str | Path to YAML Logging configuration | Library's internal logging_configuration.yml |
One of: | smtp_config_dict | dict | Email Logging configuration dictionary | |
or | smtp_config_json | str | Path to JSON Email Logging configuration | |
or | smtp_config_yaml | str | Path to YAML Email Logging configuration | |
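Do not supply any of smtp_config_dict, smtp_config_json or smtp_config_yaml unless you are using the default logging configuration!
If you are using the default logging configuration, you have the option of a default SMTP handler that sends an email in the event of a CRITICAL error, by supplying one of smtp_config_dict, smtp_config_json or smtp_config_yaml. Here is a template of a YAML file that can be passed as the smtp_config_yaml parameter: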
handlers:
    error_mail_handler:
        toaddrs: EMAIL_ADDRESSES
        subject: "RUN FAILED: MY_PROJECT_NAME"
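Unless you override it, the mail server mailhost for the default SMTP handler is localhost and the from address fromaddr is noreply@localhost.
To use logging in your files, simply add the line below to the top of each Python file: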
logger = logging.getLogger(__name__)
Then use the logger like this:
logger.debug('DEBUG message')
logger.info('INFORMATION message')
logger.warning('WARNING message')
logger.error('ERROR message')
logger.critical('CRITICAL error message')
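For readers who have not written a logging configuration dictionary before, the sketch below shows one in the schema of the standard library's logging.config.dictConfig. Whether setup_logging accepts exactly this schema for logging_config_dict is an assumption here, so the example applies the dictionary with the standard library directly rather than through the library.

```python
import logging
import logging.config

# A minimal configuration in the standard dictConfig schema
logging_config = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'plain': {'format': '%(levelname)s - %(name)s - %(message)s'},
    },
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
            'formatter': 'plain',
            'level': 'DEBUG',
        },
    },
    'root': {'level': 'INFO', 'handlers': ['console']},
}

logging.config.dictConfig(logging_config)
logger = logging.getLogger(__name__)
logger.info('INFORMATION message')   # emitted: root level is INFO
logger.debug('DEBUG message')        # suppressed: below the root level
```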
Path utilities
Examples:
# Gets temporary directory from environment variable
# TEMP_DIR and falls back to os function
temp_folder = get_temp_dir()
# Gets temporary directory from environment variable
# TEMP_DIR and falls back to os function,
# optionally appends the given folder, creates the
# folder and on exiting, deletes the folder
with temp_dir('papa') as tempdir:
    ...
# Get current directory of script
dir = script_dir(ANY_PYTHON_OBJECT_IN_SCRIPT)
# Get current directory of script with filename appended
path = script_dir_plus_file('myfile.txt', ANY_PYTHON_OBJECT_IN_SCRIPT)
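The temporary directory behaviour described in the comments above (environment variable first, operating system default second, optional subfolder created on entry and deleted on exit) can be sketched with the standard library as follows. The names get_temp_dir_sketch and temp_dir_sketch are hypothetical, and the choice to delete only when a subfolder was requested is a safety decision made for this sketch, not a statement about the library.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


def get_temp_dir_sketch():
    """Hypothetical helper: TEMP_DIR environment variable, else the OS default."""
    return os.getenv('TEMP_DIR', tempfile.gettempdir())


@contextmanager
def temp_dir_sketch(folder=None):
    """Hypothetical helper: create the folder on entry, delete it on exit."""
    path = get_temp_dir_sketch()
    if folder:
        path = os.path.join(path, folder)
    os.makedirs(path, exist_ok=True)
    try:
        yield path
    finally:
        if folder:  # never delete the shared temporary directory itself
            shutil.rmtree(path)


with temp_dir_sketch('papa_sketch_demo') as tempdir:
    existed_inside = os.path.isdir(tempdir)
exists_after = os.path.exists(tempdir)
```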
Text processing
Examples:
# Replace multiple strings in a string simultaneously
a = 'The quick brown fox jumped over the lazy dog. It was so fast!'
result = multiple_replace(a, {'quick': 'slow', 'fast': 'slow', 'lazy': 'busy'})
assert result == 'The slow brown fox jumped over the busy dog. It was so slow!'
# Extract words from a string sentence into a list
result = get_words_in_sentence("Korea (Democratic People's Republic of)")
assert result == ['Korea', 'Democratic', "People's", 'Republic', 'of']
# Find matching text in strings
a = 'The quick brown fox jumped over the lazy dog. It was so fast!'
b = 'The quicker brown fox leapt over the slower fox. It was so fast!'
c = 'The quick brown fox climbed over the lazy dog. It was so fast!'
result = get_matching_text([a, b, c], match_min_size=10)
assert result == ' brown fox over the It was so fast!'
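Replacing several strings "simultaneously" matters because chained calls to str.replace can re-replace text produced by an earlier step. One common way to get a single-pass replacement is a regular expression alternation, sketched below. This is an illustration of the technique, not necessarily how the library implements it, and multiple_replace_sketch is a hypothetical name.

```python
import re


def multiple_replace_sketch(text, replacements):
    """Hypothetical helper: replace all keys with their values in one pass."""
    if not replacements:
        return text
    # Longest keys first so that a key is preferred over its own prefix
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: replacements[match.group(0)], text)


sentence = 'The quick brown fox jumped over the lazy dog. It was so fast!'
replaced = multiple_replace_sketch(
    sentence, {'quick': 'slow', 'fast': 'slow', 'lazy': 'busy'})
swapped = multiple_replace_sketch('ab', {'a': 'b', 'b': 'a'})
```

The swap case shows the point of the single pass: chained replacements would turn 'ab' into 'aa' or 'bb', while the single pass gives 'ba'.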
Raise from
Examples:
# Raise an exception from another exception on Py2 or Py3
try:
    ...
except IOError as e:
    raisefrom(IOError, 'My Error Message', e)
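For comparison, this is the native Python 3 syntax that the helper mirrors: raise ... from ... records the original exception as the __cause__ of the new one. The function read_text and the path used are hypothetical and exist only for this illustration.

```python
def read_text(path):
    """Hypothetical function: wrap a low-level IOError in a higher-level error."""
    try:
        with open(path) as handle:
            return handle.read()
    except IOError as e:
        # Python 3 only: chain the new exception to the original one
        raise RuntimeError('My Error Message') from e


try:
    read_text('/nonexistent/folder/for/this/sketch.txt')
except RuntimeError as error:
    chained_cause = error.__cause__
```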
Valid UUID
Examples:
assert is_valid_uuid('jpsmith') is False
assert is_valid_uuid('c9bf9e57-1685-4c89-bafb-ff5af830be8a') is True
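A check of this kind can be built on the standard library's uuid module. The sketch below is one possible approach and is not taken from the library; looks_like_uuid is a hypothetical name, and defaulting to version 4 is an assumption made for the example.

```python
import uuid


def looks_like_uuid(value, version=4):
    """Hypothetical helper: True if value is a canonical UUID string."""
    try:
        parsed = uuid.UUID(value, version=version)
    except (ValueError, AttributeError, TypeError):
        return False
    # uuid.UUID forces the version bits, so compare against the input
    return str(parsed) == value.lower()


not_a_uuid = looks_like_uuid('jpsmith')
is_a_uuid = looks_like_uuid('c9bf9e57-1685-4c89-bafb-ff5af830be8a')
```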
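Easy building and packaging
The clean command of setup.py has been extended to use the --all flag by default and to clean the dist folder. Two new commands have been created. package calls the new clean command and also sdist and bdist_wheel. In other words, it cleans thoroughly and builds source and wheel distributions. publish publishes to PyPI and creates a git tag, e.g.: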
python setup.py clean
python setup.py package
python setup.py publish
To use these commands, they need to be set up in your project's setup.py.