Utilities for streaming large files (S3, HDFS, gzip, bz2...)

Detailed description of the smart_open Python project


What?

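smart_open is a Python 2 & Python 3 library for efficient streaming of very large files from/to S3, HDFS, WebHDFS, HTTP, or local storage. It supports transparent, on-the-fly (de)compression for a variety of formats.

smart_open is a drop-in replacement for Python's built-in open(): it can do anything open can (100% compatible, falling back to the native open wherever possible), plus many useful extras.

smart_open is well tested, well documented, and has a simple, Pythonic API: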

>>> from smart_open import open
>>>
>>> # stream lines from an S3 object
>>> for line in open('s3://commoncrawl/robots.txt'):
...     print(repr(line))
...     break
'User-Agent: *\n'

>>> # stream from/to compressed files, with transparent (de)compression:
>>> for line in open('smart_open/tests/test_data/1984.txt.gz', encoding='utf-8'):
...     print(repr(line))
'It was a bright cold day in April, and the clocks were striking thirteen.\n'
'Winston Smith, his chin nuzzled into his breast in an effort to escape the vile\n'
'wind, slipped quickly through the glass doors of Victory Mansions, though not\n'
'quickly enough to prevent a swirl of gritty dust from entering along with him.\n'

>>> # can use context managers too:
>>> with open('smart_open/tests/test_data/1984.txt.gz') as fin:
...     with open('smart_open/tests/test_data/1984.txt.bz2', 'w') as fout:
...         for line in fin:
...             fout.write(line)

>>> # can use any IOBase operations, like seek
>>> with open('s3://commoncrawl/robots.txt', 'rb') as fin:
...     for line in fin:
...         print(repr(line.decode('utf-8')))
...         break
...     offset = fin.seek(0)  # seek to the beginning
...     print(fin.read(4))
'User-Agent: *\n'
b'User'

>>> # stream from HTTP
>>> for line in open('http://example.com/index.html'):
...     print(repr(line))
...     break
'<!doctype html>\n'

Other examples of URLs that smart_open accepts:

s3://my_bucket/my_key
s3://my_key:my_secret@my_bucket/my_key
s3://my_key:my_secret@my_server:my_port@my_bucket/my_key
hdfs:///path/file
hdfs://path/file
webhdfs://host:port/path/file
./local/path/file
~/local/path/file
local/path/file
./local/path/file.gz
file:///home/user/file
file:///home/user/file.bz2
[ssh|scp|sftp]://username@host//path/file
[ssh|scp|sftp]://username@host/path/file
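
To see how URLs like these break down into a scheme, optional credentials, and a path, they can be taken apart with the standard library. This is only an illustration built on urllib.parse, not smart_open's own parser, and the helper name describe_url is hypothetical.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into the parts a transport layer would care about.

    Hypothetical helper for illustration; smart_open has its own parsing logic.
    """
    parts = urlsplit(url)
    # An empty scheme means a plain local path such as ./local/path/file.gz
    scheme = parts.scheme or 'file'
    # Anything before the last '@' in the network location is treated as credentials.
    credentials, _, location = parts.netloc.rpartition('@')
    return {
        'scheme': scheme,
        'credentials': credentials or None,
        'location': location,
        'path': parts.path,
    }


for url in ('s3://my_bucket/my_key',
            's3://my_key:my_secret@my_bucket/my_key',
            'webhdfs://host:port/path/file',
            './local/path/file.gz'):
    print(url, '->', describe_url(url))
```

The scheme is what decides which transport handles the URL; a path with no scheme is treated here as a local file.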

For detailed API information, see the online help:

help('smart_open')

or click here to view the help in your browser.

More examples:

>>> import boto3
>>>
>>> # stream content *into* S3 (write mode) using a custom session
>>> url = 's3://smart-open-py37-benchmark-results/test.txt'
>>> lines = [b'first line\n', b'second line\n', b'third line\n']
>>> transport_params = {'session': boto3.Session(profile_name='smart_open')}
>>> with open(url, 'wb', transport_params=transport_params) as fout:
...     for line in lines:
...         bytes_written = fout.write(line)

# stream from HDFS
for line in open('hdfs://user/hadoop/my_file.txt', encoding='utf8'):
    print(line)

# stream from WebHDFS
for line in open('webhdfs://host:port/user/hadoop/my_file.txt'):
    print(line)

# stream content *into* HDFS (write mode):
with open('hdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')

# stream content *into* WebHDFS (write mode):
with open('webhdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')

# stream from a completely custom s3 server, like s3proxy:
for line in open('s3u://user:secret@host:port@mybucket/mykey.txt'):
    print(line)

# Stream to Digital Ocean Spaces bucket providing credentials from boto profile
transport_params = {
    'session': boto3.Session(profile_name='digitalocean'),
    'resource_kwargs': {
        'endpoint_url': 'https://ams3.digitaloceanspaces.com',
    }
}
with open('s3://bucket/key.txt', 'wb', transport_params=transport_params) as fout:
    fout.write(b'here we stand')

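Why?

Working with large S3 files using Amazon's default Python libraries, boto and boto3, is a pain. Its key.set_contents_from_string() and key.get_contents_as_string() methods only work for small files (loaded in RAM, no streaming). There are nasty hidden gotchas when using boto's multipart upload functionality, which is needed for large files, and it takes a lot of boilerplate.

smart_open shields you from that. It builds on boto3 but offers a cleaner, Pythonic API. The result is less code for you to write and fewer bugs to make.

Installation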

pip install smart_open

Or, if you prefer to install from the source tar.gz:

python setup.py test  # run unit tests
python setup.py install

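To run the unit tests (optional), you will also need to install mock, moto and responses (pip install mock moto responses). The tests are also run automatically with Travis CI on every commit push and pull request.

Supported Compression Formats

smart_open allows reading and writing gzip and bzip2 files. They are handled transparently over HTTP, S3, and other protocols too, based on the extension of the file being opened. You can easily add support for other file extensions and compression formats. For example, to open xz-compressed files: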

>>> import lzma, os
>>> from smart_open import open, register_compressor

>>> def _handle_xz(file_obj, mode):
...     return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)

>>> register_compressor('.xz', _handle_xz)

>>> with open('smart_open/tests/test_data/crime-and-punishment.txt.xz') as fin:
...     text = fin.read()
>>> print(len(text))
1696

lzma is in the standard library in Python 3.3 and later. For 2.7, use backports.lzma.
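
The idea of choosing a (de)compressor from the file extension can be sketched with the standard library alone. The table _OPENERS and the function open_by_extension below are hypothetical stand-ins that mimic the pattern for local files in binary mode; they are not smart_open's implementation.

```python
import bz2
import gzip
import io
import os
import tempfile

# Hypothetical table: file extension -> callable(path, mode) returning a binary stream.
_OPENERS = {
    '.gz': gzip.open,
    '.bz2': bz2.open,
}


def open_by_extension(path, mode='rb'):
    """Open a local file in binary mode, (de)compressing according to its extension."""
    _, ext = os.path.splitext(path)
    # Unknown extensions fall back to a plain, uncompressed file.
    opener = _OPENERS.get(ext, io.open)
    return opener(path, mode)


# Round trip: write compressed bytes, then read them back transparently.
path = os.path.join(tempfile.mkdtemp(), 'example.txt.gz')
with open_by_extension(path, 'wb') as fout:
    fout.write(b'hello world\n')
with open_by_extension(path, 'rb') as fin:
    roundtrip = fin.read()
print(roundtrip)
```

Adding a new format in this sketch amounts to adding one entry to the table, which is the same role register_compressor plays in the xz example.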

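Transport-specific Options

smart_open supports a wide range of transport options out of the box, including:

  • HTTP, HTTPS (read-only)
  • SSH, SCP and SFTP
  • WebHDFS

Each option involves setting up its own set of parameters. For example, to access S3 you often need to set up authentication, such as API keys or a profile name. smart_open's open function accepts a keyword argument transport_params, which takes additional parameters for the transport layer. Here are some examples of using this parameter: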

>>> import boto3
>>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(session=boto3.Session()))
>>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(buffer_size=1024))
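
When several call sites need the same transport settings, it can be convenient to build the dict in one place. The helper below is a hypothetical convenience, not part of smart_open; the only keys it uses are session and buffer_size, taken from the examples above.

```python
def make_transport_params(session=None, buffer_size=None):
    """Collect only the options that were actually given into a dict.

    Hypothetical helper: anything left as None is omitted, so the transport
    layer would fall back to its own defaults for that option.
    """
    candidates = {'session': session, 'buffer_size': buffer_size}
    return {name: value for name, value in candidates.items() if value is not None}


params = make_transport_params(buffer_size=1024)
print(params)

# Assumed usage (needs smart_open and network access, so it is left commented out):
# from smart_open import open
# fin = open('s3://commoncrawl/robots.txt', transport_params=params)
```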

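For the full list of keyword arguments supported by each transport option, see the documentation, for example with help('smart_open.open').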

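S3 Credentials

smart_open uses the boto3 library to talk to S3. boto3 has several mechanisms for determining which credentials to use. By default, smart_open defers to boto3 and lets it take care of the credentials. There are several ways to override this behavior.

The first is to pass a boto3.Session object as a transport parameter to the open function. You can customize the credentials when constructing the session. smart_open will then use the session when talking to S3.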

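The second option is to specify the credentials within the S3 URL itself, in the form s3://my_key:my_secret@my_bucket/my_key.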

Important: the two methods above are mutually exclusive. If you pass an AWS session and the URL also contains credentials, smart_open will ignore the latter.
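
A rough sketch of the two approaches side by side. Both helper names are hypothetical. open_with_session assumes boto3 and smart_open are installed and imports them lazily, so the URL helper can be used on its own. Credentials containing characters such as / or @ may need extra care that this sketch does not attempt.

```python
def s3_url_with_credentials(access_key, secret_key, bucket, key):
    """Build an S3 URL of the form s3://my_key:my_secret@my_bucket/my_key."""
    return 's3://{}:{}@{}/{}'.format(access_key, secret_key, bucket, key)


def open_with_session(url, mode='rb', **session_kwargs):
    """Open an S3 URL using an explicitly constructed boto3 session.

    Assumes boto3 and smart_open are installed; session_kwargs are passed
    straight to boto3.Session (for example profile_name='smart_open').
    """
    import boto3
    from smart_open import open as smart_open_open

    session = boto3.Session(**session_kwargs)
    return smart_open_open(url, mode, transport_params={'session': session})


print(s3_url_with_credentials('my_key', 'my_secret', 'my_bucket', 'my_key'))
```

Since a session takes precedence over credentials in the URL, it is simplest to pick one approach per call.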

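Iterating Over an S3 Bucket's Contents

Since going over all (or selected) keys in an S3 bucket is a very common operation, there is also an extra function that does this efficiently, processing the bucket keys in parallel (using multiprocessing).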
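
The general pattern (filter the keys, fetch the accepted ones in parallel, yield key/content pairs) can be sketched without touching S3. Everything here is a stand-in: FAKE_BUCKET is an in-memory dict, fetch and iter_bucket are hypothetical names, and a thread pool replaces separate processes to keep the example self-contained.

```python
from multiprocessing.pool import ThreadPool

# In-memory stand-in for a bucket: key -> content.
FAKE_BUCKET = {
    'logs/2010.txt': b'a',
    'logs/2011.txt': b'bb',
    'logs/readme.md': b'ccc',
}


def fetch(key):
    """Stand-in for downloading one object; returns (key, content)."""
    return key, FAKE_BUCKET[key]


def iter_bucket(prefix='', accept_key=lambda key: True, workers=4):
    """Yield (key, content) for every accepted key under prefix, fetched in parallel."""
    keys = [k for k in sorted(FAKE_BUCKET) if k.startswith(prefix) and accept_key(k)]
    with ThreadPool(workers) as pool:
        # imap_unordered hands back each result as soon as a worker finishes it,
        # so the order of the yielded pairs is not guaranteed.
        for key, content in pool.imap_unordered(fetch, keys):
            yield key, content


results = dict(iter_bucket(prefix='logs/', accept_key=lambda k: k.endswith('.txt')))
print(sorted(results))
```

Collecting into a dict, as in the usage line, sidesteps the unordered arrival of results.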

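Migrating to the New open Function

Since 1.8.1, there is a smart_open.open function that replaces smart_open.smart_open. The new function offers several advantages over the old one:

  • 100% compatible with the built-in open function (aka io.open): it accepts all the parameters that the built-in open accepts.
  • The default open mode is now "r", the same as for the built-in open. The default for the old smart_open.smart_open function was "rb".
  • Fully documented keyword parameters (try help("smart_open.open")).

The instructions below will help you migrate to the new function painlessly.

First, update your imports so that they bring in smart_open.open rather than smart_open.smart_open.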

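In general, smart_open uses io.open directly where possible, so if your code already uses open for local file I/O, it will continue to work. If you want to keep using the built-in open function, for example for debugging, you can import smart_open and call smart_open.open explicitly.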

The default read mode is now "r" (read text). If your code was implicitly relying on the default mode being "rb" (read binary), you will need to update it and pass "rb" explicitly. Before, a bare call to smart_open.smart_open with only a URL returned bytes; after, the same result requires calling smart_open.open with "rb" as the mode.
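
The practical difference between the two defaults can be seen with the built-in open on a local file, which the new function is documented as matching. This sketch uses only the standard library, and the temporary file path is arbitrary.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'example.txt')
with open(path, 'wb') as fout:
    fout.write(b'first line\n')

# Mode "r" (the new default) yields decoded text ...
with open(path, 'r', encoding='utf-8') as fin:
    as_text = fin.read()

# ... while "rb" (the old default) yields raw bytes and must now be requested explicitly.
with open(path, 'rb') as fin:
    as_bytes = fin.read()

print(type(as_text).__name__, type(as_bytes).__name__)
```

Code that calls .decode() on each line, or compares lines against bytes literals, is the kind that depends on the old default.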

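The ignore_extension keyword parameter is now called ignore_ext. It otherwise behaves identically.

The most significant change is in the handling of keyword parameters for the transport layer, e.g. HTTP, S3, etc. The old function accepted these directly as keyword arguments.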

The new function accepts a transport_params keyword argument. It is a dict: put your transport parameters in that dictionary.

Renamed parameters:

  • s3_upload -> multipart_upload_kwargs
  • s3_session -> session

Removed parameters:

  • profile_name

The profile_name parameter has been removed. Pass an entire boto3.Session object instead. Before, the profile name was passed directly as a profile_name keyword argument; after, construct boto3.Session(profile_name=...) and pass it as the session entry of transport_params.
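
One way to apply these changes mechanically is a small translation step from old-style keyword arguments to a transport_params dict. The function migrate_kwargs is a hypothetical helper, not part of smart_open. The new key name multipart_upload_kwargs is an assumption about the s3_upload rename and should be checked against help("smart_open.open"). The profile_name branch assumes boto3 is installed and is only reached when that argument is present.

```python
# Old keyword name -> new transport_params key (the first target name is an assumption).
RENAMED = {
    's3_upload': 'multipart_upload_kwargs',
    's3_session': 'session',
}


def migrate_kwargs(old_kwargs):
    """Translate old-style keyword arguments into a transport_params dict."""
    params = {}
    for name, value in old_kwargs.items():
        if name == 'profile_name':
            # The parameter was removed: wrap the profile in a session instead.
            import boto3  # assumed to be installed when this branch is used
            params['session'] = boto3.Session(profile_name=value)
        else:
            # Renamed keys are mapped; everything else passes through unchanged.
            params[RENAMED.get(name, name)] = value
    return params


print(migrate_kwargs({'s3_session': 'my-session-object', 'buffer_size': 1024}))
```

The resulting dict is what would be passed as transport_params to the new open function.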

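For the full list of accepted parameter names, see help("smart_open.open"), or view the help online.

If you pass an invalid parameter name, the smart_open.open function will warn you about it. Keep an eye on your logs for WARNING messages from smart_open.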
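
To make such warnings visible, Python's logging has to be configured to show WARNING-level records. The sketch below assumes the library logs under a logger whose name starts with smart_open (the common convention of naming loggers after the module). The child logger name smart_open.example and the message text are made up to simulate a warning.

```python
import io
import logging

# Capture records in memory so the example is self-contained; in an application,
# logging.basicConfig(level=logging.WARNING) would normally be enough.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(levelname)s %(name)s: %(message)s'))

logger = logging.getLogger('smart_open')  # assumed logger name
logger.setLevel(logging.WARNING)
logger.addHandler(handler)

# Simulated warning emitted through a hypothetical child logger.
logging.getLogger('smart_open.example').warning(
    'ignoring unsupported keyword: %r', 'profile_name')

captured = stream.getvalue()
print(captured, end='')
```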

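Comments, Bug Reports

smart_open lives on GitHub. You can file issues or pull requests there. Suggestions, pull requests and improvements are welcome!


smart_open is open source software released under the MIT license. Copyright (c) 2015-now Radim Řehůřek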
