acora Python package - PyPI

Fast multi-keyword search engine for text strings

Project description for the acora Python package

What is Acora?
Features
How do I use it?
FAQs and recipes
Changelog

What is Acora?

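Acora is 'fgrep' for Python, a fast multi-keyword text search engine.

Based on a set of keywords and the Aho-Corasick algorithm, it generates a search automaton and runs it over string input, either unicode or bytes.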
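
To make the idea of a search automaton concrete, here is a minimal textbook-style sketch of Aho-Corasick in pure Python. It is not Acora's actual implementation, and the names `build_automaton` and `find_all` are made up for this illustration. It assumes non-empty keywords and reports `(keyword, start)` pairs in the order in which matches end.

```python
from collections import deque

def build_automaton(keywords):
    """Build goto/fail/output tables for a set of non-empty keywords."""
    goto = [{}]      # goto[state][char] -> next state (state 0 is the root)
    fail = [0]       # fail[state] -> state of the longest proper suffix
    output = [[]]    # output[state] -> keywords that end in this state
    for kw in keywords:
        state = 0
        for ch in kw:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        if kw not in output[state]:
            output[state].append(kw)
    # Breadth-first pass: failure links of shallower states are complete
    # before deeper states need them.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the (shorter) keywords that end at the suffix state.
            output[nxt].extend(output[fail[nxt]])
    return goto, fail, output

def find_all(text, automaton):
    """Return every (keyword, start) match, overlapping ones included."""
    goto, fail, output = automaton
    state = 0
    matches = []
    for end, ch in enumerate(text, 1):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for kw in output[state]:
            matches.append((kw, end - len(kw)))
    return matches

automaton = build_automaton(['ab', 'bc', 'de', 'a', 'b'])
print(find_all('abc', automaton))
```

The text is scanned once, character by character, regardless of how many keywords are in the set. That property is what makes this family of algorithms attractive for multi-keyword search.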

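Acora comes with both a pure Python implementation and a fast binary module written in Cython. However, note that the current construction algorithm is not suitable for really large sets of keywords (i.e. more than a couple of thousand).

You can find the latest source code on GitHub.

To report a bug or request new features, use the GitHub bug tracker. Please try to provide a short test case that reproduces the problem without requiring too much experimentation or large amounts of data. The easier it is to reproduce the problem, the easier it is to solve it.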

Features

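- works with unicode strings and byte strings
- about 2-3x as fast as Python's regular expression engine for most input
- finds overlapping matches, i.e. all matches of all keywords
- support for case insensitive search (~10x as fast as 're')
- frees the GIL while searching
- additional (slow but short) pure Python implementation
- support for Python 2.5+ and 3.x
- support for searching in files
- permissive BSD license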
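
The "overlapping matches" point is easy to miss when coming from `re`. The following standard-library sketch contrasts a regular expression alternation, which only reports non-overlapping matches, with a naive scan that reports every occurrence. The helper `find_overlapping` is hypothetical and only serves as a slow reference. It is not part of Acora.

```python
import re

def find_overlapping(text, keywords):
    """Naive reference: every (keyword, start) pair, overlaps included."""
    found = []
    for kw in keywords:
        start = text.find(kw)
        while start != -1:
            found.append((kw, start))
            start = text.find(kw, start + 1)
    # Stable sort by start position.
    return sorted(found, key=lambda match: match[1])

keywords = ['ab', 'bc']
pattern = re.compile('|'.join(re.escape(kw) for kw in keywords))

# The regex engine resumes after the end of each match, so 'bc' is skipped.
regex_hits = [(m.group(), m.start()) for m in pattern.finditer('abc')]
all_hits = find_overlapping('abc', keywords)

print(regex_hits)  # [('ab', 0)]
print(all_hits)    # [('ab', 0), ('bc', 1)]
```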

How do I use it?

Import the package:

>>> from acora import AcoraBuilder

Collect some keywords:

>>> builder = AcoraBuilder('ab', 'bc', 'de')
>>> builder.add('a', 'b')

or:

>>> builder.update(['a', 'b'])  # new in version 2.0

Generate the Acora search engine for the current keyword set:

>>> ac = builder.build()

Search a string for all occurrences:

>>> ac.findall('abc')
[('a', 0), ('ab', 0), ('b', 1), ('bc', 1)]
>>> ac.findall('abde')
[('a', 0), ('ab', 0), ('b', 1), ('de', 2)]

Iterate over the search results as they come in:

>>> for kw, pos in ac.finditer('abde'):
...     print("%2s[%d]" % (kw, pos))
 a[0]
ab[0]
 b[1]
de[2]

Acora also has direct support for parsing files (in binary mode):

>>> keywords = ['Import', 'FAQ', 'Acora', 'NotHere'.upper()]

>>> builder = AcoraBuilder([s.encode('ascii') for s in keywords])
>>> ac = builder.build()

>>> found = set(kw for kw, pos in ac.filefind('README.rst'))
>>> len(found)
3

>>> sorted(str(s.decode('ascii')) for s in found)
['Acora', 'FAQ', 'Import']
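
For readers wondering what searching a file in binary mode involves, here is a rough standard-library sketch of chunked keyword search over a byte stream. It is a stand-in for illustration and not how `filefind` is implemented. The function name `find_in_stream` and the `chunk_size` parameter are assumptions of this example. Keywords are assumed to be non-empty byte strings, and results are not sorted by position.

```python
import io

def find_in_stream(stream, keywords, chunk_size=64 * 1024):
    """Yield (keyword, absolute_offset) for each occurrence in a binary stream."""
    # Carry over enough bytes to catch matches that span a chunk boundary.
    keep = max(len(kw) for kw in keywords) - 1
    tail = b''
    offset = 0  # absolute offset of the first byte of `tail`
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf = tail + chunk
        for kw in keywords:
            start = buf.find(kw)
            while start != -1:
                # Matches lying entirely inside `tail` were reported before.
                if start + len(kw) > len(tail):
                    yield kw, offset + start
                start = buf.find(kw, start + 1)
        new_tail = buf[max(0, len(buf) - keep):]
        offset += len(buf) - len(new_tail)
        tail = new_tail

data = io.BytesIO(b'xxImportxxFAQxx')
hits = sorted(find_in_stream(data, [b'Import', b'FAQ', b'NOTHERE'], chunk_size=4))
print(hits)  # [(b'FAQ', 10), (b'Import', 2)]
```

The tiny `chunk_size` in the usage lines is only there to force matches across chunk boundaries. A real caller would open the file with `open(path, 'rb')` and keep the default.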

FAQs and recipes

How do I run a greedy search for the longest matching keywords?

>>> builder = AcoraBuilder('a', 'ab', 'abc')
>>> ac = builder.build()

>>> for kw, pos in ac.finditer('abbabc'):
...     print(kw)
a
ab
a
ab
abc

>>> from itertools import groupby
>>> from operator import itemgetter

>>> def longest_match(matches):
...     for pos, match_set in groupby(matches, itemgetter(1)):
...         yield max(match_set)

>>> for kw, pos in longest_match(ac.finditer('abbabc')):
...     print(kw)
ab
abc

Note that this recipe assumes search terms that do not have inner overlaps apart from their prefix.
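
The reason for this restriction is that `groupby` only groups adjacent items, and `max` compares the `(keyword, position)` tuples by keyword string, which favours the longer keyword only when one is a prefix of the other. If that assumption does not hold for a keyword set, a buffering variant along the following lines may help. It is a sketch with a hypothetical name (`longest_nonoverlapping`) that operates on any iterable of `(keyword, position)` pairs. Unlike the recipe above, it has to consume all matches before producing output.

```python
def longest_nonoverlapping(matches):
    """Keep the longest keyword per start position and drop matches
    that start inside a previously selected match."""
    best = {}
    for kw, pos in matches:
        if pos not in best or len(kw) > len(best[pos]):
            best[pos] = kw
    selected = []
    covered_until = 0
    for pos in sorted(best):
        if pos < covered_until:
            continue  # starts inside an already selected match
        selected.append((best[pos], pos))
        covered_until = pos + len(best[pos])
    return selected

# Matches for position 0 are not adjacent here, which would split the groupby groups.
print(longest_nonoverlapping([('a', 0), ('b', 1), ('abc', 0)]))
```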

How do I parse line-by-line with arbitrary line endings?

>>> def group_by_lines(s, *keywords):
...     builder = AcoraBuilder('\r', '\n', *keywords)
...     ac = builder.build()
...
...     current_line_matches = []
...     last_ending = None
...
...     for kw, pos in ac.finditer(s):
...         if kw in '\r\n':
...             if last_ending == '\r' and kw == '\n':
...                 continue # combined CRLF
...             yield tuple(current_line_matches)
...             del current_line_matches[:]
...             last_ending = kw
...         else:
...             last_ending = None
...             current_line_matches.append(kw)
...     yield tuple(current_line_matches)

>>> kwds = ['ab', 'bc', 'de']
>>> for matches in group_by_lines('a\r\r\nbc\r\ndede\n\nab', *kwds):
...     print(matches)
()
()
('bc',)
('de', 'de')
()
('ab',)

How do I find whole lines that contain keywords, as fgrep does?

>>> def match_lines(s, *keywords):
...     builder = AcoraBuilder('\r', '\n', *keywords)
...     ac = builder.build()
...
...     line_start = 0
...     matches = False
...     for kw, pos in ac.finditer(s):
...         if kw in '\r\n':
...             if matches:
...                 yield s[line_start:pos]
...                 matches = False
...             line_start = pos + 1
...         else:
...             matches = True
...     if matches:
...         yield s[line_start:]

>>> kwds = ['x', 'de', '\nstart']
>>> text = 'a line with\r\r\nsome text\r\ndede\n\nab\n start 1\nstart\n'
>>> for line in match_lines(text, *kwds):
...     print(line)
some text
dede
start
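
When adapting this recipe, a slow but obvious baseline can be handy for cross-checking results on small inputs. The sketch below uses only `str.splitlines()` and substring tests. The function name and the `line_prefixes` parameter are inventions of this example. `line_prefixes` plays the role of keywords such as `'\nstart'` that are meant to match at the beginning of a line. Be aware that `splitlines()` recognises more line boundaries than just `\r` and `\n`, and that a prefix also matches on the very first line, which the `'\nstart'` keyword would not.

```python
def match_lines_reference(text, keywords, line_prefixes=()):
    """Return the lines that contain a keyword or start with a given prefix."""
    prefixes = tuple(line_prefixes)
    return [line for line in text.splitlines()
            if any(kw in line for kw in keywords)
            or line.startswith(prefixes)]

text = 'a line with\r\r\nsome text\r\ndede\n\nab\n start 1\nstart\n'
print(match_lines_reference(text, ['x', 'de'], ['start']))
```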

Changelog

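2.2 [2018-08-16]
- update to work with CPython 3.7 by building with Cython 0.29
2.1 [2017-12-15]
- fix handling of empty engines (GitHub issue 18)
2.0 [2016-03-17]
- rewrite of the construction algorithm to speed it up and save memory
1.9 [2015-10-10]
- recompiled with Cython 0.23.4 for better compatibility with recent Python versions
1.8 [2014-02-12]
- pickle support for the pre-built search engines
- performance optimisations in the builder
- Unicode parsing is optimised for Python 3.3 and later
- no longer recompiles sources when Cython is installed, unless the --with-cython option is passed to setup.py (requires Cython 0.20+)
- build failed with recent Cython versions
- built using Cython 0.20.1
1.7 [2011-08-24]
- searching binary strings for byte values > 127 was broken
- built using Cython 0.15+
1.6 [2011-07-24]
- substantially faster automaton building
- no longer includes the .hg repo in the source distribution
- built using Cython 0.15 (rc0)
1.5 [2011-01-24]
- Cython-compiled NFA-2-DFA construction runs substantially faster
- always build extension modules even if Cython is not installed
- --no-compile switch in setup.py to prevent extension module building
- built using Cython 0.14.1 (rc2)
1.4 [2009-02-10]
- minor speed-up in the inner search engine loop
- some code cleanup
- built using Cython 0.12.1 (final)
1.3 [2009-01-30]
- major fix for file search
- built using Cython 0.12.1 (beta0)
1.2 [2009-01-30]
- deep-copy support for the AcoraBuilder class
- doc/test fixes
- include the .hg repo in the source distribution
- built using Cython 0.12.1 (beta0)
1.1 [2009-01-29]
- doc updates
- some cleanup
- built using Cython 0.12.1 (beta0)
1.0 [2009-01-29]
- initial release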
