PySubstringSearch Python package - PyPI

A fast Python library for substring/pattern search, written in C++ and built on a suffix array algorithm

About The Project

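PySubstringSearch is a library for searching for substring patterns within an index file. It is written in C++ for speed and efficiency, and it uses the Msufsort suffix array construction library for string indexing. The index it creates consists of the original text and a 32-bit suffix array structure. To work around the limitations of the suffix array construction implementation, the library relies on a proprietary container protocol that holds the original text together with the index in chunks of 512 MB.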
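To make the idea of "original text plus suffix array" concrete, here is a minimal pure-Python sketch. It is not the library's implementation (which uses Msufsort in C++ and its own container format); the function names, the naive sorting-based construction, and the use of a newline as the entry separator are all assumptions made for illustration.

```python
def build_suffix_array(text: str) -> list[int]:
    # Naive construction: sort every suffix start offset by the suffix it begins.
    # Real construction libraries are far faster; this only shows the structure.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, suffix_array: list[int], pattern: str) -> list[int]:
    """Return the sorted start offsets of every occurrence of pattern in text."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])


def entry_containing(text: str, offset: int) -> str:
    # Expand a match offset to the newline-delimited entry around it
    # (the newline separator is an assumption of this sketch).
    start = text.rfind('\n', 0, offset) + 1
    end = text.find('\n', offset)
    return text[start:end if end != -1 else len(text)]


entries = ['some short string', 'another but now a longer string', 'more text to add']
text = '\n'.join(entries)
suffix_array = build_suffix_array(text)
offsets = find_occurrences(text, suffix_array, 'string')
print([entry_containing(text, o) for o in offsets])
# ['some short string', 'another but now a longer string']
```

Because all suffixes that start with the pattern sit next to each other in the sorted array, two binary searches are enough to find every match without scanning the text.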

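The module implements two methods: search_sequential and search_parallel. search_sequential searches the inner chunks one by one, whereas search_parallel searches them concurrently. When dealing with large indexes, bigger than 1 GB, search_parallel should run faster. I advise checking both against the resulting index to find out which one fits better.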
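The difference between the two strategies can be sketched in a few lines of Python. This is only an illustration of the control flow, not the library's code: the chunk representation (a list of entries per chunk), the linear scan standing in for a per-chunk index lookup, and the function names are all hypothetical. Python threads also do not speed up CPU-bound work the way native threads can, so no timing conclusions should be drawn from it.

```python
from concurrent.futures import ThreadPoolExecutor


def search_chunk(chunk: list[str], pattern: str) -> list[str]:
    # Stand-in for a per-chunk index lookup: a plain linear scan.
    return [entry for entry in chunk if pattern in entry]


def search_chunks_sequentially(chunks: list[list[str]], pattern: str) -> list[str]:
    # Visit the chunks one after another and concatenate the matches.
    results = []
    for chunk in chunks:
        results.extend(search_chunk(chunk, pattern))
    return results


def search_chunks_in_parallel(chunks: list[list[str]], pattern: str) -> list[str]:
    # Submit one lookup per chunk; map() yields results in chunk order.
    with ThreadPoolExecutor() as pool:
        per_chunk = pool.map(search_chunk, chunks, [pattern] * len(chunks))
        return [entry for found in per_chunk for entry in found]


chunks = [
    ['some short string'],
    ['another but now a longer string', 'more text to add'],
]
print(search_chunks_sequentially(chunks, 'string'))
print(search_chunks_in_parallel(chunks, 'string'))
```

Parallel dispatch carries a fixed overhead per search, which is one plausible reason the sequential method can win on small indexes while the parallel one wins on large ones.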

Built With

Msufsort

Performance

Library	Text Size	Function	Time	#Results	Improvement Factor
ripgrepy	500mb	Ripgrepy('text_one', '500mb').run().as_string.split('\n')	127 ms ± 694 µs per loop	12553	1.0x
PySubstringSearch	500mb	reader.search_sequential('text_one')	2.48 ms ± 53.4 µs per loop	12553	51.2x
PySubstringSearch	500mb	reader.search_parallel('text_one')	3.78 ms ± 350 µs per loop	12553	33.6x
ripgrepy	500mb	Ripgrepy('text_two', '500mb').run().as_string.split('\n')	127 ms ± 623 µs per loop	769	1.0x
PySubstringSearch	500mb	reader.search_sequential('text_two')	156 µs ± 916 ns per loop	769	814.0x
PySubstringSearch	500mb	reader.search_parallel('text_two')	251 µs ± 80.2 µs per loop	769	506.0x
ripgrepy	6gb	Ripgrepy('text_one', '6gb').run().as_string.split('\n')	1.38 s ± 3.82 ms per loop	206884	1.0x
PySubstringSearch	6gb	reader.search_sequential('text_one')	93.7 ms ± 2.16 ms per loop	206884	15.3x
PySubstringSearch	6gb	reader.search_parallel('text_one')	34.3 ms ± 321 µs per loop	206884	40.5x
ripgrepy	6gb	Ripgrepy('text_two', '6gb').run().as_string.split('\n')	1.61 s ± 37.2 ms per loop	6921	1.0x
PySubstringSearch	6gb	reader.search_sequential('text_two')	2.22 ms ± 79.3 µs per loop	6921	725.2x
PySubstringSearch	6gb	reader.search_parallel('text_two')	1.38 ms ± 26 µs per loop	6921	1166.6x
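The Improvement Factor column appears to be the ripgrepy mean time divided by the PySubstringSearch mean time for the same text and pattern. The short sketch below recomputes two of the factors from the rounded timings in the table; the helper name is hypothetical, and because the table's timings are already rounded, a recomputed factor may differ from the published one in the last digit.

```python
def improvement_factor(baseline_seconds: float, candidate_seconds: float) -> float:
    # How many times faster the candidate is than the baseline.
    return baseline_seconds / candidate_seconds


# 500mb, 'text_one': ripgrepy 127 ms vs search_sequential 2.48 ms
print(round(improvement_factor(127e-3, 2.48e-3), 1))  # 51.2

# 6gb, 'text_two': ripgrepy 1.61 s vs search_sequential 2.22 ms
print(round(improvement_factor(1.61, 2.22e-3), 1))  # 725.2
```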

Prerequisites

In order to compile this package, GCC and the Python development package should be installed.

Fedora

sudo dnf install python3-devel gcc-c++

Ubuntu 18.04

^{pr2}$

Installation

pip3 install PySubstringSearch

Usage

Create an index

import pysubstringsearch

# create a new index file
# if a file with this name already exists, it will be overwritten
writer = pysubstringsearch.Writer(
    index_file_path='output.idx',
)

# add entries to the new index
writer.add_entry('some short string')
writer.add_entry('another but now a longer string')
writer.add_entry('more text to add')

# make sure the data is dumped to the file
writer.finalize()

Search a substring within an index

import pysubstringsearch

# open an index file for searching
reader = pysubstringsearch.Reader(
    index_file_path='output.idx',
)

# look up a substring sequentially
reader.search_sequential('short')
# ['some short string']

reader.search_sequential('string')
# ['some short string', 'another but now a longer string']

# look up a substring concurrently
reader.search_parallel('short')
# ['some short string']

reader.search_parallel('string')
# ['some short string', 'another but now a longer string']
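Since the two search methods return the same results and differ only in speed, a small wrapper can pick one based on the size of the index. The sketch below is an assumption-laden convenience, not part of the library: the function name and the hard 1 GB cutoff are hypothetical (the cutoff is borrowed from the guidance that search_parallel tends to win on indexes bigger than 1 GB), and measuring both methods on your own index remains the more reliable approach. The reader argument is any object exposing search_sequential and search_parallel, such as a pysubstringsearch.Reader.

```python
ONE_GB = 1024 ** 3  # assumed cutoff, taken from the "bigger than 1 GB" guidance


def search(reader, index_size_bytes: int, pattern: str) -> list[str]:
    # Large indexes span several chunks, so searching them concurrently
    # is more likely to pay off; small ones avoid the dispatch overhead.
    if index_size_bytes > ONE_GB:
        return reader.search_parallel(pattern)
    return reader.search_sequential(pattern)


# Hypothetical usage with the index created earlier:
#   import os
#   import pysubstringsearch
#   reader = pysubstringsearch.Reader(index_file_path='output.idx')
#   search(reader, os.path.getsize('output.idx'), 'short')
```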

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Gal Ben David - gal@intsights.com

Project Link: https://github.com/Intsights/PySubstringSearch

PySubstringSearch 0.3.0
