PySubstringSearch: a Python library written in C++ for fast substring/pattern search using a suffix array algorithm
About the Project
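PySubstringSearch is a library for searching for substring patterns within an index file. It is written in C++ for speed and efficiency, and it uses the Msufsort suffix array construction library for string indexing. The index it creates consists of the original text and a 32-bit suffix array structure. The library relies on a proprietary container protocol that stores the original text together with the index in chunks of 512 MB, which works around a limitation of the suffix array construction implementation.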
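To make the idea of a suffix array index concrete, here is a minimal pure-Python sketch. It is not the library's C++ implementation and does not use Msufsort: the function names `build_suffix_array` and `find_occurrences` are hypothetical, and the construction is a naive sort chosen for readability rather than speed.

```python
from __future__ import annotations


def build_suffix_array(text: str) -> list[int]:
    """Return the start offsets of all suffixes of `text`, sorted lexicographically.

    Naive construction for illustration only; dedicated libraries such as
    Msufsort build the same structure far more efficiently.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, suffix_array: list[int], pattern: str) -> list[int]:
    """Return the sorted start offsets of `pattern` in `text`.

    All suffixes that begin with `pattern` form one contiguous run in the
    suffix array, so two binary searches locate the run's boundaries.
    """
    m = len(pattern)
    if m == 0:
        return []

    def prefix(rank: int) -> str:
        start = suffix_array[rank]
        return text[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


text = "some short string\nanother but now a longer string\nmore text to add"
suffix_array = build_suffix_array(text)
offsets = find_occurrences(text, suffix_array, "string")
```

Each lookup costs a number of string comparisons logarithmic in the text length, regardless of how many entries the text holds. That is the general reason a prebuilt index can answer queries much faster than a scan of the raw text.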
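The module implements two methods: search_sequential and search_parallel. search_sequential searches the inner chunks one by one, whereas search_parallel searches them concurrently. When dealing with big indices, larger than 1 GB, search_parallel runs faster. I advise checking both against your resulting index to find out which one fits better.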
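The difference between the two strategies can be sketched in plain Python. This is an illustration under assumptions, not the library's code: `scan_chunk`, `scan_chunks_sequentially`, and `scan_chunks_in_parallel` are hypothetical names, each chunk is modeled as a list of entries, and a linear scan stands in for the per-chunk suffix array lookup.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor


def scan_chunk(chunk: list[str], pattern: str) -> list[str]:
    """Stand-in for a per-chunk index lookup: return entries containing `pattern`."""
    return [entry for entry in chunk if pattern in entry]


def scan_chunks_sequentially(chunks: list[list[str]], pattern: str) -> list[str]:
    """Search the chunks one after another and concatenate the results."""
    results: list[str] = []
    for chunk in chunks:
        results.extend(scan_chunk(chunk, pattern))
    return results


def scan_chunks_in_parallel(chunks: list[list[str]], pattern: str) -> list[str]:
    """Submit one search task per chunk to a thread pool, then merge in chunk order."""
    with ThreadPoolExecutor() as pool:
        # Executor.map yields results in the order of its inputs.
        per_chunk = pool.map(scan_chunk, chunks, [pattern] * len(chunks))
        return [entry for chunk_result in per_chunk for entry in chunk_result]


chunks = [
    ["some short string"],
    ["another but now a longer string", "more text to add"],
]
sequential_hits = scan_chunks_sequentially(chunks, "string")
parallel_hits = scan_chunks_in_parallel(chunks, "string")
```

Both functions return the same entries. The parallel variant pays a fixed cost to dispatch and collect tasks, so it only wins when there are enough chunks for that cost to be amortized. Note that pure-Python threads do not speed up CPU-bound work like this scan; the sketch shows the structure of the two strategies, not their relative performance.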
Performance
Library | Text Size | Function | Time | #Results | Improvement Factor |
---|---|---|---|---|---|
ripgrepy | 500mb | Ripgrepy('text_one', '500mb').run().as_string.split('\n') | 127 ms ± 694 µs per loop | 12553 | 1.0x |
PySubstringSearch | 500mb | reader.search_sequential('text_one') | 2.48 ms ± 53.4 µs per loop | 12553 | 51.2x |
PySubstringSearch | 500mb | reader.search_parallel('text_one') | 3.78 ms ± 350 µs per loop | 12553 | 33.6x |
ripgrepy | 500mb | Ripgrepy('text_two', '500mb').run().as_string.split('\n') | 127 ms ± 623 µs per loop | 769 | 1.0x |
PySubstringSearch | 500mb | reader.search_sequential('text_two') | 156 µs ± 916 ns per loop | 769 | 814.0x |
PySubstringSearch | 500mb | reader.search_parallel('text_two') | 251 µs ± 80.2 µs per loop | 769 | 506.0x |
ripgrepy | 6gb | Ripgrepy('text_one', '6gb').run().as_string.split('\n') | 1.38 s ± 3.82 ms per loop | 206884 | 1.0x |
PySubstringSearch | 6gb | reader.search_sequential('text_one') | 93.7 ms ± 2.16 ms per loop | 206884 | 15.3x |
PySubstringSearch | 6gb | reader.search_parallel('text_one') | 34.3 ms ± 321 µs per loop | 206884 | 40.5x |
ripgrepy | 6gb | Ripgrepy('text_two', '6gb').run().as_string.split('\n') | 1.61 s ± 37.2 ms per loop | 6921 | 1.0x |
PySubstringSearch | 6gb | reader.search_sequential('text_two') | 2.22 ms ± 79.3 µs per loop | 6921 | 725.2x |
PySubstringSearch | 6gb | reader.search_parallel('text_two') | 1.38 ms ± 26 µs per loop | 6921 | 1166.6x |
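To run a comparable measurement on your own index, something like the following sketch could be used. The helpers `time_call` and `improvement_factor` are hypothetical and built only on the standard `timeit` module. They report the best observed time per call, whereas the table above shows a mean with a deviation, so the numbers are not directly interchangeable.

```python
import timeit


def time_call(func, *args, repeat=5, number=100):
    """Return the best observed time per call of func(*args), in seconds."""
    timer = timeit.Timer(lambda: func(*args))
    # Each repetition runs `number` calls; keep the fastest and normalize per call.
    return min(timer.repeat(repeat=repeat, number=number)) / number


def improvement_factor(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Self-contained demonstration with a built-in function.
seconds_per_call = time_call(sorted, [3, 1, 2], repeat=3, number=1000)

# First PySubstringSearch row of the table: 127 ms baseline versus 2.48 ms.
factor = round(improvement_factor(127e-3, 2.48e-3), 1)
```

With the library installed and an index opened as `reader`, the same helper could be called as `time_call(reader.search_sequential, 'text_one')` and `time_call(reader.search_parallel, 'text_one')` to see which method suits a given index.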
Prerequisites
To compile this package, GCC and the Python development package must be installed.
- Fedora
sudo dnf install python3-devel gcc-c++
- Ubuntu 18.04
Installation
pip3 install PySubstringSearch
Usage
Create an index
import pysubstringsearch

# create a new index file
# if a file with this name already exists, it will be overwritten
writer = pysubstringsearch.Writer(
    index_file_path='output.idx',
)

# add entries to the new index
writer.add_entry('some short string')
writer.add_entry('another but now a longer string')
writer.add_entry('more text to add')

# make sure the data is dumped to the file
writer.finalize()
Search for a substring within an index
import pysubstringsearch

# open an index file for searching
reader = pysubstringsearch.Reader(
    index_file_path='output.idx',
)

# look up a substring sequentially
reader.search_sequential('short')
# returns ['some short string']

# look up a substring sequentially
reader.search_sequential('string')
# returns ['some short string', 'another but now a longer string']

# look up a substring concurrently
reader.search_parallel('short')
# returns ['some short string']

# look up a substring concurrently
reader.search_parallel('string')
# returns ['some short string', 'another but now a longer string']
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Gal Ben David - gal@intsights.com
Project link: https://github.com/Intsights/PySubstringSearch