Provides a universal solution for crawler platforms. Read more: https://github.com/ClericPy/uniparser
uniparser
Provide a universal solution for crawlers.
Install
pip install uniparser -U
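Why?
- Reduces the amount of code duplicated across many similar crawlers and parsers. Don't Repeat Yourself.
- Makes the parsing process of different parsers persistent.
- Separates crawler code from the main application code, so adding a new crawler does not require redeploying the application.
- Provides a universal solution for crawler platforms.
- Summarizes the common string-parsing tools on the market.
- The web views are implemented as portable plug-ins, which means they can be mounted on other web applications as a sub_app: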
app.mount("/uniparser", uniparser_app)
Feature List
- Supports most of the popular parsers for HTML / XML / JSON / AnyString / Python objects:
  1. css (HTML)
     1. bs4
  2. xml
     1. lxml
  3. regex
  4. jsonpath
     1. jsonpath-rw-ext
  5. objectpath
     1. objectpath
  6. jmespath
     1. jmespath
  7. time
  8. loader
     1. json / yaml / toml
        1. toml
        2. pyyaml
  9. udf
     1. source code for exec & eval, which must be named **parse**
  10. python
      1. some common python methods: getitem, split, join...
  11. *waiting for new ones...*
- Request args persistence; supports curl string, single URL, dict, and JSON.
- A simple Web UI for generating and testing CrawlerRules.
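- Serializable JSON rule classes for saving the whole parsing process.
  - Each ParseRule / CrawlerRule / HostRule subclass can be dumped to JSON with json.dumps for persistence.
  - Therefore, they can also be loaded from a JSON string.
  - The nesting of rule names is treated as the result format. (A rule's own result is ignored if it has childs.)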
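- Rule classes
  - JsonSerializable is the base class for all rules.
    - The dumps method dumps self as a standard JSON string.
    - The loads classmethod loads self from a standard JSON string, which means the new object has the same methods as a rule.
  - ParseRule is the lowest level of a parse task; it describes how to parse an input object. Sometimes it also has a list of ParseRules as child rules.
    - The parse result is a dict with rule_name as the key and the result as the value.
  - CrawlerRule contains some ParseRules and has 3 attributes besides the rule name:
    - request_args tells the HTTP downloader how to send the request.
    - parse_rules is a list of ParseRules; the parse result format looks like {CrawlerRule_name: {ParseRule1['name']: ParseRule1's result, ParseRule2['name']: ParseRule2's result}}.
    - regex tells how to find the crawler_rule for a given URL.
  - HostRule contains a dict such as {CrawlerRule['name']: CrawlerRule}; its find method returns the matching CrawlerRule for a given URL.
  - JSONRuleStorage is a simple storage backend that saves the HostRules in a JSON file. It is not a good choice for production environments; redis / mysql / mongodb may help there.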
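- Uniparser is the central console for the whole crawling process. It handles the download middleware and the parse middleware. For detailed usage see uniparser.crawler.Crawler, or have a look at [Quick Start].
- For custom settings (such as the JSON loader), update uniparser.config.GlobalConfig.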
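The rule classes above can be pictured with a small standard-library sketch. It is only an illustration of the idea (JSON round-tripping plus a regex lookup by URL), not uniparser's actual implementation: the classes below are hypothetical stand-ins that merely borrow the names from the list, and the rule contents are trimmed-down sample data.

```python
import json
import re


class JsonSerializable(dict):
    """Hypothetical stand-in: a dict that round-trips through JSON."""

    def dumps(self) -> str:
        # Dump self as a standard JSON string.
        return json.dumps(self, sort_keys=True)

    @classmethod
    def loads(cls, text: str) -> "JsonSerializable":
        # Rebuild an object of the same class, so it keeps the rule methods.
        return cls(json.loads(text))


class CrawlerRule(JsonSerializable):
    def match(self, url: str) -> bool:
        # The "regex" attribute decides whether this rule handles the URL.
        return re.search(self["regex"], url) is not None


class HostRule(JsonSerializable):
    def find(self, url: str):
        # Return the first CrawlerRule whose regex matches the URL, else None.
        for raw_rule in self["crawler_rules"].values():
            rule = CrawlerRule(raw_rule)
            if rule.match(url):
                return rule
        return None


# Sample data only; real rules also carry request_args and parse_rules.
host_rule = HostRule(
    host="www.python.org",
    crawler_rules={
        "list": {"name": "list",
                 "regex": r"^https://www.python.org/dev/peps/$"},
        "detail": {"name": "detail",
                   "regex": r"^https://www.python.org/dev/peps/pep-\d+$"},
    },
)

# Persist to a JSON string and load it back.
restored = HostRule.loads(host_rule.dumps())
found = restored.find("https://www.python.org/dev/peps/pep-0001")
print(found["name"])  # detail
```

Because the rules are plain JSON, they can live in a file or a database and be reloaded without touching the application code.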
Quick Start
Mission: Crawl python Meta-PEPs
Besides the rules (which can be saved externally and loaded automatically), fewer than 25 lines of code are needed.
HostRules are saved at $HOME/host_rules.json by default, so there is no need to initialize them every time.
import asyncio

from uniparser import Crawler, JSONRuleStorage

# The HostRule is loaded from a JSON string; it could equally be loaded from a file.
crawler = Crawler(
    storage=JSONRuleStorage.loads(
        r'{"www.python.org": {"host": "www.python.org", "crawler_rules": {"main": {"name":"list","request_args":{"method":"get","url":"https://www.python.org/dev/peps/","headers":{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}},"parse_rules":[{"name":"__request__","chain_rules":[["css","#index-by-category #meta-peps-peps-about-peps-or-processes td.num>a","@href"],["re","^/","@https://www.python.org/"],["python","getitem","[:3]"]],"childs":""}],"regex":"^https://www.python.org/dev/peps/$","encoding":""}, "subs": {"name":"detail","request_args":{"method":"get","url":"https://www.python.org/dev/peps/pep-0001/","headers":{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}},"parse_rules":[{"name":"title","chain_rules":[["css","h1.page-title","$text"],["python","getitem","[0]"]],"childs":""}],"regex":"^https://www.python.org/dev/peps/pep-\\d+$","encoding":""}}}}'
    ))

expected_result = {
    'list': {
        '__request__': [
            'https://www.python.org/dev/peps/pep-0001',
            'https://www.python.org/dev/peps/pep-0004',
            'https://www.python.org/dev/peps/pep-0005'
        ],
        '__result__': [
            {'detail': {'title': 'PEP 1 -- PEP Purpose and Guidelines'}},
            {'detail': {'title': 'PEP 4 -- Deprecation of Standard Modules'}},
            {'detail': {'title': 'PEP 5 -- Guidelines for Language Evolution'}}
        ]
    }
}


def test_sync_crawler():
    # Blocking call: crawl the list page, then the URLs found in __request__.
    result = crawler.crawl('https://www.python.org/dev/peps/')
    print('sync result:', result)
    assert result == expected_result


def test_async_crawler():

    async def _test():
        # Same crawl through the coroutine API.
        result = await crawler.acrawl('https://www.python.org/dev/peps/')
        print('async result:', result)
        assert result == expected_result

    asyncio.run(_test())


test_sync_crawler()
test_async_crawler()
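The chain_rules in the `__request__` rule above read as a pipeline: each `[parser, param, value]` triple transforms the output of the previous step. The sketch below imitates the last two steps with the standard library only. It is an assumption-laden illustration, not uniparser's parsers: `re_step`, `python_step`, `STEPS` and `run_chain` are hypothetical names, the meaning of the `@` prefix (substitute with the rest of the string) is inferred from the expected result, and the input hrefs are sample data standing in for what the css step would return.

```python
import re


def re_step(value, pattern, arg):
    # Assumption: an "@"-prefixed argument means "substitute with the rest".
    if not arg.startswith("@"):
        raise ValueError("this sketch only handles '@replacement' arguments")
    replacement = arg[1:]
    if isinstance(value, list):
        return [re.sub(pattern, replacement, item) for item in value]
    return re.sub(pattern, replacement, value)


def python_step(value, method, arg):
    # Only "getitem" with "[i]" or "[start:stop:step]" is sketched here.
    if method != "getitem":
        raise ValueError("this sketch only handles 'getitem'")
    inner = arg.strip("[]")
    if ":" in inner:
        bounds = [int(part) if part else None for part in inner.split(":")]
        return value[slice(*bounds)]
    return value[int(inner)]


# Hypothetical registry mapping parser names to step functions.
STEPS = {"re": re_step, "python": python_step}


def run_chain(value, chain_rules):
    # Feed the output of each step into the next one.
    for parser_name, param, arg in chain_rules:
        value = STEPS[parser_name](value, param, arg)
    return value


# Sample input standing in for the hrefs selected by the css step.
hrefs = [
    "/dev/peps/pep-0001",
    "/dev/peps/pep-0004",
    "/dev/peps/pep-0005",
    "/dev/peps/pep-0006",
]
chain_rules = [
    ["re", "^/", "@https://www.python.org/"],
    ["python", "getitem", "[:3]"],
]
urls = run_chain(hrefs, chain_rules)
print(urls)
```

With this sample input, `urls` holds the three absolute URLs listed under `__request__` in `expected_result`.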
Uniparser Rule Test Console (Web UI)
- pip install bottle uniparser
- python -m uniparser 8080
- open browser => http://127.0.0.1:8080/
The result is shown as repr(result):
{'HelloWorld': {'rule1-get-first-p': 'Customer name: ', 'rule2-get-legends': [' Pizza Size ', ' Pizza Toppings ']}}
As we can see, the CrawlerRule's name is the root key, and the ParseRules' names are the other keys.
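How rule names turn into the shape of the result can be sketched in a few lines. This is a simplified model rather than uniparser's code: the `func` key, the `parse` and `crawl_result` helpers and the sample rules are all hypothetical, and it assumes that a rule with childs passes its own result to them as input before discarding it.

```python
def parse(rule, input_object):
    """Return {rule_name: result}; childs replace the rule's own result."""
    result = rule["func"](input_object)
    childs = rule.get("childs") or []
    if childs:
        # Assumption: the parent's result becomes the input of each child,
        # and only the children's results are kept.
        nested = {}
        for child in childs:
            nested.update(parse(child, result))
        return {rule["name"]: nested}
    return {rule["name"]: result}


def crawl_result(crawler_name, parse_rules, input_object):
    """Wrap all parse results under the crawler rule's name (the root key)."""
    merged = {}
    for rule in parse_rules:
        merged.update(parse(rule, input_object))
    return {crawler_name: merged}


text = "title=Hello;tags=a,b"
rules = [
    {"name": "title", "func": lambda s: s.split(";")[0].split("=")[1]},
    {
        "name": "tags",
        "func": lambda s: s.split(";")[1].split("=")[1],
        "childs": [
            {"name": "first", "func": lambda s: s.split(",")[0]},
            {"name": "count", "func": lambda s: len(s.split(","))},
        ],
    },
]
result = crawl_result("demo", rules, text)
print(result)
# {'demo': {'title': 'Hello', 'tags': {'first': 'a', 'count': 2}}}
```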
Async environment usage: FastAPI
import uvicorn
from uniparser.fastapi_ui import app

if __name__ == "__main__":
    uvicorn.run(app, port=8080)
    # http://127.0.0.1:8080
Or FastAPI sub-app usage
import uvicorn
from fastapi import FastAPI
from uniparser.fastapi_ui import app as sub_app

app = FastAPI()

# Mount the uniparser web UI under /uniparser of the main application.
app.mount('/uniparser', sub_app)

if __name__ == "__main__":
    uvicorn.run(app, port=8080)
    # http://127.0.0.1:8080/uniparser/
More Usage
Some demos: click the dropdown buttons at the top of the Web UI
Test code: test_parsers.py
Advanced usage: create crawler rules for watchdogs
Generate parsers doc
from uniparser import Uniparser

for i in Uniparser().parsers:
    print(f'## {i.__class__.__name__} ({i.name})\n\n```\n{i.doc}\n```')
Benchmark
Compare parsers and choose a faster one
css: 2558 calls/sec, ['<a class="url" href="/">title</a>','a.url','@href']
css: 2491 calls/sec, ['<a class="url" href="/">title</a>','a.url','$text']
css: 2385 calls/sec, ['<a class="url" href="/">title</a>','a.url','$innerHTML']
css: 2495 calls/sec, ['<a class="url" href="/">title</a>','a.url','$html']
css: 2296 calls/sec, ['<a class="url" href="/">title</a>','a.url','$outerHTML']
css: 2182 calls/sec, ['<a class="url" href="/">title</a>','a.url','$string']
css: 2130 calls/sec, ['<a class="url" href="/">title</a>','a.url','$self']
=================================================================================
css1: 2525 calls/sec, ['<a class="url" href="/">title</a>','a.url','@href']
css1: 2402 calls/sec, ['<a class="url" href="/">title</a>','a.url','$text']
css1: 2321 calls/sec, ['<a class="url" href="/">title</a>','a.url','$innerHTML']
css1: 2256 calls/sec, ['<a class="url" href="/">title</a>','a.url','$html']
css1: 2122 calls/sec, ['<a class="url" href="/">title</a>','a.url','$outerHTML']
css1: 2142 calls/sec, ['<a class="url" href="/">title</a>','a.url','$string']
css1: 2483 calls/sec, ['<a class="url" href="/">title</a>','a.url','$self']
=================================================================================
selectolax: 15187 calls/sec, ['<a class="url" href="/">title</a>','a.url','@href']
selectolax: 19164 calls/sec, ['<a class="url" href="/">title</a>','a.url','$text']
selectolax: 19699 calls/sec, ['<a class="url" href="/">title</a>','a.url','$html']
selectolax: 20659 calls/sec, ['<a class="url" href="/">title</a>','a.url','$outerHTML']
selectolax: 20369 calls/sec, ['<a class="url" href="/">title</a>','a.url','$self']
=================================================================================
selectolax1: 17572 calls/sec, ['<a class="url" href="/">title</a>','a.url','@href']
selectolax1: 19096 calls/sec, ['<a class="url" href="/">title</a>','a.url','$text']
selectolax1: 17997 calls/sec, ['<a class="url" href="/">title</a>','a.url','$html']
selectolax1: 18100 calls/sec, ['<a class="url" href="/">title</a>','a.url','$outerHTML']
selectolax1: 19137 calls/sec, ['<a class="url" href="/">title</a>','a.url','$self']
=================================================================================
xml: 3171 calls/sec, ['<dc:creator><![CDATA[author]]></dc:creator>','creator','$text']
=================================================================================
re: 220240 calls/sec, ['a a b b c c','a|c','@b']
re: 334206 calls/sec, ['a a b b c c','a','']
re: 199572 calls/sec, ['a a b b c c','a (a b)','$0']
re: 203122 calls/sec, ['a a b b c c','a (a b)','$1']
re: 256544 calls/sec, ['a a b b c c','b','-']
=================================================================================
jsonpath: 28 calls/sec, [{'a':{'b':{'c':1}}},'$..c','']
=================================================================================
objectpath: 42331 calls/sec, [{'a':{'b':{'c':1}}},'$..c','']
=================================================================================
jmespath: 95449 calls/sec, [{'a':{'b':{'c':1}}},'a.b.c','']
=================================================================================
udf: 58236 calls/sec, ['a b c d','input_object[::-1]','']
udf: 64846 calls/sec, ['a b c d','context["key"]',{'key':'value'}]
udf: 55169 calls/sec, ['a b c d','md5(input_object)','']
udf: 45388 calls/sec, ['["string"]','json_loads(input_object)','']
udf: 50741 calls/sec, ['["string"]','json_loads(obj)','']
udf: 48974 calls/sec, [['string'],'json_dumps(input_object)','']
udf: 41670 calls/sec, ['a b c d','parse = lambda input_object: input_object','']
udf: 31930 calls/sec, ['a b c d','def parse(input_object): context["key"]="new";return context',{'key':'new'}]
=================================================================================
python: 383293 calls/sec, [[1,2,3],'getitem','[-1]']
python: 350290 calls/sec, [[1,2,3],'getitem','[:2]']
python: 325668 calls/sec, ['abc','getitem','[::-1]']
python: 634737 calls/sec, [{'a':'1'},'getitem','a']
python: 654257 calls/sec, [{'a':'1'},'get','a']
python: 642111 calls/sec, ['a b\tc \n\td','split','']
python: 674048 calls/sec, [['a','b','c','d'],'join','']
python: 478239 calls/sec, [['aaa',['b'],['c','d']],'chain','']
python: 191430 calls/sec, ['python','template','1 $input_object 2']
python: 556022 calls/sec, [[1],'index','0']
python: 474540 calls/sec, ['python','index','-1']
python: 619489 calls/sec, [{'a':'1'},'index','a']
python: 457317 calls/sec, ['adcb','sort','']
python: 494608 calls/sec, [[1,3,2,4],'sort','desc']
python: 581480 calls/sec, ['aabbcc','strip','a']
python: 419745 calls/sec, ['aabbcc','strip','ac']
python: 615518 calls/sec, [' \t a ','strip','']
python: 632536 calls/sec, ['a','default','b']
python: 655448 calls/sec, ['','default','b']
python: 654189 calls/sec, [' ','default','b']
python: 373153 calls/sec, ['a','base64_encode','']
python: 339589 calls/sec, ['YQ==','base64_decode','']
python: 495246 calls/sec, ['a','0','b']
python: 358796 calls/sec, ['','0','b']
python: 356988 calls/sec, [None,'0','b']
python: 532092 calls/sec, [{0:'a'},'0','a']
=================================================================================
loader: 159737 calls/sec, ['{"a": "b"}','json','']
loader: 38540 calls/sec, ['a = "a"','toml','']
loader: 3972 calls/sec, ['animal: pets','yaml','']
loader: 461297 calls/sec, ['a','b64encode','']
loader: 412507 calls/sec, ['YQ==','b64decode','']
=================================================================================
time: 39241 calls/sec, ['2020-02-03 20:29:45','encode','']
time: 83251 calls/sec, ['1580732985.1873155','decode','']
time: 48469 calls/sec, ['2020-02-03T20:29:45','encode','%Y-%m-%dT%H:%M:%S']
time: 74481 calls/sec, ['1580732985.1873155','decode','%b %d %Y %H:%M:%S']
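A calls/sec figure like the ones above can be approximated with a simple timing loop. The sketch below is not the project's benchmark script: `calls_per_sec` is a hypothetical helper, and the two candidates are standard-library stand-ins chosen only so the example runs without extra dependencies. Absolute numbers depend on the machine, so none are shown.

```python
import re
import time


def calls_per_sec(func, *args, duration=0.05):
    """Call func(*args) repeatedly for about `duration` seconds and return the rate."""
    calls = 0
    start = time.perf_counter()
    deadline = start + duration
    while time.perf_counter() < deadline:
        func(*args)
        calls += 1
    elapsed = time.perf_counter() - start
    return calls / elapsed


# Two ways to look for "a" in the same input string.
candidates = {
    "re.findall": (re.findall, ("a", "a a b b c c")),
    "str.count": (str.count, ("a a b b c c", "a")),
}

rates = {}
for name, (func, args) in candidates.items():
    rates[name] = calls_per_sec(func, *args)
    print(f"{name}: {rates[name]:.0f} calls/sec, {list(args)}")

fastest = max(rates, key=rates.get)
print("fastest:", fastest)
```

Running each candidate on the same input is what makes the rates comparable; the benchmark above follows the same pattern by reusing one input across several parameters.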
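Tasks
- [x] Release to pypi.org
- [x] Upload dist with Web UI
- [x] Add GitHub Actions for testing the package
- [x] Web UI for testing rules
- [x] Complete the documentation in detail
- [x] Compare each parser's performance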