Provide a universal solution for crawler platforms. Read more: https://github.com/ClericPy/uniparser.

Detailed description of the uniparser Python project


uniparser


Provide a universal solution for crawlers.

Install

pip install uniparser -U

Why?

  1. Reduce the amount of code duplicated across similar crawlers and parsers. Don't Repeat Yourself.
  2. Make the parsing process of different parsers persistent.
  3. Separate the crawler code from the main application code, so adding a new crawler does not require redeploying the application.
  4. Provide a universal solution for crawler platforms.
  5. Summarize the common string-parsing tools on the market.
  6. The web view is implemented as a pluggable and portable component, which means it can be mounted on another web application as a sub_app:
    1. app.mount("/uniparser", uniparser_app)

Feature List

  1. Support most of the popular parsers for HTML / XML / JSON / any string / Python objects

    1. Parser docs

    2. Supported parsers:
        1. css (HTML)
            1. bs4
        2. xml
            1. lxml
        3. regex
        4. jsonpath
            1. jsonpath-rw-ext
        5. objectpath
            1. objectpath
        6. jmespath
            1. jmespath
        7. time
        8. loader
            1. json / yaml / toml
                1. toml
                2. pyyaml
        9. udf
            1. source code for exec & eval, which is named **parse**
        10. python
            1. some common Python methods: getitem, split, join...
        11. *waiting for new ones...*
      
  2. Request args persistence: supports curl strings, single URLs, dicts, and JSON.

  3. A simple Web UI for generating and testing CrawlerRule.

  4. Serializable JSON rule classes for saving the whole parsing process.

    1. Each ParseRule / CrawlerRule / HostRule subclass can be json.dumps'ed to JSON for persistence.
    2. Therefore, they can also be loaded back from a JSON string.
    3. The nesting of rule names is treated as the result format. (If a rule has child rules, its own result is ignored.)
  5. Rule classes (see the sketch after this list)

    1. JsonSerializable is the base class for all rules.
      1. The dumps method can dump self as a standard JSON string.
      2. The loads classmethod can load self from a standard JSON string, which means the new object will have the methods of a rule.
    2. ParseRule is the lowest level of a parsing task; it describes how to parse an input object. Sometimes it also has a list of child ParseRules.
      1. The parse result is a dict with rule_name as the key and the result as the value.
    3. CrawlerRule contains some ParseRules and, besides the rule name, has 3 attributes:
      1. request_args tells the HTTP downloader how to send the request.
      2. parse_rules is a list of ParseRule, and the parse result format looks like {CrawlerRule_name: {ParseRule1['name']: ParseRule1's result, ParseRule2['name']: ParseRule2's result}}.
      3. regex tells how to find the CrawlerRule for a given url.
    4. HostRule contains a dict like {CrawlerRule['name']: CrawlerRule}; its find method returns the matching CrawlerRule for a given url.
    5. JSONRuleStorage is a simple storage backend that saves HostRules in a JSON file. It is not a good choice for production; redis / mysql / mongodb may help there.
  6. Uniparser is the central console of the whole crawling process. It handles the download middleware and the parse middleware. For detailed usage, see uniparser.crawler.Crawler, or start from the Quick Start below.

  7. For custom settings, such as the JSON loader, update uniparser.config.GlobalConfig.
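
As a concrete illustration of the rule classes above, here is a minimal sketch of the dumps / loads / find round-trip. It relies only on the behavior described in this list and on the JSON shape used in the Quick Start below; the top-level imports of CrawlerRule and HostRule, the omission of optional request headers, and the None check on find are assumptions, so treat it as a sketch rather than the library's canonical usage.

from uniparser import CrawlerRule, HostRule

# A CrawlerRule serialized as JSON (same shape as the rules in the Quick Start):
# a name, request_args, a list of parse_rules, and a regex for url matching.
crawler_rule_json = r'''{
  "name": "detail",
  "request_args": {"method": "get", "url": "https://www.python.org/dev/peps/pep-0001/"},
  "parse_rules": [{"name": "title",
                   "chain_rules": [["css", "h1.page-title", "$text"],
                                   ["python", "getitem", "[0]"]],
                   "childs": ""}],
  "regex": "^https://www.python.org/dev/peps/pep-\\d+$",
  "encoding": ""
}'''

# loads() rebuilds a rule object from a standard JSON string ...
crawler_rule = CrawlerRule.loads(crawler_rule_json)
# ... and dumps() serializes it back for persistence (file, database, ...).
print(crawler_rule.dumps())

# A HostRule maps rule names to CrawlerRules for one host; find() returns the
# CrawlerRule whose regex matches the given url.
host_rule = HostRule.loads(
    '{"host": "www.python.org", "crawler_rules": {"detail": ' + crawler_rule_json + '}}')
matched = host_rule.find('https://www.python.org/dev/peps/pep-0001')
print(matched['name'] if matched else 'no rule matched')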

Quick Start

Mission: Crawl Python Meta-PEPs

Fewer than 25 lines of code are needed, besides the rules (which can be stored externally and loaded automatically).

HostRules are saved to $HOME/host_rules.json by default, so there is no need to initialize them every time.

from uniparser import Crawler, JSONRuleStorage
import asyncio

crawler = Crawler(storage=JSONRuleStorage.loads(
    r'{"www.python.org": {"host": "www.python.org", "crawler_rules": {"main": {"name":"list","request_args":{"method":"get","url":"https://www.python.org/dev/peps/","headers":{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}},"parse_rules":[{"name":"__request__","chain_rules":[["css","#index-by-category #meta-peps-peps-about-peps-or-processes td.num>a","@href"],["re","^/","@https://www.python.org/"],["python","getitem","[:3]"]],"childs":""}],"regex":"^https://www.python.org/dev/peps/$","encoding":""}, "subs": {"name":"detail","request_args":{"method":"get","url":"https://www.python.org/dev/peps/pep-0001/","headers":{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}},"parse_rules":[{"name":"title","chain_rules":[["css","h1.page-title","$text"],["python","getitem","[0]"]],"childs":""}],"regex":"^https://www.python.org/dev/peps/pep-\\d+$","encoding":""}}}}'
))

expected_result = {
    'list': {
        '__request__': [
            'https://www.python.org/dev/peps/pep-0001',
            'https://www.python.org/dev/peps/pep-0004',
            'https://www.python.org/dev/peps/pep-0005'
        ],
        '__result__': [
            {'detail': {'title': 'PEP 1 -- PEP Purpose and Guidelines'}},
            {'detail': {'title': 'PEP 4 -- Deprecation of Standard Modules'}},
            {'detail': {'title': 'PEP 5 -- Guidelines for Language Evolution'}}
        ]
    }
}


def test_sync_crawler():
    result = crawler.crawl('https://www.python.org/dev/peps/')
    print('sync result:', result)
    assert result == expected_result


def test_async_crawler():
    async def _test():
        result = await crawler.acrawl('https://www.python.org/dev/peps/')
        print('async result:', result)
        assert result == expected_result

    asyncio.run(_test())


test_sync_crawler()
test_async_crawler()

Uniparser Rule Test Console (Web UI)

  1. pip install bottle uniparser
  2. python -m uniparser 8080
  3. open browser => http://127.0.0.1:8080/

(Screenshots: 1.png, 2.png)

The result is displayed as repr(result):

{'HelloWorld': {'rule1-get-first-p': 'Customer name: ', 'rule2-get-legends': [' Pizza Size ', ' Pizza Toppings ']}}

As we can see, the CrawlerRule name is the root key, and the ParseRule names are the other keys.

Async environment usage: FastAPI

import uvicorn
from uniparser.fastapi_ui import app

if __name__ == "__main__":
    uvicorn.run(app, port=8080)
    # http://127.0.0.1:8080

Or as a FastAPI sub-app:

import uvicorn
from fastapi import FastAPI
from uniparser.fastapi_ui import app as sub_app

app = FastAPI()
app.mount('/uniparser', sub_app)

if __name__ == "__main__":
    uvicorn.run(app, port=8080)
    # http://127.0.0.1:8080/uniparser/

More Usage

Some demos: click the dropdown button at the top of the Web UI.

Test code: test_parsers.py

Advanced usage: Create crawler rule for watchdogs

Generate parsers doc

from uniparser import Uniparser

for i in Uniparser().parsers:
    print(f'## {i.__class__.__name__} ({i.name})\n\n```\n{i.doc}\n```')

Benchmark

Compare parsers and choose a faster one

css: 2558 calls/sec, ['<a class="url" href="/">title</a>','a.url','@href']
css: 2491 calls/sec, ['<a class="url" href="/">title</a>','a.url','$text']
css: 2385 calls/sec, ['<a class="url" href="/">title</a>','a.url','$innerHTML']
css: 2495 calls/sec, ['<a class="url" href="/">title</a>','a.url','$html']
css: 2296 calls/sec, ['<a class="url" href="/">title</a>','a.url','$outerHTML']
css: 2182 calls/sec, ['<a class="url" href="/">title</a>','a.url','$string']
css: 2130 calls/sec, ['<a class="url" href="/">title</a>','a.url','$self']
================================================================================
css1: 2525 calls/sec, ['<a class="url" href="/">title</a>','a.url','@href']
css1: 2402 calls/sec, ['<a class="url" href="/">title</a>','a.url','$text']
css1: 2321 calls/sec, ['<a class="url" href="/">title</a>','a.url','$innerHTML']
css1: 2256 calls/sec, ['<a class="url" href="/">title</a>','a.url','$html']
css1: 2122 calls/sec, ['<a class="url" href="/">title</a>','a.url','$outerHTML']
css1: 2142 calls/sec, ['<a class="url" href="/">title</a>','a.url','$string']
css1: 2483 calls/sec, ['<a class="url" href="/">title</a>','a.url','$self']
================================================================================
selectolax: 15187 calls/sec, ['<a class="url" href="/">title</a>','a.url','@href']
selectolax: 19164 calls/sec, ['<a class="url" href="/">title</a>','a.url','$text']
selectolax: 19699 calls/sec, ['<a class="url" href="/">title</a>','a.url','$html']
selectolax: 20659 calls/sec, ['<a class="url" href="/">title</a>','a.url','$outerHTML']
selectolax: 20369 calls/sec, ['<a class="url" href="/">title</a>','a.url','$self']
================================================================================
selectolax1: 17572 calls/sec, ['<a class="url" href="/">title</a>','a.url','@href']
selectolax1: 19096 calls/sec, ['<a class="url" href="/">title</a>','a.url','$text']
selectolax1: 17997 calls/sec, ['<a class="url" href="/">title</a>','a.url','$html']
selectolax1: 18100 calls/sec, ['<a class="url" href="/">title</a>','a.url','$outerHTML']
selectolax1: 19137 calls/sec, ['<a class="url" href="/">title</a>','a.url','$self']
================================================================================
xml: 3171 calls/sec, ['<dc:creator><![CDATA[author]]></dc:creator>','creator','$text']
================================================================================
re: 220240 calls/sec, ['a a b b c c','a|c','@b']
re: 334206 calls/sec, ['a a b b c c','a','']
re: 199572 calls/sec, ['a a b b c c','a (a b)','$0']
re: 203122 calls/sec, ['a a b b c c','a (a b)','$1']
re: 256544 calls/sec, ['a a b b c c','b','-']
================================================================================
jsonpath: 28 calls/sec, [{'a':{'b':{'c':1}}},'$..c','']
================================================================================
objectpath: 42331 calls/sec, [{'a':{'b':{'c':1}}},'$..c','']
================================================================================
jmespath: 95449 calls/sec, [{'a':{'b':{'c':1}}},'a.b.c','']
================================================================================
udf: 58236 calls/sec, ['a b c d','input_object[::-1]','']
udf: 64846 calls/sec, ['a b c d','context["key"]',{'key':'value'}]
udf: 55169 calls/sec, ['a b c d','md5(input_object)','']
udf: 45388 calls/sec, ['["string"]','json_loads(input_object)','']
udf: 50741 calls/sec, ['["string"]','json_loads(obj)','']
udf: 48974 calls/sec, [['string'],'json_dumps(input_object)','']
udf: 41670 calls/sec, ['a b c d','parse = lambda input_object: input_object','']
udf: 31930 calls/sec, ['a b c d','def parse(input_object): context["key"]="new";return context',{'key':'new'}]
================================================================================
python: 383293 calls/sec, [[1,2,3],'getitem','[-1]']
python: 350290 calls/sec, [[1,2,3],'getitem','[:2]']
python: 325668 calls/sec, ['abc','getitem','[::-1]']
python: 634737 calls/sec, [{'a':'1'},'getitem','a']
python: 654257 calls/sec, [{'a':'1'},'get','a']
python: 642111 calls/sec, ['a b\tc \n\td','split','']
python: 674048 calls/sec, [['a','b','c','d'],'join','']
python: 478239 calls/sec, [['aaa',['b'],['c','d']],'chain','']
python: 191430 calls/sec, ['python','template','1 $input_object 2']
python: 556022 calls/sec, [[1],'index','0']
python: 474540 calls/sec, ['python','index','-1']
python: 619489 calls/sec, [{'a':'1'},'index','a']
python: 457317 calls/sec, ['adcb','sort','']
python: 494608 calls/sec, [[1,3,2,4],'sort','desc']
python: 581480 calls/sec, ['aabbcc','strip','a']
python: 419745 calls/sec, ['aabbcc','strip','ac']
python: 615518 calls/sec, [' \t a ','strip','']
python: 632536 calls/sec, ['a','default','b']
python: 655448 calls/sec, ['','default','b']
python: 654189 calls/sec, [' ','default','b']
python: 373153 calls/sec, ['a','base64_encode','']
python: 339589 calls/sec, ['YQ==','base64_decode','']
python: 495246 calls/sec, ['a','0','b']
python: 358796 calls/sec, ['','0','b']
python: 356988 calls/sec, [None,'0','b']
python: 532092 calls/sec, [{0:'a'},'0','a']
================================================================================
loader: 159737 calls/sec, ['{"a": "b"}','json','']
loader: 38540 calls/sec, ['a = "a"','toml','']
loader: 3972 calls/sec, ['animal: pets','yaml','']
loader: 461297 calls/sec, ['a','b64encode','']
loader: 412507 calls/sec, ['YQ==','b64decode','']
================================================================================
time: 39241 calls/sec, ['2020-02-03 20:29:45','encode','']
time: 83251 calls/sec, ['1580732985.1873155','decode','']
time: 48469 calls/sec, ['2020-02-03T20:29:45','encode','%Y-%m-%dT%H:%M:%S']
time: 74481 calls/sec, ['1580732985.1873155','decode','%b %d %Y %H:%M:%S']
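
For reference, numbers like the ones above can be produced with a small timing helper. The sketch below is not the project's benchmark script; it is a generic, standard-library-only loop that measures calls/sec for any zero-argument callable, and the regex example at the end is just a hypothetical stand-in for whichever parser call you want to compare.

import re
import time

def calls_per_sec(func, duration=1.0):
    # Call func repeatedly for roughly `duration` seconds and report the rate.
    count = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration:
        func()
        count += 1
    return count / (time.perf_counter() - start)

# Hypothetical usage: time a simple regex extraction on the same sample HTML
# used above, then print it in the same "calls/sec" style.
pattern = re.compile(r'href="([^"]+)"')
sample = '<a class="url" href="/">title</a>'
rate = calls_per_sec(lambda: pattern.findall(sample))
print(f're: {rate:.0f} calls/sec, [{sample!r}, {pattern.pattern!r}]')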

TODO

  • [x] Release to pypi.org
    • [x] Upload dist with the Web UI
  • [x] Add GitHub Actions for testing the package
  • [x] Web UI for testing rules
  • [x] Complete the documentation in detail
  • [x] Compare the performance of each parser
