Cube for news storage and analysis
Detailed description of the cubicweb-semnews Python project
Summary
This cube provides an implementation of semnews:
- Store news articles and tweets.
- Extract and synthesize information.
- Provide semantically useful and original visualisations.
- Offer analytics tools and data-mining/machine-learning processing.
Installation
Creating an instance:
- Create an instance using: cubicweb-ctl create semnews <name-of-instance>
- Create the instance’s database using: cubicweb-ctl db-create <name-of-instance>
Adding article sources
Article sources can be created as follows:
Blogs/RSS feeds:
session.create_entity('CWSource', name=<name of the source>, type=u'datafeed', parser=u'rss-parser', lang=<lang of the source>, url=<url of the blog/rss feed>, config=u'synchronization-interval=120min')
Tweets:
session.create_entity('CWSource', name=<name of the source>, type=u'datafeed', parser=u'tweet-parser', lang=<lang of the source>, url=<url of the blog/rss feed>, config=u'synchronization-interval=120min')
The synchronization interval can be set to a more specific value, or to "no" for manual synchronization only.
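As a minimal sketch, the keyword arguments shared by both kinds of source can be assembled by a small helper before being passed to session.create_entity. The helper name datafeed_source_kwargs, the source name, and the URL are hypothetical; only the attribute names, the two parser names, and the config format come from the calls above. The sketch assumes the interval is always given in minutes, as in the 120min example.

```python
def datafeed_source_kwargs(name, url, lang, parser=u'rss-parser', interval_min=120):
    """Build the keyword arguments for a CWSource datafeed entity.

    Hypothetical helper: only the attribute names and the
    'synchronization-interval=<n>min' format come from the README.
    """
    if parser not in (u'rss-parser', u'tweet-parser'):
        raise ValueError('unknown parser: %r' % (parser,))
    return {
        'name': name,
        'type': u'datafeed',
        'parser': parser,
        'lang': lang,
        'url': url,
        'config': u'synchronization-interval=%dmin' % interval_min,
    }


# Example values are placeholders, not real sources.
kwargs = datafeed_source_kwargs(u'example-blog', u'http://example.org/feed.rss', u'en')

# Inside a CubicWeb shell, where `session` is available:
# session.create_entity('CWSource', **kwargs)
```

Keeping the arguments in a plain dictionary makes it easy to check them before any entity is created.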
semnews ships with some predefined blog/tweet/RSS feeds:
Some French political blogs. You can add them using:
cubicweb-ctl shell <name-of-instance> <path-to-cube-code-source>/migration/examples_blogs_fr.py
Some international English newspapers. You can add them using:
cubicweb-ctl shell <name-of-instance> <path-to-cube-code-source>/migration/examples_newspapers.py
Some French newspapers. You can add them using:
cubicweb-ctl shell <name-of-instance> <path-to-cube-code-source>/migration/examples_newspapers_fr.py
Some French politician tweets. You can add them using:
cubicweb-ctl shell <name-of-instance> <path-to-cube-code-source>/migration/examples_twitters_fr.py
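Since the four commands differ only in the script name, they can be generated in a loop. The following is a sketch, not part of semnews: the function names example_commands and load_examples are hypothetical, and the instance name and cube source path must be supplied by you.

```python
import os
import subprocess

# Script names as listed in the README, under <cube source>/migration/.
EXAMPLE_SCRIPTS = [
    'examples_blogs_fr.py',
    'examples_newspapers.py',
    'examples_newspapers_fr.py',
    'examples_twitters_fr.py',
]


def example_commands(instance, cube_path):
    """Return one 'cubicweb-ctl shell' command (as an argument list) per script."""
    return [
        ['cubicweb-ctl', 'shell', instance,
         os.path.join(cube_path, 'migration', script)]
        for script in EXAMPLE_SCRIPTS
    ]


def load_examples(instance, cube_path):
    """Run each command in turn, stopping at the first failure."""
    for cmd in example_commands(instance, cube_path):
        subprocess.check_call(cmd)
```

Calling load_examples with your instance name and the path to the cube sources would add all predefined feeds in one go; drop entries from EXAMPLE_SCRIPTS to load only some of them.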
Adding named entity sources
semnews relies on named entity recognition processes, which you must define:
session.create_entity('NerProcess', name=<name of process>, host=<appid or sparql endpoint url>, type=<rql or sparql>, lang=<optional lang of the ner source>, request=<request to be performed>)
See the documentation of the ner cube for details. Example source:
session.create_entity('NerProcess', name=u'dbpedia38-en', host=u'ner', type=u'rql', lang=u'en', request=u'Any U WHERE X label %(token)s, X cwuri U, ' 'X ner_source NS, NS name "dbpedia38-en"')
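The example above can be generalised to other RQL-based sources with a small helper. This is a sketch under assumptions: the helper names are hypothetical, the process is assumed to carry the same name as the ner_source it queries (as in the dbpedia38-en example), and the %(token)s placeholder is deliberately left untouched, since it is presumably filled in by the ner cube at query time.

```python
def ner_request(source_name):
    """Return the RQL request from the README for a given ner_source name.

    str.format only fills in {0}; the %(token)s placeholder is left as-is.
    """
    return (u'Any U WHERE X label %(token)s, X cwuri U, '
            u'X ner_source NS, NS name "{0}"'.format(source_name))


def ner_process_kwargs(source_name, lang, host=u'ner'):
    """Build the keyword arguments for an RQL-based NerProcess entity."""
    return {
        'name': source_name,
        'host': host,
        'type': u'rql',
        'lang': lang,
        'request': ner_request(source_name),
    }


kwargs = ner_process_kwargs(u'dbpedia38-en', u'en')

# Inside a CubicWeb shell, where `session` is available:
# session.create_entity('NerProcess', **kwargs)
```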
Commands
semnews provides two commands:
A command to extract named entities from articles:
cubicweb-ctl process-ner <name-of-instance>
A command to clean up recognized entities according to some DBpedia categories (see entities/external_resources.py):
cubicweb-ctl cleanup-ner <name-of-instance>