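Python bagofwords package - PyPI

The main goal of this Python module is to provide functions for applying text classification.

bagofwords: detailed Python project description

Introduction

A Python module that lets you create and manage collections of word occurrences from texts without regard to grammar. Its main purpose is to provide a set of classes for managing several classified documents so that text classification can be applied to them.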
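To make the idea concrete, here is a minimal sketch of a bag of words built with the Python 3 standard library only. It does not use this package: the helper name bag_of_words and the regex-based word splitting are assumptions made for illustration.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, ignoring word order and grammar."""
    # Assumption: a "word" is a run of letters, digits or underscores, lowercased.
    words = re.findall(r"\w+", text.lower())
    return Counter(words)


bag = bag_of_words("The cat saw the other cat")
print(bag["cat"], bag["the"], bag["dog"])  # 2 2 0
```

Only the counts survive: the sentence "the other cat saw the cat" would produce exactly the same bag.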

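You can use it through the API or from the command line. For example, you can build your classifiers on the command line and later classify an input document through the API.

Third-party modules

The module uses three third-party modules.

The first module is used by the stop_words filter and the second by the stemming filter. If you do not use these two filters, you do not need to install them.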

Installation

Install it via pip

$ [sudo] pip install bagofwords

Or download the zip and then install it by running

$ [sudo] python setup.py install

You can test it by running

$ [sudo] python setup.py test

Uninstallation

$ [sudo] pip uninstall bagofwords

Python API

Methods

document_classifier(document, **classifieds): text classification based on an implementation of naive Bayes.
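For readers new to naive Bayes, the following is a minimal Python 3 sketch of the general technique on plain word counts. It is not the package's code: the function name classify, the prior (each class's share of all counted words) and the add-one smoothing are assumptions chosen for illustration.

```python
import math
from collections import Counter


def classify(words, **classes):
    """Rank classes for a list of words with a simple multinomial naive Bayes.

    Assumptions (not taken from the package): the prior of a class is its
    share of all counted words, and add-one smoothing handles unseen words.
    """
    vocabulary = set()
    for bag in classes.values():
        vocabulary.update(bag)
    grand_total = sum(sum(bag.values()) for bag in classes.values())

    log_scores = {}
    for name, bag in classes.items():
        total = sum(bag.values())
        score = math.log(total / grand_total)  # log prior
        for word in words:
            # Add-one smoothing keeps every probability above zero.
            score += math.log((bag.get(word, 0) + 1) / (total + len(vocabulary)))
        log_scores[name] = score

    # Normalise the log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    weights = {name: math.exp(s - top) for name, s in log_scores.items()}
    norm = sum(weights.values())
    return sorted(((n, w / norm) for n, w in weights.items()),
                  key=lambda item: item[1], reverse=True)


# Toy word counts, invented for this example.
spam = Counter({"lottery": 3, "prize": 2, "meeting": 1})
ham = Counter({"meeting": 4, "report": 3, "lottery": 1})
result = classify(["lottery", "prize"], spam=spam, ham=ham)
print(result)  # spam ranks first
```

Working in log space avoids numeric underflow when a document contains many words, which is why the scores are only converted back to probabilities at the end.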

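The module contains two main classes, DocumentClass and Document, and four secondary classes: BagOfWords, WordFilters, TextFilters and Tokenizer.

Main classes

DocumentClass: implements a collection of bags of words in which every bag of words belongs to the same category, together with one bag of words holding all the words of the collection. Each bag of words has an identifier; if none is given, a calculated identifier is assigned. It retrieves the text of a file, folder, URL or zip, and can also save or load the collection in JSON format.
Document: implements a bag of words in which all words belong to the same category. It retrieves the text of a file, folder, URL or zip, and can also save or load the document in JSON format.

Secondary classes

BagOfWords: implements a bag of words with their frequency of use.
TextFilters: filters for transforming a text. Used by the Tokenizer class. Includes the filters upper, lower, invalid_chars and html_to_text.
WordFilters: filters for transforming a set of words. Used by the Tokenizer class. Includes the filters stemming, stopwords and normalize.
Tokenizer: splits a string into tokens (a set of words). Optionally, you can set filters that run before (TextFilters) and after (WordFilters) the string is split into tokens.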
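The filter ordering described for Tokenizer can be pictured with the Python 3 sketch below. The names SketchTokenizer and drop_stop_words, the constructor arguments and the three-word stop list are all hypothetical; the package's own classes may be organised differently.

```python
import re


class SketchTokenizer:
    """Split text into words, with optional filters before and after the split."""

    def __init__(self, text_filters=(), word_filters=()):
        self.text_filters = text_filters  # each filter: str -> str
        self.word_filters = word_filters  # each filter: list of str -> list of str

    def __call__(self, text):
        for apply_filter in self.text_filters:  # runs before tokenizing
            text = apply_filter(text)
        words = re.findall(r"\w+", text)
        for apply_filter in self.word_filters:  # runs after tokenizing
            words = apply_filter(words)
        return words


def drop_stop_words(words, stop_words=frozenset({"the", "a", "is"})):
    """Word filter: remove a tiny, made-up list of stop words."""
    return [w for w in words if w not in stop_words]


tokenize = SketchTokenizer(text_filters=[str.lower],
                           word_filters=[drop_stop_words])
print(tokenize("The demo is a DEMO"))  # ['demo', 'demo']
```

Text filters see the whole string, so they suit case folding or HTML stripping; word filters see the token list, so they suit stop-word removal and stemming.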

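Subclasses

Tokenizer subclasses DefaultTokenizer, SimpleTokenizer and HtmlTokenizer, which implement the most common filters and override the after_tokenizer and before_tokenizer methods.
Document subclasses DefaultDocument, SimpleDocument and HtmlDocument.
DocumentClass subclasses DefaultDocumentClass, SimpleDocumentClass and HtmlDocumentClass.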

Command-line tool

usage: bow [-h] [--version] {create,learn,show,classify} ...

Manage several document to apply text classification.

positional arguments:
  {create,learn,show,classify}
    create              create classifier
    learn               add words learned a classifier
    show                show classifier info
    classify            Naive Bayes text classification

optional arguments:
  -h, --help            show this help message and exit
  --version             show version and exit

create command

usage: bow create [-h] [--lang-filter LANG_FILTER]
                  [--stemming-filter STEMMING_FILTER]
                  {text,html} filename

positional arguments:
  {text,html}           filter type
  filename              file to be created where words learned are saved

optional arguments:
  -h, --help            show this help message and exit
  --lang-filter LANG_FILTER
                        language text where remove empty words
  --stemming-filter STEMMING_FILTER
                        number loops of lemmatizing

learn command

usage: bow learn [-h] [--file FILE [FILE ...]] [--dir DIR [DIR ...]]
                 [--url URL [URL ...]] [--zip ZIP [ZIP ...]] [--no-learn]
                 [--rewrite] [--list-top-words LIST_TOP_WORDS]
                 filename

positional arguments:
  filename              file to write words learned

optional arguments:
  -h, --help            show this help message and exit
  --file FILE [FILE ...]
                        filenames to learn
  --dir DIR [DIR ...]   directories to learn
  --url URL [URL ...]   url resources to learn
  --zip ZIP [ZIP ...]   zip filenames to learn
  --no-learn            not write to file the words learned
  --rewrite             overwrite the file
  --list-top-words LIST_TOP_WORDS
                        maximum number of words to list, 50 by default, -1
                        list all

show command

usage: bow show [-h] [--list-top-words LIST_TOP_WORDS] filename

positional arguments:
  filename              filename

optional arguments:
  -h, --help            show this help message and exit
  --list-top-words LIST_TOP_WORDS
                        maximum number of words to list, 50 by default, -1
                        list all

classify command

usage: bow classify [-h] [--file FILE] [--url URL] [--text TEXT]
                    classifiers [classifiers ...]

positional arguments:
  classifiers  classifiers

optional arguments:
  -h, --help   show this help message and exit
  --file FILE  file to classify
  --url URL    url resource to classify
  --text TEXT  text to classify

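Examples

First you need to download a spam corpus, the Enron-Spam dataset. For example, you can download a compressed file that contains one directory with 1500 spam emails and one directory with 4012 ham emails.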

http://www.aueb.gr/users/ion/data/enron-spam/preprocessed/enron3.tar.gz

Now we will create the spam and ham classifiers

$ bow create text spam
* filename: spam
* filter:
    type: DefaultDocument
    lang: english
    stemming: 1
* total words: 0
* total docs: 0

$ bow create text ham
* filename: ham
* filter:
    type: DefaultDocument
    lang: english
    stemming: 1
* total words: 0
* total docs: 0

It is time to learn

$ bow learn spam --dir enron3/spam

current
=======
* filename: spam
* filter:
    type: DefaultDocument
    lang: english
    stemming: 1
* total words: 0
* total docs: 0

updated
=======
* filename: spam
* filter:
    type: DefaultDocument
    lang: english
    stemming: 1
* total words: 223145
* total docs: 1500
* pos | word (top 50)                       | occurrence |       rate
  --- | ----------------------------------- | ---------- | ----------
    1 | "                                   |       2438 | 0.01092563
    2 | subject                             |       1662 | 0.00744807
    3 | compani                             |       1659 | 0.00743463
    4 | s                                   |       1499 | 0.00671761
    5 | will                                |       1194 | 0.00535078
    6 | com                                 |        978 | 0.00438280
    7 | statement                           |        935 | 0.00419010
    8 | secur                               |        908 | 0.00406910
    9 | inform                              |        880 | 0.00394362
   10 | e                                   |        802 | 0.00359408
   11 | can                                 |        798 | 0.00357615
   12 | http                                |        779 | 0.00349100
   13 | pleas                               |        743 | 0.00332967
   14 | invest                              |        740 | 0.00331623
   15 | de                                  |        739 | 0.00331175
   16 | o                                   |        733 | 0.00328486
   17 | 1                                   |        732 | 0.00328038
   18 | 2                                   |        709 | 0.00317731
   19 | stock                               |        700 | 0.00313697
   20 | price                               |        664 | 0.00297564
  ....

$ bow learn ham --dir enron3/ham

current
=======
* filename: ham
* filter:
    type: DefaultDocument
    lang: english
    stemming: 1
* total words: 0
* total docs: 0

updated
=======
* filename: ham
* filter:
    type: DefaultDocument
    lang: english
    stemming: 1
* total words: 1293023
* total docs: 4012
* pos | word (top 50)                       | occurrence |       rate
  --- | ----------------------------------- | ---------- | ----------
    1 | enron                               |      29805 | 0.02305063
    2 | s                                   |      22438 | 0.01735313
    3 | "                                   |      15712 | 0.01215137
    4 | compani                             |      12039 | 0.00931074
    5 | said                                |       9470 | 0.00732392
    6 | will                                |       8862 | 0.00685371
    7 | 2001                                |       8293 | 0.00641365
    8 | subject                             |       7167 | 0.00554282
    9 | 1                                   |       5887 | 0.00455290
   10 | trade                               |       5718 | 0.00442220
   11 | energi                              |       5599 | 0.00433016
   12 | market                              |       5498 | 0.00425205
   13 | new                                 |       5278 | 0.00408191
   14 | 2                                   |       4742 | 0.00366737
   15 | dynegi                              |       4651 | 0.00359700
   16 | stock                               |       4594 | 0.00355291
   17 | 10                                  |       4545 | 0.00351502
   18 | year                                |       4517 | 0.00349336
   19 | power                               |       4503 | 0.00348254
   20 | share                               |       4393 | 0.00339746
 ....
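The rate column in the two tables above is consistent with each word's occurrence count divided by the classifier's total words. That reading is inferred from the printed figures rather than from the package source, and the helper name rate is made up for this Python 3 check:

```python
def rate(occurrence, total_words):
    """Share of all counted words taken by one word."""
    return occurrence / total_words


# Figures copied from the first row of the spam and ham tables.
print("%.8f" % rate(2438, 223145))    # 0.01092563
print("%.8f" % rate(29805, 1293023))  # 0.02305063
```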

Finally, we can classify a text, a file or a URL

$ bow classify spam ham --text "company"

* classifier                          |       rate
  ----------------------------------- | ----------
  ham                                 | 0.87888743
  spam                                | 0.12111257

$ bow classify spam ham --text "new lottery"

* classifier                          |       rate
  ----------------------------------- | ----------
  spam                                | 0.96633627
  ham                                 | 0.03366373

$ bow classify spam ham --text "Subject: a friendly professional online pharmacy focused on you !"

* classifier                          |       rate
  ----------------------------------- | ----------
  spam                                | 0.99671480
  ham                                 | 0.00328520

Note that it is also possible to classify from Python code

import bow

spam = bow.Document.load('spam')
ham = bow.Document.load('ham')
dc = bow.DefaultDocument()

dc.read_text("company")
result = bow.document_classifier(dc, spam=spam, ham=ham)

print(result)

Result

[('ham', 0.8788874288217258), ('spam', 0.12111257117827418)]
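This result can be cross-checked by hand in Python 3. For a single word, a naive Bayes score whose class prior is proportional to the class's total word count reduces to the word's count in that class divided by its count across all classes. That choice of prior is an assumption, picked because it reproduces the printed figures; the package's internals may differ. The sketch also assumes the input "company" is stemmed to compani, whose counts are taken from the learn output shown earlier.

```python
# Occurrences of the stem "compani" reported by `bow learn` for each classifier.
counts = {"spam": 1659, "ham": 12039}

total = sum(counts.values())
rates = sorted(((name, n / total) for name, n in counts.items()),
               key=lambda item: item[1], reverse=True)
for name, value in rates:
    print("%s %.8f" % (name, value))
# ham 0.87888743
# spam 0.12111257
```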

Other examples

Joining several bags of words

from bow import BagOfWords

a = BagOfWords('car', 'chair', 'chicken')
b = BagOfWords({'chicken':2}, ['eye', 'ugly'])
c = BagOfWords('plane')

print(a + b + c)
print(a - b - c)

Result

{'eye': 1, 'car': 1, 'ugly': 1, 'plane': 1, 'chair': 1, 'chicken': 3}
{'car': 1, 'chair': 1}
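For comparison, the same arithmetic can be reproduced with collections.Counter from the Python 3 standard library. This is an analogy rather than a description of how BagOfWords is implemented; Counter drops counts that fall to zero or below on subtraction, which matches the result shown here.

```python
from collections import Counter

a = Counter(["car", "chair", "chicken"])
b = Counter({"chicken": 2, "eye": 1, "ugly": 1})
c = Counter(["plane"])

joined = a + b + c
remaining = a - b - c

print(dict(joined))
print(dict(remaining))  # {'car': 1, 'chair': 1}
```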

HTML document class

from bow import HtmlDocumentClass

html_one = '''
<!DOCTYPE html>
<html lang="en">
<head>
  <title>bag of words demo</title>
  <link rel="stylesheet" href="css/mycss.css">
  <script src="js/myjs.js"></script>
</head>
<body>
  <style> #demo {background: #c00; color: #fff; padding: 10px;}</style>
  <!--my comment section -->
  <h2>This is a demo</h2>
  <p id="demo">This a text example of my bag of words demo!</p>
  I hope this demo is useful for you
  <script type="text/javascript"> alert('But wait, it\'s a demo...');</script>
</body>
</html>
'''

html_two = '''
<!DOCTYPE html>
<html lang="en">
<head> </head>
<body> Another silly example. </body>
</html>
'''

dclass = HtmlDocumentClass(lang='english', stemming=0)
dclass(id_='doc1', text=html_one)
dclass(id_='doc2', text=html_two)
print('docs \n{0}'.format(dclass.docs))
print('total \n{0}'.format(dclass))
print('rates \n{0}'.format(dclass.rates))

Result

>>>
docs
{
 'doc2': {u'silly': 1, u'example': 1, u'another': 1},
 'doc1': {u'useful': 1, u'text': 1, u'bag': 2, u'words': 2, u'demo': 4, u'example': 1, u'hope': 1}
}
total
{
 u'useful': 1, u'another': 1, u'text': 1, u'bag': 2, u'silly': 1, u'words': 2,
 u'demo': 4, u'example': 2, u'hope': 1
}
rates
{
 u'useful': 0.06666666666666667, u'another': 0.06666666666666667, u'text': 0.06666666666666667,
 u'bag': 0.13333333333333333, u'silly': 0.06666666666666667, u'words': 0.13333333333333333,
 u'demo': 0.26666666666666666, u'example': 0.13333333333333333, u'hope': 0.06666666666666667
}
>>>
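The result above shows that script and style contents and comments do not reach the bag, and that each rate is a word's count divided by the total (for example 4 / 15 for demo). A rough Python 3 sketch of that behaviour with the standard library follows. The names VisibleText and html_bag and the two-word stop list are hypothetical, and this is not the package's html_to_text filter.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.skipping = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skipping += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skipping:
            self.skipping -= 1

    def handle_data(self, data):
        # Comments never arrive here; HTMLParser routes them to handle_comment.
        if not self.skipping:
            self.chunks.append(data)


def html_bag(html, stop_words=frozenset()):
    """Count the visible words of an HTML string, minus the stop words."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    return Counter(w for w in words if w not in stop_words)


bag = html_bag("<body><style>p {color: red}</style>"
               "<p>Another silly example.</p><!-- note --></body>",
               stop_words={"a", "the"})
total = sum(bag.values())
rates = {word: count / total for word, count in bag.items()}
print(dict(bag))  # {'another': 1, 'silly': 1, 'example': 1}
print(rates)
```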

License

MIT license, see LICENSE

bagofwords 1.0.4
