Text Analytics Toolkit for Python (TextAnalyticsLab)

Current version: textlab[v0.1.2]

TextAnalyticsLab: a set of text analytics tools for Python.

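Introduction

TextAnalyticsLab is a Python package that provides a set of text analytics tools for data mining, machine learning projects, and end-to-end text analytics application development. It is compatible and interoperable with the data analysis and manipulation library pandas, the natural language processing library nltk, the machine learning toolkit (pymltoolkit, mltk), and many other AI and machine learning platforms.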

Installation

pip install TextAnalyticsLab

If the installation fails because of dependency issues, run the command again without installing dependencies:

pip install TextAnalyticsLab --no-dependencies
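
After installing, it can be useful to confirm that the package is visible to the interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name `is_importable` is made up for illustration, and the import name `textlab` is taken from the usage section of this document.

```python
import importlib.util

def is_importable(module_name):
    """Return True if a top-level module can be located on the current import path."""
    return importlib.util.find_spec(module_name) is not None

# `textlab` is the import name used elsewhere in this document.
print('textlab importable:', is_importable('textlab'))
```

If this prints `False` after a `--no-dependencies` install, the package itself is probably missing from the active environment; missing dependencies would instead surface when `import textlab` is executed.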

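Features

  • Text similarity
  • Text mining and information extraction (in v0.2.0)
  • Cleaning text content (in v0.1.5)
  • Web scraping (in v0.1.5)
  • Text content classification (in v0.2.0)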

Usage

import textlab

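Warning: Python variable, function or class names

The Python interpreter has a number of built-in functions (https://docs.python.org/3/library/functions.html). These definitions can be overwritten in your code without any warning from the interpreter. Therefore, avoid using the following names as variable, function or class names.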

abs      all         any         ascii         bin      bool      bytearray  bytes
callable chr         classmethod compile       complex  delattr   dict       dir
divmod   enumerate   eval        exec          filter   float     format     frozenset
getattr  globals     hasattr     hash          help     hex       id         input
int      isinstance  issubclass  iter          len      list      locals     map
max      memoryview  min         next          object   oct       open       ord
pow      print       property    range         repr     reversed  round      set
setattr  slice       sorted      staticmethod  str      sum       super      tuple
type     vars        zip         __import__

If you accidentally overwrite any built-in function (e.g. list), run the following to restore the built-in definition.

del(list)
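
To see which names in a namespace currently hide a built-in, a small check along the following lines may help. This is only a sketch using the standard `builtins` module; the helper name `shadowed_builtins` and the example dictionary are made up for illustration and are not part of textlab.

```python
import builtins

def shadowed_builtins(namespace):
    """Return the names in `namespace` that hide a built-in of the same name."""
    return sorted(
        name for name, value in namespace.items()
        if not name.startswith('_')
        and hasattr(builtins, name)
        and value is not getattr(builtins, name)
    )

# A namespace in which `list` and `max` were accidentally reassigned.
namespace = {'list': [1, 2, 3], 'max': 10, 'total': 6, 'len': len}
print(shadowed_builtins(namespace))  # ['list', 'max']

# Deleting the shadowing name makes the built-in visible again.
del namespace['list']
print(shadowed_builtins(namespace))  # ['max']
```

In an interactive session the same helper can be called as `shadowed_builtins(globals())`; some shells redefine a few names themselves, so the result is a hint rather than a definitive list.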

## Text Analytics Example

### Text Similarity
```python
import numpy as np  # needed for np.inf below
import textlab

str1 = 'Hello'
str2 = 'Hola'

# Edit distance, as a raw count and normalized
dl_distance = textlab.damerau_levenshtein_distance(str1, str2, case_sensitive=True, normalized=False)
print('damerau_levenshtein_distance: ', dl_distance)

dl_distance_normalized = textlab.damerau_levenshtein_distance(str1, str2, case_sensitive=True, normalized=True)
print('damerau_levenshtein_distance (normalized): ', dl_distance_normalized)

# All substrings of str1 with at least two characters
substrings = textlab.get_substrings(string=str1, case_sensitive=True, min_length=2, max_length=np.inf)
print('substrings: ', substrings)

# Jaccard index computed over substrings (method='words' is the alternative)
j_index = textlab.jaccard_index(str1, str2, method='substring', case_sensitive=True, min_length=1, max_length=np.inf)
print('jaccard_index: ', round(j_index, 3))

# Output:
# damerau_levenshtein_distance:  3
# damerau_levenshtein_distance (normalized):  0.6
# substrings:  ['He', 'll', 'Hel', 'el', 'llo', 'lo', 'ello', 'Hell', 'Hello', 'ell']
# jaccard_index:  0.143
```
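
For readers who want to see how such numbers can arise, here is a pure-Python sketch of the two measures. It is not textlab's implementation: it uses the optimal string alignment variant of the Damerau-Levenshtein distance, and it assumes normalization divides by the length of the longer string, which is consistent with the 0.6 reported for 'Hello' and 'Hola'. All function names below are made up for illustration.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (a restricted Damerau-Levenshtein variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]

def unique_substrings(s, min_length=1):
    """Set of all contiguous substrings of s with at least min_length characters."""
    return {s[i:j] for i in range(len(s)) for j in range(i + min_length, len(s) + 1)}

def jaccard_substrings(a, b, min_length=1):
    """Jaccard index |A & B| / |A | B| over the two substring sets."""
    sa, sb = unique_substrings(a, min_length), unique_substrings(b, min_length)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

distance = osa_distance('Hello', 'Hola')
print(distance)                                        # 3
print(distance / max(len('Hello'), len('Hola')))       # 0.6 (assumed normalization)
print(round(jaccard_substrings('Hello', 'Hola'), 3))   # 0.143
```

'Hello' has 14 distinct substrings and 'Hola' has 10; they share only 'H', 'l' and 'o', so the Jaccard index is 3 / 21, which rounds to 0.143.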
```python
# A paragraph from Wikipedia: https://en.wikipedia.org/wiki/Albert_Einstein
# A raw string is used so that the backslash in \displaystyle is kept literally.
text = r"""Albert Einstein; 14 March 1879 – 18 April 1955) was a German-born theoretical physicist[5] who developed the theory of relativity, one of the two pillars of modern physics (alongside quantum mechanics).[3][6]:274 His work is also known for its influence on the philosophy of science.[7][8] He is best known to the general public for his mass–energy equivalence formula {\displaystyle E=mc^{2}} E = mc^2, which has been dubbed "the world's most famous equation".[9] He received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect",[10] a pivotal step in the development of quantum theory."""

# normalize_text is called unqualified here, as in the original example
text1 = normalize_text(text, method='str')
text2 = normalize_text(text, method='regex')
```

text1

'albert einstein march – april was a germanborn theoretical physicist who developed the theory of relativity one of the two pillars of modern physics alongside quantum mechanics his work is also known for its influence on the philosophy of science he is best known to the general public for his mass–energy equivalence formula displaystyle emc e mc which has been dubbed the worlds most famous equation he received the nobel prize in physics for his services to theoretical physics and especially for his discovery of the law of the photoelectric effect a pivotal step in the development of quantum theory'

text2

'albert einstein march april was a germanborn theoretical physicist who developed the theory of relativity one of the two pillars of modern physics alongside quantum mechanics his work is also known for its influence on the philosophy of science he is best known to the general public for his mass energy equivalence formula displaystyle emc e mc which has been dubbed the worlds most famous equation he received the nobel prize in physics for his services to theoretical physics and especially for his discovery of the law of the photoelectric effect a pivotal step in the development of quantum theory'
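
The two results differ only in how the non-ASCII en dash is treated: it survives in text1 and becomes a space in text2. The sketch below shows one way such a difference can come about. It is an assumption about the general approach, not textlab's actual `normalize_text` code, and the function names are made up for illustration.

```python
import re
import string

def normalize_str(text):
    """Lowercase, then drop ASCII punctuation and digits with str.translate."""
    table = str.maketrans('', '', string.punctuation + string.digits)
    return ' '.join(text.lower().translate(table).split())

def normalize_regex(text):
    """Lowercase, turn non-ASCII characters into spaces, keep only letters and whitespace."""
    text = re.sub(r'[^\x00-\x7f]', ' ', text.lower())
    text = re.sub(r'[^a-z\s]', '', text)
    return ' '.join(text.split())

# \u2013 is the en dash that appears in the Wikipedia paragraph.
sample = 'German-born physicist (1879 \u2013 1955): mass\u2013energy!'
print(normalize_str(sample))    # germanborn physicist – mass–energy
print(normalize_regex(sample))  # germanborn physicist mass energy
```

`string.punctuation` covers ASCII punctuation only, so the en dash passes through the first function untouched, while the regular-expression version replaces every non-ASCII character before filtering.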
```python
# Text from Wikipedia page: https://en.wikipedia.org/wiki/Email_address
# Note: this is a regular (non-raw) string, so \a is read as the BEL control
# character, which likely explains 'llowed@example.com' in the output.
text = """An email address identifies an email box to which email messages are delivered. A wide variety of formats were used in early email systems, but only a single format is used today, following the standards developed for Internet mail systems since the 1980s. This article uses the term email address to refer to the addr-spec defined in RFC 5322, not to the address that is commonly used; the difference is that an address may contain a display name, a comment, or both.
Valid email addresses
simple@example.com
very.common@example.com
disposable.style.email.with+symbol@example.com
other.email-with-hyphen@example.com
fully-qualified-domain@example.com
user.name+tag+sorting@example.com (may go to user.name@example.com inbox depending on mail server)
x@example.com (one-letter local-part)
example-indeed@strange-example.com
admin@mailserver1 (local domain name with no TLD, although ICANN highly discourages dotless email addresses)
example@s.example (see the List of Internet top-level domains)
" "@example.org (space between the quotes)
"john..doe"@example.org (quoted double dot)
Invalid email addresses
Abc.example.com (no @ character)
A@b@c@example.com (only one @ is allowed outside quotation marks)
a"b(c)d,e:f;g<h>i[j\k]l@example.com (none of the special characters in this local-part are allowed outside quotation marks)
just"not"right@example.com (quoted strings must be dot separated or the only element making up the local-part)
this is"not\allowed@example.com (spaces, quotes, and backslashes may only exist when within quoted strings and preceded by a backslash)
this\ still\"not\\allowed@example.com (even if escaped (preceded by a backslash), spaces, quotes, and backslashes must still be contained by quotes)"""

email_addresses = extract_email_addresses(text)
email_addresses
```
['simple@example.com',
 'very.common@example.com',
 'disposable.style.email.with+symbol@example.com',
 'other.email-with-hyphen@example.com',
 'fully-qualified-domain@example.com',
 'user.name+tag+sorting@example.com',
 'user.name@example.com',
 'example-indeed@strange-example.com',
 'example@s.example',
 'right@example.com',
 'llowed@example.com',
 'allowed@example.com']
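
The listing shows the usual behaviour of pattern-based extraction: fragments of invalid addresses such as 'right@example.com' are picked up, while quoted local parts and dotless domains are skipped. A simplified sketch of this kind of extraction follows. The pattern and the function name are illustrative assumptions, not textlab's code, and the sketch will not necessarily reproduce the listing above item for item.

```python
import re

# Simplified pattern: an unquoted local part, '@', and a domain with at least one dot.
EMAIL_PATTERN = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+')

def find_email_addresses(text):
    """Return unique pattern matches in order of first appearance."""
    return list(dict.fromkeys(EMAIL_PATTERN.findall(text)))

sample = ('Contact simple@example.com or user.name+tag@example.com; '
          'admin@mailserver1 has no TLD. Again: simple@example.com')
print(find_email_addresses(sample))
# ['simple@example.com', 'user.name+tag@example.com']
```

A pattern like this trades correctness for simplicity: the addr-spec grammar in RFC 5322 allows quoted strings and comments that a short regular expression cannot capture.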
```python
# Scrape a Wikipedia page to get the list of countries and the codes for the
# representation of names of countries and their subdivisions.
table = extract_tables_webpage(r'https://en.wikipedia.org/wiki/ISO_3166-1')[1]  # the required information is in the 2nd table extracted
table.sample(6)
```
    English short name (using title case) Alpha-2 code Alpha-3 code  Numeric code Link to ISO 3166-2 subdivision codes Independent
143                                Mexico           MX          MEX           484                        ISO 3166-2:MX         Yes
220                              Thailand           TH          THA           764                        ISO 3166-2:TH         Yes
233                  United Arab Emirates           AE          ARE           784                        ISO 3166-2:AE         Yes
81                                 Gambia           GM          GMB           270                        ISO 3166-2:GM         Yes
148                            Montenegro           ME          MNE           499                        ISO 3166-2:ME         Yes
21                                Belgium           BE          BEL            56                        ISO 3166-2:BE         Yes
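
To illustrate what extracting tables from a page involves, here is a minimal standard-library sketch that collects the rows of every table in an HTML string. It is not how `extract_tables_webpage` is implemented; the class and function names are made up, nested tables are not handled, and fetching the page is left out. For real pages, `pandas.read_html` is a commonly used alternative that returns DataFrames.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text pieces of the cell being read, or None outside <td>/<th>

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.tables.append([])
        elif tag == 'tr' and self.tables:
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

def extract_tables_html(html):
    """Return all tables found in an HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables

html = ('<table><tr><th>Name</th><th>Alpha-2</th></tr>'
        '<tr><td>Mexico</td><td>MX</td></tr></table>')
print(extract_tables_html(html))  # [[['Name', 'Alpha-2'], ['Mexico', 'MX']]]
```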

License

Copyright 2019 Sumudu Tennakoon

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

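Text Analytics Project Timeline

Future release plan

  • TBD [v0.1.5]: Integrate text content cleaning and web scraping.
  • 2019-12-31 [v0.1.6]: Comprehensive documentation; major bug-fix release of the initial version with some enhancements.
  • [v0.2.0]: Integrate text mining, information extraction and classification.
  • [v0.3.0]: End-to-end text analytics application development.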

