fuzzychinese Python package - PyPI

A small package for fuzzy matching of Chinese words

Detailed description of the fuzzychinese Python project

fuzzychinese

Fuzzy matching for visually similar Chinese words

A simple tool for fuzzy matching of Chinese words, particularly useful for matching proper names and addresses.


Installation

pip install fuzzychinese

Quickstart

First, train a model on the target list of words you want to match against.

Then use FuzzyChineseMatch.transform(raw_words, n) to find the top n most similar words in the target list for each word in raw_words.

There are three analyzers to choose from when training a model: stroke, radical, and char. You can also change ngram_range to fine-tune the model.
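To make the role of ngram_range more concrete, here is a minimal sketch of how a stroke sequence can be cut into n-grams. It is an illustration only, not the package's internal code: the helper name `ngrams` is hypothetical, and the stroke string is the one shown for "像" in the Stroke example later in this README.

```python
def ngrams(sequence, n_min, n_max):
    """Return all contiguous n-grams of `sequence` for n in [n_min, n_max]."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(sequence) - n + 1):
            grams.append(sequence[i:i + n])
    return grams


# Stroke sequence for "像", copied from the Stroke example in this README.
strokes = "㇒〡㇒㇇〡㇕一㇒㇁㇒㇒㇒㇏"

# ngram_range=(3, 3) corresponds to using only 3-grams of strokes.
trigrams = ngrams(strokes, 3, 3)
print(len(strokes), "strokes ->", len(trigrams), "trigrams")
print(trigrams[:3])
```

A wider range such as (2, 4) would yield more, partly overlapping features per word. Whether that helps depends on the data, so it is worth trying a few settings.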

transform returns the matched words. Their similarity scores and their indices in the target list can then be retrieved with get_similarity_score() and get_index().

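import pandas as pd
from fuzzychinese import FuzzyChineseMatch

# Target list of words to match against.
test_dict = pd.Series(['长白朝鲜族自治县', '长阳土家族自治县', '城步苗族自治县', '达尔罕茂明安联合旗', '汨罗市'])
# Words to look up; they may be abbreviated or contain look-alike characters.
raw_word = pd.Series(['达茂联合旗', '长阳县', '汩罗市'])
assert ('汩罗市' != '汨罗市')  # They are not the same!

# Fit on the target list, then fetch the two closest matches per word.
fcm = FuzzyChineseMatch(ngram_range=(3, 3), analyzer='stroke')
fcm.fit(test_dict)
top2_similar = fcm.transform(raw_word, n=2)

# Combine the input words, matches, scores and target indices into one table.
res = pd.concat([
    raw_word,
    pd.DataFrame(top2_similar, columns=['top1', 'top2']),
    pd.DataFrame(fcm.get_similarity_score(), columns=['top1_score', 'top2_score']),
    pd.DataFrame(fcm.get_index(), columns=['top1_index', 'top2_index']),
], axis=1)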

	top1	top2	top1_score	top2_score	top1_index
达茂联合旗	达尔罕茂明安联合旗	长白朝鲜族自治县	0.824751	0.287237	3
长阳县	长阳土家族自治县	长白朝鲜族自治县	0.610285	0.475000	1
汩罗市	汨罗市	长白朝鲜族自治县	1.000000	0.152093	4

Other use

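Use the Stroke and Radical classes directly to decompose a Chinese character into strokes or radicals.

from fuzzychinese import Stroke, Radical

stroke = Stroke()
radical = Radical()
print("像", stroke.get_stroke("像"))    # decompose into strokes
print("像", radical.get_radical("像"))  # decompose into radicals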

像 ㇒〡㇒㇇〡㇕一㇒㇁㇒㇒㇒㇏
像 人象

Use FuzzyChineseMatch.compare_two_columns(X, Y) to compare the two words in each row and obtain a similarity score for each pair.
See the documentation for details.
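The idea of a row-wise comparison can be sketched with the standard library alone. The code below is an assumption-laden illustration, not the implementation behind compare_two_columns: it scores each pair by cosine similarity over character bigram counts (the package can also work on strokes or radicals), and the helper names `char_ngrams`, `cosine` and `compare_rows` are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Split a word into character n-grams; short words are kept whole."""
    if len(word) < n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def cosine(grams_a, grams_b):
    """Cosine similarity between two bags of n-grams."""
    count_a, count_b = Counter(grams_a), Counter(grams_b)
    dot = sum(count_a[g] * count_b[g] for g in count_a)
    norm_a = math.sqrt(sum(v * v for v in count_a.values()))
    norm_b = math.sqrt(sum(v * v for v in count_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def compare_rows(xs, ys, n=2):
    """Score the i-th word of xs against the i-th word of ys."""
    return [cosine(char_ngrams(x, n), char_ngrams(y, n)) for x, y in zip(xs, ys)]


left = ['长阳县', '汨罗市']
right = ['长阳土家族自治县', '汨罗市']
scores = compare_rows(left, right)
for x, y, s in zip(left, right, scores):
    print(x, y, round(s, 4))
```

Unlike transform, which searches the whole target list, this kind of comparison only looks at the pair of words that share a row, so the two inputs must have the same length.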

Credits

Data for Chinese radicals are from 漢語拆字字典 by 開放詞典網.


fuzzychinese 0.1.5
