Are there any classes in NLTK for text normalization and canonicalization?


1 Answer

Anonymous, 1 day ago

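Also, in the NLTK spec many (sub)tasks are solved with plain Python <a href="http://docs.python.org/release/2.5.2/lib/string-methods.html">string methods</a>.

a) Converting all letters to lower or upper case
<pre><code>text = 'aiUOd'
print(text.lower())  # aiuod
print(text.upper())  # AIUOD
</code></pre>

b) Removing punctuation
<pre><code>text = 'She? Hm, why not!'
puncts = '.,?!'
# Delete each punctuation symbol in turn
for sym in puncts:
    text = text.replace(sym, '')
print(text)  # She Hm why not
</code></pre>

c) Converting numbers into words
Writing this in a few lines is not easy, but plenty of existing solutions turn up if you search for them: <a href="http://www.daniweb.com/software-development/python/code/216839">code snippets</a>, <a href="http://code.google.com/p/numword/">libraries</a>, and so on.

d) Removing accent marks and other diacritics
See point b): just create a list of the characters carrying diacritics, like puncts, and replace each one (with its base letter rather than an empty string).

e) Expanding abbreviations
Create a dictionary of abbreviations:
<pre><code>text = 'USA and GB are ...'
abbrevs = {'USA': 'United States', 'GB': 'Great Britain'}
# Replace every abbreviation with its expansion
for abbrev, expansion in abbrevs.items():
    text = text.replace(abbrev, expansion)
print(text)  # United States and Great Britain are ...
</code></pre>

f) Removing stop words or "too common" words
Create a list of stop words:
<pre><code>text = 'Mary had a little lamb'
temp_corpus = text.split(' ')
stops = ['a', 'the', 'had']
# Keep only the tokens that are not stop words
corpus = [token for token in temp_corpus if token not in stops]
print(corpus)  # ['Mary', 'little', 'lamb']
</code></pre>

g) Text canonicalization (tumor = tumour, it's = it is)
For tumor -&gt; tumour, use a <a href="http://docs.python.org/library/re.html">regex</a>.

Last but not least, note that all of the examples above usually need calibrating against real text; I wrote them as a direction to go in.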
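Points d) and g) come without code, so here is a minimal sketch of one way they might look in Python 3 using only the standard library. It is an illustration under assumptions, not the answer's own method: for d) it swaps the hand-made character list for Unicode decomposition via `unicodedata`, and for g) the helper names (`strip_diacritics`, `canonicalize`) and the two tiny mapping tables are hypothetical and would need to be far larger for real text.

```python
import re
import unicodedata


def strip_diacritics(text):
    """Point d): decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical lookup tables, for illustration only.
SPELLING_VARIANTS = {'tumor': 'tumour', 'color': 'colour'}
CONTRACTIONS = {"it's": 'it is', "don't": 'do not'}


def canonicalize(text):
    """Point g): lowercase, expand contractions, then unify spelling variants."""
    text = text.lower()
    for table in (CONTRACTIONS, SPELLING_VARIANTS):
        # Match whole words only, so a key never fires inside a longer word.
        pattern = re.compile(r'\b(' + '|'.join(map(re.escape, table)) + r')\b')
        text = pattern.sub(lambda m: table[m.group(1)], text)
    return text


print(strip_diacritics('café naïve'))  # cafe naive
print(canonicalize("It's a tumor"))    # it is a tumour
```

Note that a fixed table cannot tell "it's" meaning "it is" from "it's" meaning "it has", which is one reason these rules need calibrating against real text.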
