<p>这是一个用于NLTK的分类XML语料库阅读器。它基于<a href="https://www.packtpub.com/article/python-text-processing-nltk-20-creating-custom-corpora" rel="nofollow">this tutorial.</a>
这使您可以在XML语料库上使用NLTK的基于类别的特性,如《纽约时报》带注释的语料库。</p>
<p>把这个文件叫做CategorizedXMLCorpusReader.py并将其导入为:</p>
<pre><code>import imp
CatXMLReader = imp.load_source('CategorizedXMLCorpusReader','PATH_TO_THIS_FILE/CategorizedXMLCorpusReader.py')
</code></pre>
<p>然后您可以像其他NLTK阅读器一样使用它。例如</p>
<pre><code>reader = CatXMLReader.CategorizedXMLCorpusReader('PATH_TO_CORPUS', r'.*\.xml', cat_pattern=r'(\w+)/.*')
print(reader.words(categories=['news']))
</code></pre>
<p>我仍在学习NLTK,所以欢迎任何更正或建议。</p>
<pre><code># Categorized XML Corpus Reader
import nltk
from nltk.corpus.reader import CategorizedCorpusReader, XMLCorpusReader
class CategorizedXMLCorpusReader(CategorizedCorpusReader, XMLCorpusReader):
    """An XML corpus reader with NLTK category support.

    Combines CategorizedCorpusReader's category bookkeeping with
    XMLCorpusReader's XML parsing, so category-aware access (e.g. for the
    New York Times annotated corpus) works like other categorized readers.
    """

    def __init__(self, *args, **kwargs):
        # CategorizedCorpusReader expects the kwargs dict itself: it pops
        # the category-related keys (cat_file / cat_map / cat_pattern)
        # before the remaining kwargs reach XMLCorpusReader.
        CategorizedCorpusReader.__init__(self, kwargs)
        XMLCorpusReader.__init__(self, *args, **kwargs)

    def _resolve(self, fileids, categories):
        """Translate a *categories* argument into the matching fileids.

        At most one of *fileids* / *categories* may be given; supplying
        both raises ValueError. Returns None when neither is given, which
        the underlying reader treats as "all files".
        """
        if fileids is not None and categories is not None:
            raise ValueError('Specify fileids or categories, not both')
        if categories is not None:
            return self.fileids(categories)
        return fileids

    # All of the following methods resolve categories to fileids via
    # _resolve() and then delegate to the underlying reader.

    def raw(self, fileids=None, categories=None):
        """Return the raw contents (markup included) of the selected files."""
        return XMLCorpusReader.raw(self, self._resolve(fileids, categories))

    def words(self, fileids=None, categories=None):
        """Return a single list of words from all selected files.

        XMLCorpusReader.words() works on one file at a time, so the
        per-file word lists are concatenated here.
        """
        fileids = self._resolve(fileids, categories)
        all_words = []
        for fileid in fileids:
            all_words.extend(XMLCorpusReader.words(self, fileid))
        return all_words

    def text(self, fileids=None, categories=None):
        """Return the text of the selected XML docs with all markup stripped."""
        fileids = self._resolve(fileids, categories)
        parts = []
        for fileid in fileids:
            # ElementTree's iter() replaces the deprecated getiterator().
            for elem in self.xml(fileid).iter():
                if elem.text:
                    parts.append(elem.text)
        # join() avoids the quadratic cost of repeated string +=.
        return "".join(parts)

    def fieldtext(self, fileids=None, categories=None):
        """Return all text for a specified XML field.

        Not yet implemented: raise explicitly instead of silently
        returning None so callers notice.
        """
        raise NotImplementedError('fieldtext() has not been written yet')

    def sents(self, fileids=None, categories=None):
        """Return sentences tokenized from the selected files.

        Bug fix: the original passed a *list* of words to the Punkt
        tokenizer, which expects a string; tokenize the markup-stripped
        text instead.
        """
        return nltk.tokenize.PunktSentenceTokenizer().tokenize(
            self.text(fileids, categories))

    def paras(self, fileids=None, categories=None):
        # NOTE(review): CategorizedCorpusReader does not appear to define
        # paras(); this delegation likely raises AttributeError — confirm
        # against the NLTK version in use and back it with a reader that
        # actually implements paras().
        return CategorizedCorpusReader.paras(self, self._resolve(fileids, categories))
</code></pre>