<p>@SidharthMacherla put me on the right track (NLTK and tokenization), though his solution does not handle multi-word expressions, which moreover may overlap.</p>
<p>In short, the best method I found is to subclass NLTK's <code>MWETokenizer</code> and add a <code>freqs</code> function that counts multi-words using <code>util.Trie</code>:</p>
<pre class="lang-py prettyprint-override"><code>import re, timeit
from nltk.tokenize import MWETokenizer
from nltk.util import Trie

class FreqMWETokenizer(MWETokenizer):
    """A tokenizer that processes tokenized text and merges multi-word expressions
    into single tokens.
    """

    def __init__(self, mwes=None, separator="_"):
        super().__init__(mwes, separator)

    def freqs(self, text):
        """
        :param text: A list containing tokenized text
        :type text: list(str)
        :return: A frequency dictionary with multi-words merged together as keys
        :rtype: dict
        :Example:
        >>> tokenizer = FreqMWETokenizer([mw.split() for mw in ['multilayer ceramic', 'multilayer ceramic capacitor', 'ceramic capacitor']], separator=' ')
        >>> tokenizer.freqs("Gimme that multilayer ceramic capacitor please!".split())
        {'multilayer ceramic': 1, 'multilayer ceramic capacitor': 1, 'ceramic capacitor': 1}
        """
        i = 0
        n = len(text)
        result = {}
        while i &lt; n:
            if text[i] in self._mwes:
                # possible MWE match: walk the trie as long as words keep matching
                j = i
                trie = self._mwes
                while j &lt; n and text[j] in trie:
                    if Trie.LEAF in trie:
                        # an intermediate multi-word ends here, count it
                        mw = self._separator.join(text[i:j])
                        result[mw] = result.get(mw, 0) + 1
                    trie = trie[text[j]]
                    j = j + 1
                else:
                    if Trie.LEAF in trie:
                        # the longest multi-word ends here, count it
                        mw = self._separator.join(text[i:j])
                        result[mw] = result.get(mw, 0) + 1
                    # advance by one word only, so overlapping terms are also counted
                    i += 1
            else:
                i += 1
        return result
</code></pre>
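<p>The leaf check in <code>freqs</code> relies on how <code>nltk.util.Trie</code> marks the end of a stored sequence: the trie is a nested dict keyed by words, and a <code>Trie.LEAF</code> key inside a node means a complete term ends at that node. A minimal sketch (the variable name <code>t</code> is just for illustration):</p>
<pre class="lang-py prettyprint-override"><code>from nltk.util import Trie

# build a trie over word sequences, as MWETokenizer does internally
t = Trie([("multilayer", "ceramic"), ("multilayer", "ceramic", "capacitor")])

node = t["multilayer"]["ceramic"]
print(Trie.LEAF in node)               # True: 'multilayer ceramic' ends here
print("capacitor" in node)             # True: the longer term continues
print(Trie.LEAF in node["capacitor"])  # True: 'multilayer ceramic capacitor' ends here
</code></pre>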
<p>Here is the test suite with speed measurements:</p>
<p>Counting 10k multi-word terms in 10M characters took 2 seconds with <code>FreqMWETokenizer</code>, 4 seconds with plain <code>MWETokenizer</code> (which also provides a full tokenization, but counts no overlaps), 150 seconds with the simple count method, and 1000 seconds with the big regex. Going up to 100k terms in 100M characters remains feasible with the tokenizers, but not with the count or regex methods.</p>
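<p>To see the overlap difference concretely: plain <code>MWETokenizer.tokenize</code> greedily merges only the longest matching term, so the two shorter contained terms never appear as tokens. A minimal sketch with the same example terms as above:</p>
<pre class="lang-py prettyprint-override"><code>from nltk.tokenize import MWETokenizer

terms = ['multilayer ceramic', 'multilayer ceramic capacitor', 'ceramic capacitor']
tokenizer = MWETokenizer([mw.split() for mw in terms], separator=' ')
print(tokenizer.tokenize("Gimme that multilayer ceramic capacitor please!".split()))
# ['Gimme', 'that', 'multilayer ceramic capacitor', 'please!']
</code></pre>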
<p>For testing, find the two large sample files at <a href="https://mega.nz/file/PsVVWSzA#5-OHy-L7SO6fzsByiJzeBnAbtJKRVy95YFdjeF_7yxA" rel="nofollow noreferrer">https://mega.nz/file/PsVVWSzA#5-OHy-L7SO6fzsByiJzeBnAbtJKRVy95YFdjeF_7yxA</a>.</p>
<pre class="lang-py prettyprint-override"><code>from nltk import FreqDist  # needed by nltkmethod

def freqtokenizer(thissampledict, thissampletext):
    """
    This method uses the above FreqMWETokenizer's freqs function.
    It captures overlapping multi-words.
    counting 1000 terms in 1000000 characters took 0.3222855870008061 seconds. found 0 terms from the list.
    counting 10000 terms in 10000000 characters took 2.5309120759993675 seconds. found 21 terms from the list.
    counting 100000 terms in 29467534 characters took 10.57763242800138 seconds. found 956 terms from the list.
    counting 743274 terms in 29467534 characters took 25.613067482998304 seconds. found 10411 terms from the list.
    """
    tokenizer = FreqMWETokenizer([mw.split() for mw in thissampledict], separator=' ')
    # remove punctuation except /-'_ and collapse repeated spaces
    thissampletext = re.sub(' +', ' ', re.sub(r"[^\s\w/\-']+", ' ', thissampletext))
    freqs = tokenizer.freqs(thissampletext.split())
    return freqs

def nltkmethod(thissampledict, thissampletext):
    """This function first produces a tokenization by means of MWETokenizer.
    This takes the biggest matching multi-word, no overlaps.
    They could be computed separately on the dictionary.
    counting 1000 terms in 1000000 characters took 0.34804968100070255 seconds. found 0 terms from the list.
    counting 10000 terms in 10000000 characters took 3.9042628339993826 seconds. found 20 terms from the list.
    counting 100000 terms in 29467534 characters took 12.782784996001283 seconds. found 942 terms from the list.
    counting 743274 terms in 29467534 characters took 28.684293715999956 seconds. found 9964 terms from the list.
    """
    termfreqdic = {}
    tokenizer = MWETokenizer([mw.split() for mw in thissampledict], separator=' ')
    # remove punctuation except /-'_ and collapse repeated spaces
    thissampletext = re.sub(' +', ' ', re.sub(r"[^\s\w/\-']+", ' ', thissampletext))
    tokens = tokenizer.tokenize(thissampletext.split())
    freqdist = FreqDist(tokens)
    termsfound = set(freqdist) & set(thissampledict)
    for t in termsfound:
        termfreqdic[t] = freqdist[t]
    return termfreqdic

def countmethod(thissampledict, thissampletext):
    """
    counting 1000 in 1000000 took 0.9351876619912218 seconds.
    counting 10000 in 10000000 took 91.92642056700424 seconds.
    counting 100000 in 29467534 took 3185.7411157219904 seconds.
    """
    termfreqdic = {}
    for term in thissampledict:
        termfreqdic[term] = thissampletext.count(term)
    return termfreqdic

def regexmethod(thissampledict, thissampletext):
    """
    counting 1000 terms in 1000000 characters took 2.298602456023218 seconds.
    counting 10000 terms in 10000000 characters took 395.46084802100086 seconds.
    counting 100000: impossible
    """
    termfreqdic = {}
    termregex = re.compile(r'\b' + r'\b|\b'.join(thissampledict) + r'\b')
    for m in termregex.finditer(thissampletext):
        termfreqdic[m.group(0)] = termfreqdic.get(m.group(0), 0) + 1
    return termfreqdic

def timing():
    """
    for testing, find the two large sample files at
    https://mega.nz/file/PsVVWSzA#5-OHy-L7SO6fzsByiJzeBnAbtJKRVy95YFdjeF_7yxA
    """
    sampletext = open("G06K0019000000.txt").read().lower()
    sampledict = open("manyterms.lower.txt").read().strip().split('\n')
    print(len(sampletext), 'characters', len(sampledict), 'terms')
    for i in range(4):
        for f in [freqtokenizer, nltkmethod, countmethod, regexmethod]:
            start = timeit.default_timer()
            thissampledict = sampledict[:1000 * 10**i]
            thissampletext = sampletext[:1000000 * 10**i]
            termfreqdic = f(thissampledict, thissampletext)
            print('{f} counting {terms} terms in {characters} characters took {seconds} seconds. found {termfreqdic} terms from the list.'.format(
                f=f.__name__, terms=len(thissampledict), characters=len(thissampletext),
                seconds=timeit.default_timer() - start,
                termfreqdic=len({a: v for (a, v) in termfreqdic.items() if v})))

timing()
</code></pre>