Using a directory as input for tf-idf with Python <code>textblob</code>


I am trying to adapt this code (source found <a href="http://stevenloria.com/finding-important-words-in-a-document-using-tf-idf/" rel="nofollow noreferrer">here</a>) so that it iterates over a directory of files instead of using hard-coded input.

<pre><code>#!/usr/bin/python
# -*- coding: utf-8 -*-

from __future__ import division, unicode_literals
import math
from textblob import TextBlob as tb

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

document1 = tb("""Today, the weather is 30 degrees in Celcius. It is really hot""")

document2 = tb("""I can't believe the traffic headed to the beach. It is really a circus out there.'""")

document3 = tb("""There are so many tolls on this road. I recommend taking the interstate.""")

bloblist = [document1, document2, document3]
for i, blob in enumerate(bloblist):
    print("Document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words:
        score_weight = score * 100
        print("\t{}, {}".format(word, round(score_weight, 5)))
</code></pre>

Instead of each hard-coded <code>document</code>, I would like to use an input txt file from a directory.

For example, suppose I have a directory <code>foo</code> that contains three files: <code>file1</code>, <code>file2</code>, <code>file3</code>.

File 1 contains what <code>document1</code> contains, i.e.

<pre><code>Today, the weather is 30 degrees in Celcius. It is really hot
</code></pre>

File 2 contains what <code>document2</code> contains, i.e.

<pre><code>I can't believe the traffic headed to the beach. It is really a circus out there.
</code></pre>

File 3 contains what <code>document3</code> contains, i.e.

<pre><code>There are so many tolls on this road. I recommend taking the interstate.
</code></pre>

I have had to use <code>glob</code> to get the result I want, and I came up with the following adaptation. It identifies the files correctly, but it does not process them individually the way the original code does:

<pre><code>file_names = glob.glob("/path/to/foo/*")
files = map(open,file_names)
documents = [file.read() for file in files]
[file.close() for file in files]

bloblist = [documents]

for i, blob in enumerate(bloblist):
    print("Document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words:
        score_weight = score * 100
        print("\t{}, {}".format(word, round(score_weight, 5)))
</code></pre>

How can I keep a separate score for each file when using <code>glob</code>?

With the files in the directory as input, the desired result is the same as with the original code [results truncated to the top 3 for space]:

<pre><code>Document 1
    Celcius, 3.37888
    30, 3.37888
    hot, 3.37888
Document 2
    there, 2.38509
    out, 2.38509
    headed, 2.38509
Document 3
    on, 3.11896
    this, 3.11896
    many, 3.11896
</code></pre>

A similar question <a href="https://stackoverflow.com/questions/22434092/compute-tf-idf-with-corpus">here</a> did not fully resolve the problem. I would like to know how to read these files together to compute <code>idf</code>, while keeping them separate to compute the full <code>tf-idf</code>.
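One likely reason the adapted snippet does not treat the files individually is the line `bloblist = [documents]`: it wraps the whole list of strings in another list, so the loop sees a single "document". The sketch below illustrates the difference with plain strings standing in for the file contents; the names `wrapped` and `per_file` are made up for illustration. A separate point worth checking: under Python 3, `map` returns a one-shot iterator, so the `file.close()` comprehension would iterate an already exhausted iterator and close nothing.

```python
# Stand-ins for the text read from the three files (taken from the question).
documents = [
    "Today, the weather is 30 degrees in Celcius. It is really hot",
    "I can't believe the traffic headed to the beach. It is really a circus out there.",
    "There are so many tolls on this road. I recommend taking the interstate.",
]

# What the adapted snippet builds: a list with ONE element (the whole list).
wrapped = [documents]

# One element per file. With TextBlob installed, this would instead be
# [tb(text) for text in documents], giving one TextBlob per file.
per_file = [text for text in documents]

print(len(wrapped))   # number of "documents" the loop would see
print(len(per_file))  # number of documents actually read
```

With one entry per file in `bloblist`, the scoring loop from the original code can stay unchanged.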


1 Answer


@annabanazzi provides a code snippet here, <a href="https://gist.github.com/sloria/6407257" rel="nofollow noreferrer">https://gist.github.com/sloria/6407257</a>:

<pre><code>import os, glob

folder = "/path/to/folder/"
os.chdir(folder)
files = glob.glob("*.txt")  # Makes a list of all files in folder
bloblist = []
for file1 in files:
    with open (file1, 'r') as f:
        data = f.read()  # Reads document content into a string
        document = tb(data.decode("utf-8"))  # Makes TextBlob object
        bloblist.append(document)
</code></pre>

I modified it for my use (Python 3): ^{pr2}$

<hr/>

Update 1:

I have personally had difficulty with the Python glob module, because I often (i) have filenames without extensions (e.g. 01) and (ii) want to recurse over nested directories.

At first glance, the "glob" approach looks like a simple solution. However, when I try to iterate over the files returned by glob, I often run into errors such as

<pre><code>IsADirectoryError: [Errno 21] Is a directory: ...
</code></pre>

when the loop encounters a directory name (not a filename) returned by glob.

In my opinion, with a little extra effort, the following approach is more robust:

<pre><code>import os

bloblist = []

def make_corpus(input_dir):
    # os.walk yields only filenames in `files`, so directories are never opened
    for root, subdirs, files in os.walk(input_dir):
        for filename in files:
            path = os.path.join(root, filename)
            print('file:', path)
            with open(path) as f:
                for line in f:
                    # print(line, end='')
                    bloblist.append(line)
    # print('bloblist:\n', bloblist)
    print('len(bloblist):', len(bloblist), '\n')

make_corpus('input')    ## 'input' = input dir
</code></pre>

<hr/>

Update 2:

One last approach (the Linux shell <code>find</code> command, adapted for use in Python 3):

<pre><code>import sh     ## pip install sh
from textblob import TextBlob as tb

def make_corpus(input_dir):
    '''find (here) matches filenames, excludes directory names'''

    corpus = []
    file_list = []
    #FILES = sh.find(input_dir, '-type', 'f', '-iname', '*.txt')    ## find all .txt files
    FILES = sh.find(input_dir, '-type', 'f', '-iname', '*')    ## find any file
    print('FILES:', FILES)    ## caveat: files in FILES are '\n'-terminated ...
    for filename in FILES:
        #print(filename, end='')
        # file_list.append(filename)    ## when printed, each filename ends with '\n'
        filename = filename.rstrip('\n')    ## ... this addresses that issue
        file_list.append(filename)
        with open(filename) as f:
            #print('file:', filename)
            #
            # for general use:
            #for line in f:
                #print(line)
                #corpus.append(line)
            #
            # for this particular example (Question, above):
            data = f.read()
            document = tb(data)
            corpus.append(document)
    print('file_list:', file_list)
    print('corpus length (lines):', len(corpus))

    with open('output/corpus', 'w') as f:    ## write to file
        for line in corpus:
            f.write(str(line))    ## TextBlob objects must be converted to str first
</code></pre>
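For reference, here is a minimal standard-library sketch of the overall idea in the question: read every regular file under a directory as its own document, then compute tf-idf per document. It is an illustration under stated assumptions, not the code from the gist. The helper names `read_documents` and `top_words` are hypothetical, and a simple regex tokenizer stands in for TextBlob's `.words`, so tokens and scores can differ from what TextBlob produces (for example, "can't" is split into two tokens, and document membership is checked per token rather than by substring).

```python
import math
import re
from pathlib import Path


def read_documents(input_dir, pattern="*"):
    """Return {relative path: list of words}, one entry per regular file."""
    docs = {}
    for path in sorted(Path(input_dir).rglob(pattern)):
        if path.is_file():  # skip directories, which avoids IsADirectoryError
            text = path.read_text(encoding="utf-8")
            # Assumption: regex tokenizer used in place of TextBlob(text).words
            docs[str(path.relative_to(input_dir))] = re.findall(r"\w+", text)
    return docs


def tf(word, words):
    """Term frequency of `word` within one document's word list."""
    return words.count(word) / len(words)


def idf(word, all_docs):
    """Inverse document frequency, using the same formula as the question."""
    containing = sum(1 for words in all_docs if word in words)
    return math.log(len(all_docs) / (1 + containing))


def top_words(docs, n=3):
    """Return {relative path: the n highest-scoring (word, tf-idf) pairs}."""
    all_docs = list(docs.values())
    result = {}
    for name, words in docs.items():
        # idf uses every document; tf uses only the current one
        scores = {w: tf(w, words) * idf(w, all_docs) for w in set(words)}
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        result[name] = ranked[:n]
    return result
```

Calling `top_words(read_documents("/path/to/foo"))` would give one ranked list per file, keyed by the file's path relative to the directory. Keeping the documents in a dict keeps the scores tied to the file they came from, while `idf` still sees the whole collection.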
