<p>It is not clear from your desired output or from your code what exactly you are trying to achieve, but if it is just about counting words within individual sentences, the strategy would be to:</p>
<ol>
<li>Read your <code>common.txt</code> into a <code>set</code> for a quick lookup.</li>
<li>Read your <code>sample.txt</code> and split it on <code>.</code> to get the individual sentences.</li>
<li>Clear out all non-word characters (you have to define them, or use the regex <code>\b</code> to capture word boundaries) and replace them with spaces.</li>
<li>Split on whitespace and count the words that are not present in the <code>set</code> from step 1.</li>
</ol>
<p>So:</p>
<pre><code>import collections

with open("common.txt", "r") as f:  # open the `common.txt` for reading
    common_words = {l.strip().lower() for l in f}  # read each line and add it to a set

interpunction = ";,'\""  # define word separating characters and create a translation table
trans_table = str.maketrans(interpunction, " " * len(interpunction))

sentences_counter = []  # a list to hold a word count for each sentence
with open("sample.txt", "r") as f:  # open the `sample.txt` for reading
    # read the whole file to include linebreaks and split on `.` to get individual sentences
    sentences = [s for s in f.read().split(".") if s.strip()]  # ignore empty sentences
for sentence in sentences:  # iterate over each sentence
    sentence = sentence.translate(trans_table)  # replace the interpunction with spaces
    word_counter = collections.defaultdict(int)  # a string:int default dict for counting
    for word in sentence.split():  # split the sentence and iterate over the words
        if word.lower() not in common_words:  # count only words not in the common.txt
            word_counter[word.lower()] += 1
    sentences_counter.append(word_counter)  # add the current sentence word count
</code></pre>
<p><em>NOTE: on Python 2.x use <code>string.maketrans()</code> instead of <code>str.maketrans()</code>.</em></p>
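<p>As a quick sanity check of what the translation table actually does (the sample string here is invented for illustration), every interpunction character is mapped to a space, so a plain <code>split()</code> afterwards yields clean words:</p>

```python
interpunction = ";,'\""  # same separator set as in the answer above
trans_table = str.maketrans(interpunction, " " * len(interpunction))

# each of ; , ' " becomes a space, so split() drops them entirely
print("england; wales,\"fertile\"".translate(trans_table).split())
# → ['england', 'wales', 'fertile']
```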
<p>This will result in <code>sentences_counter</code> holding a dict count for each sentence in <code>sample.txt</code>, where the keys are the actual words and their associated values are the word counts. You can print the result as:</p>
<pre><code>for i, v in enumerate(sentences_counter):
    print("Sentence #{}:".format(i+1))
    print("\n".join("\t{}: {}".format(w, c) for w, c in v.items()))
</code></pre>
<p>Which will yield (for your sample data):</p>
<pre>Sentence #1:
	area: 1
	drainage-basin: 1
	great: 1
	combined: 1
	areas: 1
	england: 1
	wales: 1
	wide: 1
	region: 1
	fertile: 1
Sentence #2:
	mississippi: 1
	valley: 1
	proper: 1
	exceptionally: 1</pre>
<p>Keep in mind that the (English) language is more complex than this - for example, <em>"A cat wiggles its tail when it's angry, so stay away from it."</em> will come out very differently depending on how you treat the apostrophe. Also, a dot doesn't necessarily denote the end of a sentence. If you want to do serious language analysis you should look into a proper NLP toolkit such as <a href="https://www.nltk.org/" rel="nofollow noreferrer">NLTK</a>.</p>
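<p>To see the apostrophe problem concretely: with the separator set used above, treating <code>'</code> as a word separator splits contractions apart, so <em>"it's"</em> turns into the two "words" <em>it</em> and <em>s</em>:</p>

```python
interpunction = ";,'\""  # the same separator set as in the answer
trans_table = str.maketrans(interpunction, " " * len(interpunction))

sentence = "A cat wiggles its tail when it's angry."
print(sentence.translate(trans_table).split())
# the contraction "it's" is split into 'it' and 's'; the final dot also
# survives as part of 'angry.' because '.' is not in the separator set
```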
<p><strong>UPDATE</strong>: While I don't see the usefulness of repeating the same data for each word (the counts within a sentence never change), if you want to print each word and nest all the other counts beneath it, you can add an inner loop when printing:</p>
<pre><code>for i, v in enumerate(sentences_counter):
    print("Sentence #{}:".format(i+1))
    for word, count in v.items():
        print("\t{} {}".format(word, count))
        print("\n".join("\t\t{}: {}".format(w, c) for w, c in v.items() if w != word))
</code></pre>
<p>This will give you:</p>
<pre>Sentence #1:
	area 1
		drainage-basin: 1
		great: 1
		combined: 1
		areas: 1
		england: 1
		wales: 1
		wide: 1
		region: 1
		fertile: 1
	drainage-basin 1
		area: 1
		great: 1
		combined: 1
		areas: 1
		england: 1
		wales: 1
		wide: 1
		region: 1
		fertile: 1
	great 1
		area: 1
		drainage-basin: 1
		combined: 1
		areas: 1
		england: 1
		wales: 1
		wide: 1
		region: 1
		fertile: 1
	combined 1
		area: 1
		drainage-basin: 1
		great: 1
		areas: 1
		england: 1
		wales: 1
		wide: 1
		region: 1
		fertile: 1
	areas 1
		area: 1
		drainage-basin: 1
		great: 1
		combined: 1
		england: 1
		wales: 1
		wide: 1
		region: 1
		fertile: 1
	england 1
		area: 1
		drainage-basin: 1
		great: 1
		combined: 1
		areas: 1
		wales: 1
		wide: 1
		region: 1
		fertile: 1
	wales 1
		area: 1
		drainage-basin: 1
		great: 1
		combined: 1
		areas: 1
		england: 1
		wide: 1
		region: 1
		fertile: 1
	wide 1
		area: 1
		drainage-basin: 1
		great: 1
		combined: 1
		areas: 1
		england: 1
		wales: 1
		region: 1
		fertile: 1
	region 1
		area: 1
		drainage-basin: 1
		great: 1
		combined: 1
		areas: 1
		england: 1
		wales: 1
		wide: 1
		fertile: 1
	fertile 1
		area: 1
		drainage-basin: 1
		great: 1
		combined: 1
		areas: 1
		england: 1
		wales: 1
		wide: 1
		region: 1
Sentence #2:
	mississippi 1
		valley: 1
		proper: 1
		exceptionally: 1
	valley 1
		mississippi: 1
		proper: 1
		exceptionally: 1
	proper 1
		mississippi: 1
		valley: 1
		exceptionally: 1
	exceptionally 1
		mississippi: 1
		valley: 1
		proper: 1</pre>
<p>Feel free to remove the printed sentence number and reduce the indentation by one tab to get closer to the desired output from your question. You can also build a tree-like dictionary instead of printing everything to STDOUT, if you prefer that.</p>
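<p>A minimal sketch of that tree-like alternative - the <code>sentences_counter</code> contents here are made-up sample data standing in for what the code above would produce:</p>

```python
# hypothetical sample data in the shape produced by the code above
sentences_counter = [{"area": 1, "wide": 1}, {"valley": 1}]

# nest each sentence's counts under a per-sentence key instead of printing
tree = {"Sentence #{}".format(i + 1): dict(v)
        for i, v in enumerate(sentences_counter)}
print(tree["Sentence #2"])  # → {'valley': 1}
```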
<p><strong>UPDATE 2</strong>: You don't have to use a <code>set</code> for <code>common_words</code> if you don't want to. In this case it is pretty much interchangeable with a <code>list</code>, so you could use a <a href="https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions" rel="nofollow noreferrer">list comprehension</a> instead of a set comprehension (i.e. replace the curly braces with square brackets), but a lookup in a <code>list</code> is an <code>O(n)</code> operation whereas a <code>set</code> lookup is an <code>O(1)</code> operation, so the <code>set</code> is preferred here. Not to mention the side benefit of automatic deduplication should <code>common.txt</code> contain duplicate words.</p>
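<p>You can observe the lookup difference directly - the sizes and repetition counts below are arbitrary, but a membership test near the end of a list (the <code>O(n)</code> worst case) is measurably slower than the same test against a set:</p>

```python
import timeit

common_list = ["word{}".format(i) for i in range(10000)]
common_set = set(common_list)

# membership test for an element near the end of the list (worst case for O(n))
t_list = timeit.timeit(lambda: "word9999" in common_list, number=1000)
t_set = timeit.timeit(lambda: "word9999" in common_set, number=1000)
print(t_set < t_list)  # the set lookup wins comfortably
```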
<p>As for <a href="https://docs.python.org/3/library/collections.html#collections.defaultdict" rel="nofollow noreferrer"><code>collections.defaultdict</code></a>, it is there just to save some coding/checking - it automatically initializes a dictionary key whenever one is requested. Without it you'd have to do the initialization manually:</p>
<pre><code>with open("common.txt", "r") as f:  # open the `common.txt` for reading
    common_words = {l.strip().lower() for l in f}  # read each line and add it to a set

interpunction = ";,'\""  # define word separating characters and create a translation table
trans_table = str.maketrans(interpunction, " " * len(interpunction))

sentences_counter = []  # a list to hold a word count for each sentence
with open("sample.txt", "r") as f:  # open the `sample.txt` for reading
    # read the whole file to include linebreaks and split on `.` to get individual sentences
    sentences = [s for s in f.read().split(".") if s.strip()]  # ignore empty sentences
for sentence in sentences:  # iterate over each sentence
    sentence = sentence.translate(trans_table)  # replace the interpunction with spaces
    word_counter = {}  # initialize a word counting dictionary
    for word in sentence.split():  # split the sentence and iterate over the words
        word = word.lower()  # turn the word to lowercase
        if word not in common_words:  # count only words not in the common.txt
            word_counter[word] = word_counter.get(word, 0) + 1  # increase the last count
    sentences_counter.append(word_counter)  # add the current sentence word count
</code></pre>
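<p>Both approaches produce identical counts - a tiny side-by-side check with a made-up word list:</p>

```python
import collections

words = ["mississippi", "valley", "mississippi"]  # arbitrary sample input

dd = collections.defaultdict(int)  # missing keys auto-initialize to int() == 0
for w in words:
    dd[w] += 1

manual = {}  # the same count with an explicit default via dict.get()
for w in words:
    manual[w] = manual.get(w, 0) + 1

print(dict(dd) == manual)  # → True
```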
<p><strong>UPDATE 3</strong>: If all you want is a raw word count over all the sentences, as in your latest question update, you don't even need to consider the sentences themselves - just add the dot to the interpunction list, read the file line by line, split on whitespace and count the words as before:</p>
<pre><code>import collections

with open("common.txt", "r") as f:  # open the `common.txt` for reading
    common_words = {l.strip().lower() for l in f}  # read each line and add it to a set

interpunction = ";,'\"."  # define word separating characters and create a translation table
trans_table = str.maketrans(interpunction, " " * len(interpunction))

word_counter = collections.defaultdict(int)  # a string:int default dict for counting
with open("sample.txt", "r") as f:  # open the `sample.txt` for reading
    for line in f:  # read the file line by line
        for word in line.translate(trans_table).split():  # remove interpunction and split
            if word.lower() not in common_words:  # count only words not in the common.txt
                word_counter[word.lower()] += 1  # increase the count

print("\n".join("{}: {}".format(w, c) for w, c in word_counter.items()))  # print the counts
</code></pre>