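<p>If you just want to POS-tag the sentences in your document and dump those containing N or more words of a chosen POS to a file, you don't need the second script you posted.</p>
<p>Here is an extremely simplified example:</p>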
<pre><code>import nlpnet

TAGGER = nlpnet.POSTagger('pos-pt', language='pt')


# You could have a function that tags a text and checks whether
# it meets the criteria for storage.
def is_worth_saving(text, pos, pos_count):
    # Tagged sentences are lists of tagged words, which in
    # nlpnet are (word, pos) tuples. Tagged texts may contain
    # several sentences.
    pos_words = [word for sentence in TAGGER.tag(text)
                 for word in sentence
                 if word[1] == pos]
    return len(pos_words) >= pos_count


# Then you just need to open your original file, read a sentence, tag
# it, decide whether it's worth saving, and save it or not, until you
# have consumed the entire original file. That way the whole dataset is
# never loaded into memory and the memory footprint stays small.
with open('opiniaoaborto.txt', encoding='utf8') as original_file:
    with open('oracaos_interessantes.txt', 'w', encoding='utf8') as output_file:
        for text in original_file:
            # For example, only save sentences with at least 5 verbs
            if is_worth_saving(text, 'V', 5):
                # Lines read from a file usually keep their trailing
                # newline, so normalize to exactly one
                output_file.write(text.rstrip('\n') + '\n')
</code></pre>
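<p>To answer your follow-up: you want to check whether a sentence contains 5 words that are tagged with any POS from a given list. I see two scenarios:</p>
<p>A) The 5 words must all share the same POS. For example, a sentence with 5 verbs ('Comendo, dançando, procurando, olhando e falando') or 5 nouns ('O gato, o sapo, o cão, o loro e o ratão foram as compras') qualifies, but one with 5 verbs and nouns combined ('O gato esta querendo comer o ratão' [2 nouns, 3 verbs]) does not.</p>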
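<p>A minimal sketch of scenario A, assuming the tagger output has the shape described in the comments of the first example (a list of sentences, each a list of (word, pos) tuples). The helper name <code>has_enough_of_one_pos</code> and the hand-tagged sample data are illustrative assumptions, not part of nlpnet; in practice you would pass <code>TAGGER.tag(text)</code> as the first argument, and the tags the model actually emits may differ from the ones shown.</p>

```python
from collections import Counter


def has_enough_of_one_pos(tagged_sentences, pos_list, pos_count):
    """Return True if any single POS from pos_list occurs at
    least pos_count times across the tagged sentences."""
    counts = Counter(pos
                     for sentence in tagged_sentences
                     for _word, pos in sentence)
    # Each POS is checked on its own; counts are never summed
    return any(counts[pos] >= pos_count for pos in pos_list)


# Hand-tagged stand-ins for TAGGER.tag(text); tags are illustrative
five_verbs = [[('Comendo', 'V'), ('dançando', 'V'), ('procurando', 'V'),
               ('olhando', 'V'), ('e', 'KC'), ('falando', 'V')]]
mixed = [[('O', 'ART'), ('gato', 'N'), ('esta', 'V'), ('querendo', 'V'),
          ('comer', 'V'), ('o', 'ART'), ('ratão', 'N')]]

print(has_enough_of_one_pos(five_verbs, ['V', 'N'], 5))  # True
print(has_enough_of_one_pos(mixed, ['V', 'N'], 5))       # False
```

<p>Keeping the check independent of the tagger lets you try it on hand-made data before running the model over the whole file.</p>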
<p>B) The sentence contains 5 words whose POS is any of those in the list, counted together. For example: 'O gato esta querendo comer o ratão' (2 nouns + 3 verbs).</p>
<pre><code>import nlpnet

TAGGER = nlpnet.POSTagger('pos-pt', language='pt')


# Again, one of the arguments has to take a list of valid POS
def is_worth_saving(text, pos_list, pos_count):
    pos_words = [word for sentence in TAGGER.tag(text)
                 for word in sentence
                 if word[1] in pos_list]
    return len(pos_words) >= pos_count


with open('opiniaoaborto.txt', encoding='utf8') as original_file:
    with open('oracaos_interessantes.txt', 'w', encoding='utf8') as output_file:
        for text in original_file:
            # For example, only save sentences whose combined count of
            # verbs and nouns is at least 5
            if is_worth_saving(text, ['V', 'N'], 5):
                # Normalize to exactly one trailing newline
                output_file.write(text.rstrip('\n') + '\n')
</code></pre>