Normalizing a corpus using a dictionary

with open("corpus.txt", 'r', encoding='utf8') as main:
    words = main.read().split()

lexnorm = {'nonstandard1': 'standard1', 'nonstandard2': 'standard2', 'nonstandard3': 'standard3'}  # and so on

for x in lexnorm:
    for y in words:
        if lexnorm[x][0] == y:
            y == x[1]  # note: == compares and does not assign, so this loop has no effect

text = ' '.join(lexnorm.get(y, y) for y in words)

print(text)

1 Answer

User

#1 · Posted 2024-10-01 15:28:35

One way to write the dictionary out to a text file is as a JSON string:

import json

lexnorm = {'nonstandard1': 'standard1', 'nonstandard2': 'standard2', 'nonstandard3': 'standard3'} # etc.

with open('lexnorm.txt', 'w') as f:
    json.dump(lexnorm, f)

See my comment on your original post. I am only guessing at what you want to do:

import json, re

with open('lexnorm.txt') as f:
    lexnorm = json.load(f) # read back lexnorm dictionary

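with open("corpus.txt", 'r', encoding='utf8') as main, open('new_corpus.txt', 'w', encoding='utf8') as new_main:
    for line in main:
        # split on runs of non-letters (note A-Z, not A-z, which would also treat [ \ ] ^ _ ` as letters)
        words = re.split(r'[^a-zA-Z]+', line)
        for word in words:
            if word in lexnorm:
                # simple replacement of the word in the original line
                line = line.replace(word, lexnorm[word])
        new_main.write(line)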

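The program above reads corpus.txt line by line and tries to split each line into words intelligently. Splitting on single spaces is not enough. Consider the following sentence: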

'"The fox\'s foot grazed the sleeping dog, waking it."'

A standard split on single spaces yields:

['"The', "fox's", 'foot', 'grazed', 'the', 'sleeping', 'dog,', 'waking', 'it."']

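You would never be able to match The, fox, dog, or it.

There are several ways to handle this. I am splitting on runs of one or more non-alphabetic characters. If the words in lexnorm contain characters other than a-z, this may need to be "tweaked":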

re.split(r'[^a-zA-Z]+', '"The fox\'s foot grazed the sleeping dog, waking it."')

Yields:

['', 'The', 'fox', 's', 'foot', 'grazed', 'the', 'sleeping', 'dog', 'waking', 'it', '']
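
As one possible tweak, here is a minimal sketch that assumes lexnorm keys may contain apostrophes. It collects tokens with re.findall instead of re.split, which also avoids the empty strings at both ends of the list. The pattern and the helper name tokenize are assumptions for illustration, not part of the original answer.

```python
import re

# Hypothetical pattern: a run of letters, optionally followed by
# apostrophe-joined runs of letters (e.g. "fox's", "can't").
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def tokenize(line):
    """Return the word tokens in line, keeping internal apostrophes."""
    return WORD_RE.findall(line)

sentence = '"The fox\'s foot grazed the sleeping dog, waking it."'
tokens = tokenize(sentence)
print(tokens)
```

The trade-off is that "fox's" now stays a single token, so a lexnorm entry for fox alone would no longer match it. Which behavior is right depends on what the dictionary keys look like.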

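Once the line has been split into words, each word is looked up in the lexnorm dictionary and, if found, a simple replacement of that word is made in the original line. Finally, the line, with any replacements made to it, is written to a new file. The old file can then be deleted and the new file renamed.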
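
One caveat with simple replacement: str.replace matches substrings, so replacing a short word can also alter longer words that contain it. Below is a minimal sketch of a whole-word alternative based on the \b word-boundary anchor. The helper name replace_word and the sample text are made up for illustration.

```python
import re

def replace_word(line, word, replacement):
    """Replace word only where it stands alone, not inside a longer word."""
    pattern = r'\b' + re.escape(word) + r'\b'
    # A function as the replacement keeps backslashes in the
    # replacement text from being interpreted as escapes.
    return re.sub(pattern, lambda m: replacement, line)

line = "u said the bus was full"
naive = line.replace("u", "you")        # also rewrites "bus" and "full"
whole = replace_word(line, "u", "you")  # only the standalone "u"
print(naive)
print(whole)
```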

Think about how you would handle matching words if you first converted them to lowercase.
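
One possible answer to that exercise, as a sketch only: look up the lowercased token and, when the original token started with a capital letter, capitalize the replacement. This assumes every lexnorm key is stored in lowercase. The entries and the function names are hypothetical.

```python
import re

# Hypothetical entries; keys are assumed to be lowercase.
lexnorm = {'u': 'you', 'gr8': 'great'}

def normalize_token(match):
    """Return the standard form of one token, preserving initial capitalization."""
    word = match.group()
    standard = lexnorm.get(word.lower())
    if standard is None:
        return word  # not in the dictionary: leave unchanged
    if word[0].isupper():
        return standard[:1].upper() + standard[1:]
    return standard

def normalize_line(line):
    # Tokens here are runs of letters and digits, so that "gr8" is one token.
    return re.sub(r'[A-Za-z0-9]+', normalize_token, line)

result = normalize_line("U did gr8, but u know that.")
print(result)
```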

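Update (major optimization)

Because the file may contain many repeated words, one optimization is to process each unique word only once. This works as long as the file is not too large to read into memory: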

import json, re

with open('lexnorm.txt') as f:
    lexnorm = json.load(f) # read back lexnorm dictionary

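with open("corpus.txt", 'r', encoding='utf8') as main:
    text = main.read()  # read the whole file into memory
# split on runs of non-letters (A-Z, not A-z) and keep each unique word once
word_set = set(re.split(r'[^a-zA-Z]+', text))
for word in word_set:
    if word in lexnorm:
        text = text.replace(word, lexnorm[word])
with open("corpus.txt", 'w', encoding='utf8') as main:
    main.write(text)  # overwrite the original file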

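Here the entire file is read into text and split into words, and the words are added to a set, word_set, which guarantees that each word is unique. Each word in word_set that appears in lexnorm is then replaced throughout the text, and the whole text is written back to the original file.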
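
A variation on the same idea, sketched under the assumption that only whole-word matches are wanted: a single re.sub pass with a lookup callback visits each token once. That means it cannot alter substrings of longer words, and it cannot re-replace text that an earlier substitution produced. The entries and the name normalize_text are hypothetical.

```python
import re

# Hypothetical entries; note that 'u' is a substring of 'cu' and of 'you'.
lexnorm = {'cu': 'see you', 'u': 'you'}

def normalize_text(text, mapping):
    """Replace every letter-only token found in mapping, in one pass."""
    return re.sub(
        r'[A-Za-z]+',
        lambda m: mapping.get(m.group(), m.group()),  # unknown tokens stay as-is
        text,
    )

text = "cu later, u two"
result = normalize_text(text, lexnorm)
print(result)
```

The token pattern would need the same kind of adjustment as the split pattern if the dictionary keys contain digits or punctuation.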
