Error reading a corpus from a large .txt file to generate a model

Posted 2024-10-05 10:06:12

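I am trying to read the file corpus.txt (the training set) and generate a model. The output must be called lexic.txt and contain the word, the tag, and the number of occurrences. For small training sets it works, but for the training set given by the university (a 30 MB .txt file with millions of lines) the code does not work. I suppose this is an efficiency problem and the system runs out of memory. Can anyone help me with my code, please?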

Here is my code:

from collections import Counter

file=open('corpus.txt','r')
data=file.readlines()
file.close()

palabras = []
count_list = []

for linea in data:
   linea.decode('latin_1').encode('UTF-8') # for the accented characters
   palabra_tag = linea.split('\n')
   palabras.append(palabra_tag[0])

cuenta = Counter(palabras) # dictionary counting the occurrences of each word + tag

# Assign to every word + tag the number of times it appears
for palabraTag in palabras:
    for i in range(len(palabras)):
        if palabras[i] == palabraTag:       
            count_list.append([palabras[i], str(cuenta[palabraTag])])


#We delete repeated ones
finalList = []
for i in count_list:
    if i not in finalList:
        finalList.append(i)


outfile = open('lexic.txt', 'w') 
outfile.write('Palabra\tTag\tApariciones\n')

for i in range(len(finalList)):
    outfile.write(finalList[i][0]+'\t'+finalList[i][1]+'\n') # finalList[i][0] is the word + tag and finalList[i][1] is the number of occurrences

outfile.close()

Here you can see a sample of corpus.txt:

Al  Prep
menos   Adv
cinco   Det
reclusos    Adj
murieron    V
en  Prep
las Det
últimas Adj
24  Num
horas   NC
en  Prep
las Det
cárceles    NC
de  Prep
Valencia    NP
y   Conj
Barcelona   NP
en  Prep
incidentes  NC
en  Prep
los Det
que Pron
su  Det

Thanks in advance.

2 Answers

If you combine these two pieces of code, you can reduce memory usage:

# Assign to every word + tag the number of times it appears
for palabraTag in palabras:
    for i in range(len(palabras)):
        if palabras[i] == palabraTag:       
            count_list.append([palabras[i], str(cuenta[palabraTag])])


#We delete repeated ones
finalList = []
for i in count_list:
    if i not in finalList:
        finalList.append(i) 

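You can check whether an item already exists in count_list, so that duplicates are never added in the first place. This reduces memory usage. See below: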

# Assign to every word + tag the number of times it appears
for palabraTag in palabras:
    for i in range(len(palabras)):
        # Parentheses let the condition span two lines
        if (palabras[i] == palabraTag and
                [palabras[i], str(cuenta[palabraTag])] not in count_list):
            count_list.append([palabras[i], str(cuenta[palabraTag])])
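
Note that the nested loops above still compare every line against every other line, so the running time grows quadratically with the size of the corpus even when duplicates are skipped. Below is a minimal single-pass sketch, assuming Python 3 and that palabras holds one "word + tag" string per line as in the question. The helper name unique_with_counts is hypothetical and not part of the original code.

```python
from collections import Counter

def unique_with_counts(palabras):
    """Return [word_tag, count] pairs, one per distinct entry, in first-seen order."""
    cuenta = Counter(palabras)   # one pass to count every "word + tag" line
    seen = set()                 # set membership tests are O(1) on average
    count_list = []
    for palabra_tag in palabras:
        if palabra_tag not in seen:
            seen.add(palabra_tag)
            count_list.append([palabra_tag, str(cuenta[palabra_tag])])
    return count_list

# Hypothetical sample data, tab-separated like the expected output file
sample = ["en\tPrep", "las\tDet", "en\tPrep", "de\tPrep", "en\tPrep"]
print(unique_with_counts(sample))
```

The set replaces the `not in count_list` test, which scans the whole list on every check.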

In the end I improved the code by using a dictionary. Here is the result, which works 100%:

file=open('corpus.txt','r')
data=file.readlines()
file.close()

diccionario = {}

for linea in data:
    linea.decode('latin_1').encode('UTF-8') # for the accented characters
    palabra_tag = linea.split('\n')
    cadena = str(palabra_tag[0])
    if(diccionario.has_key(cadena)):
        aux = diccionario.get(cadena)
        aux += 1
        diccionario.update({cadena:aux})
    else:
        diccionario.update({cadena:1})

outfile = open('lexic.txt', 'w')
outfile.write('Palabra\tTag\tApariciones\n')

for key, value in diccionario.iteritems() :
    s = str(value)
    outfile.write(key +" "+s+'\n')
outfile.close()
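
For reference, the same dictionary idea can be written without loading the whole file through readlines(). The following is a minimal sketch rather than the poster's code. It assumes Python 3, a Latin-1 encoded input (as the decode('latin_1') call suggests), a tab between each entry and its count, and that blank lines should be skipped. The function name build_lexicon is hypothetical.

```python
from collections import Counter

def build_lexicon(corpus_path, lexicon_path):
    """Count identical 'word + tag' lines and write each one with its count."""
    counts = Counter()
    # Iterating over the file object reads one line at a time,
    # so only the distinct entries are kept in memory.
    with open(corpus_path, encoding="latin_1") as corpus:
        for line in corpus:
            entry = line.rstrip("\n")
            if entry:                      # assumption: ignore blank lines
                counts[entry] += 1
    with open(lexicon_path, "w", encoding="utf-8") as out:
        out.write("Palabra\tTag\tApariciones\n")
        for entry, n in counts.items():
            out.write(f"{entry}\t{n}\n")
    return counts
```

It could be called as build_lexicon('corpus.txt', 'lexic.txt'), using the file names from the question.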
