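I have a large dictionary (366MB when output as a string, ~383764153 lines as a text file) that I would like to store in a database, so that I can access it quickly and skip the computation time needed to populate the dictionary.
My dictionary consists of a dictionary of filename/content pairs. A small subset: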
{
'Reuters/19960916': {
'54826newsML': '<?xml version="1.0"
encoding="iso-8859-1" ?>\r\n<newsitem itemid="54826" id="root"
date="1996-09-16" xml:lang="en">\r\n<title>USA: RESEARCH ALERT -
Crestar Financial cut.</title>\r\n<headline>RESEARCH ALERT - Crestar
Financial cut.</headline>\r\n<text>\n<p>-- Salomon Brothers analyst
Carole Berger said she cut her rating on Crestar Financial Corp to
hold from buy, at the same time lowering her 1997 earnings per share
view to $5.40 from $5.85.</p>\n<p>-- Crestar said it would buy
Citizens Bancorp in a $774 million stock swap.</p>\n<p>-- Crestar
shares were down 2-1/2 at 58-7/8. Citizens Bancorp soared 14-5/8 to
46-7/8.</p>\n</text>\r\n<copyright>(c) Reuters Limited',
'55964newsML': '<?xml version="1.0" encoding="iso-8859-1"
?>\r\n<newsitem itemid="55964" id="root" date="1996-09-16"
xml:lang="en">\r\n<title>USA: Nebraska cattle sales thin at
$114/dressed-feedlot.</title>\r\n'
}
}
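I originally thought MongoDB would be a good fit, but it seems to require that both keys and values be Unicode, and since I am getting the filenames from namelist(), they are not guaranteed to be Unicode.
How would you suggest I get this dictionary into a database?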
pymongo does not require strings to be unicode. It actually sends ASCII strings as-is and encodes unicode as UTF-8. When you retrieve data from pymongo, you always get unicode back. See http://api.mongodb.org/python/2.0/tutorial.html#a-note-on-unicode-strings
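If your input contains "international" byte strings with high-order bytes (such as ab\xC3cd), you need to either convert those strings to unicode or encode them as UTF-8. A simple recursive converter is enough to handle an arbitrarily nested dict.

If you have the RAM (and you apparently do, since you populated the dictionary in the first place), use cPickle. Or, if you want something that needs less RAM but is slower, use shelve.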
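
The recursive conversion of a nested dict mentioned for the MongoDB route might look roughly like the sketch below. It is written for Python 3, where byte strings are bytes and text is str (the thread itself is from the Python 2 era). The function name to_text and the default encoding are my own assumptions: iso-8859-1 is only a guess based on the XML declarations in the sample, so substitute whatever encoding your archive really uses.

```python
def to_text(obj, encoding="iso-8859-1"):
    """Recursively decode byte strings inside nested dicts and lists.

    Hypothetical helper: the name and the default encoding are assumptions,
    not something taken from pymongo or from the original answer.
    """
    if isinstance(obj, bytes):
        # Byte string: decode it to text using the assumed encoding.
        return obj.decode(encoding)
    if isinstance(obj, dict):
        # Decode both keys and values, descending into nested dicts.
        return {to_text(k, encoding): to_text(v, encoding)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [to_text(item, encoding) for item in obj]
    # Anything else (already text, numbers, None, ...) is left untouched.
    return obj


# Tiny made-up sample using the byte string from the answer above.
sample = {b"Reuters/19960916": {b"54826newsML": b"ab\xc3cd"}}
converted = to_text(sample)
```

After this, every key and value in converted is a str, which is the form a driver that expects text can encode as UTF-8 itself.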
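
To make the pickle versus shelve trade-off concrete, here is a minimal sketch. It assumes Python 3, where cPickle has been folded into the standard pickle module. The file names, the temporary directory, and the placeholder XML strings are all made up for illustration. Splitting the shelf by the top-level folder key is just one possible layout, not something the answer prescribes.

```python
import os
import pickle
import shelve
import tempfile

# Placeholder data in the same shape as the question's dictionary.
data = {
    "Reuters/19960916": {
        "54826newsML": "<newsitem itemid=\"54826\"/>",
        "55964newsML": "<newsitem itemid=\"55964\"/>",
    },
}

workdir = tempfile.mkdtemp()  # hypothetical location for the files

# Option 1: pickle. The whole dict is written and read back in one go,
# so all of it has to fit in RAM when loading.
pickle_path = os.path.join(workdir, "corpus.pickle")
with open(pickle_path, "wb") as fh:
    pickle.dump(data, fh, protocol=pickle.HIGHEST_PROTOCOL)
with open(pickle_path, "rb") as fh:
    restored = pickle.load(fh)

# Option 2: shelve. A dict-like object backed by a file on disk; each
# top-level entry is pickled separately and only loaded when accessed.
shelf_path = os.path.join(workdir, "corpus.shelf")
with shelve.open(shelf_path) as shelf:
    for folder, files in data.items():
        shelf[folder] = files  # shelve keys must be str
with shelve.open(shelf_path) as shelf:
    one_folder = shelf["Reuters/19960916"]
```

With pickle you pay the full load time once per process. With shelve each lookup touches the disk, but you never need the whole dictionary in memory at once.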