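I'm running mongoexport on a collection that contains UTF-8 encoded non-ASCII characters, as well as fields containing characters that mongoexport appears to escape (for example, '&'). What I've noticed is that mongoexport Unicode-escapes the '&' character but does not escape characters such as 'ü'. This is a problem because I'm trying to read the data with Python, and with two different encodings mixed in the same string it can't be decoded correctly.
For example (mongo query to get the record):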
db.Military_Handbooks.findOne({_id: ObjectId("5bf61c80e173a2a10b53ad39")}).PRIMARY_AUTHOR
[
"Dürer, Albrecht",
[
[
"http://ustc.ac.uk/index.php/search/cicero?tm_fulltext=&tm_field_allauthr=Dürer, Albrecht&tm_translator=&tm_editor=&tm_field_short_title=&tm_field_imprint=&tm_field_place=&sm_field_year=&f_sm_field_year=&t_sm_field_year=&sm_field_country=&sm_field_lang=&sm_field_format=&sm_field_digital=&tm_field_class=&tm_field_cit_name=&tm_field_cit_no=&order=",
" Dürer, Albrecht"
]
]
]
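Running the following mongoexport command (the result is the same when exporting as JSON):
[command not preserved in the source]
When trying to read it into Python: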
In [24]: import pandas
In [25]: c = pandas.read_csv('Military_Handbooks2.csv')
In [26]: c.at[1, 'PRIMARY_AUTHOR']
Out[26]: '["Dürer, Albrecht",[["http://ustc.ac.uk/index.php/search/cicero?tm_fulltext=\\u0026tm_field_allauthr=Dürer, Albrecht\\u0026tm_translator=\\u0026tm_editor=\\u0026tm_field_short_title=\\u0026tm_field_imprint=\\u0026tm_field_place=\\u0026sm_field_year=\\u0026f_sm_field_year=\\u0026t_sm_field_year=\\u0026sm_field_country=\\u0026sm_field_lang=\\u0026sm_field_format=\\u0026sm_field_digital=\\u0026tm_field_class=\\u0026tm_field_cit_name=\\u0026tm_field_cit_no=\\u0026order="," Dürer, Albrecht"]]]'
In [27]: c.at[1, 'PRIMARY_AUTHOR'].encode().decode('unicode-escape')
Out[27]: '["DÃ¼rer, Albrecht",[["http://ustc.ac.uk/index.php/search/cicero?tm_fulltext=&tm_field_allauthr=DÃ¼rer, Albrecht&tm_translator=&tm_editor=&tm_field_short_title=&tm_field_imprint=&tm_field_place=&sm_field_year=&f_sm_field_year=&t_sm_field_year=&sm_field_country=&sm_field_lang=&sm_field_format=&sm_field_digital=&tm_field_class=&tm_field_cit_name=&tm_field_cit_no=&order="," DÃ¼rer, Albrecht"]]]'
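The session above can be reproduced on a shortened string. The sketch below is only an illustration, not necessarily the fix the author settled on: it assumes the exported cell is valid JSON (as the array in Out[26] appears to be), and the variable names and the abbreviated URL are made up for the example. `unicode-escape` reads the UTF-8 bytes of 'ü' as Latin-1, which is what breaks the name, whereas `json.loads` resolves only the `\uXXXX` escapes and leaves the already-decoded characters alone.

```python
import json

# Shortened stand-in for the exported cell (assumption: the real cell is
# valid JSON). Literal "\u0026" escapes sit next to a raw "ü".
cell = (
    '["Dürer, Albrecht",[["http://ustc.ac.uk/index.php/search/cicero'
    '?tm_fulltext=\\u0026tm_field_allauthr=Dürer, Albrecht\\u0026order=",'
    '" Dürer, Albrecht"]]]'
)

# unicode-escape decodes the UTF-8 bytes of "ü" as two Latin-1 characters,
# so the name is mangled even though "\u0026" is turned into "&".
mangled = cell.encode().decode("unicode-escape")

# json.loads resolves the \uXXXX escapes and keeps "ü" intact.
parsed = json.loads(cell)
name = parsed[0]
url = parsed[1][0][0]

print(mangled[:20])
print(name)
print(url)
```

With pandas, the same idea could be applied per cell, for example `c['PRIMARY_AUTHOR'].map(json.loads)`, provided every value in the column is a JSON string.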
Specs:
OS: Ubuntu 18.04.1 LTS
Python: 3.6.7
MongoDB shell version v3.6.9
In the end, re-encoding the file while ignoring errors seems to have done the trick.
I should also note that I'm linking to this post, which was helpful: how to export correctly accented words with mongoexport.
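The answer does not say which tool was used for the re-encoding, so the following is only a sketch of what "re-encoding while ignoring errors" could look like in Python. The function name `reencode_ignoring_errors` and its parameters are hypothetical. Note that `errors="ignore"` silently drops any byte sequence that is not valid in the chosen encoding, so some data may be lost.

```python
def reencode_ignoring_errors(src_path, dst_path, encoding="utf-8"):
    """Copy src_path to dst_path, dropping bytes that are invalid in `encoding`.

    Hypothetical helper: a guess at the re-encoding step, not the
    author's exact command.
    """
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, "r", encoding=encoding, errors="ignore", newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        for line in src:
            dst.write(line)
```

The cleaned file can then be read with `pandas.read_csv` as before; `errors="replace"` is an alternative that marks the bad bytes instead of removing them.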