Unexpected character encoding when using mongoexp with MongoDB

db.Military_Handbooks.findOne({_id: ObjectId("5bf61c80e173a2a10b53ad39")}).PRIMARY_AUTHOR
[ "Dürer, Albrecht", [ [ "http://ustc.ac.uk/index.php/search/cicero?tm_fulltext=&tm_field_allauthr=Dürer, Albrecht&tm_translator=&tm_editor=&tm_field_short_title=&tm_field_imprint=&tm_field_place=&sm_field_year=&f_sm_field_year=&t_sm_field_year=&sm_field_country=&sm_field_lang=&sm_field_format=&sm_field_digital=&tm_field_class=&tm_field_cit_name=&tm_field_cit_no=&order=", " Dürer, Albrecht" ] ] ]

In [24]: import pandas
In [25]: c = pandas.read_csv('Military_Handbooks2.csv')
In [26]: c.at[1, 'PRIMARY_AUTHOR']
Out[26]: '["Dürer, Albrecht",[["http://ustc.ac.uk/index.php/search/cicero?tm_fulltext=\\u0026tm_field_allauthr=Dürer, Albrecht\\u0026tm_translator=\\u0026tm_editor=\\u0026tm_field_short_title=\\u0026tm_field_imprint=\\u0026tm_field_place=\\u0026sm_field_year=\\u0026f_sm_field_year=\\u0026t_sm_field_year=\\u0026sm_field_country=\\u0026sm_field_lang=\\u0026sm_field_format=\\u0026sm_field_digital=\\u0026tm_field_class=\\u0026tm_field_cit_name=\\u0026tm_field_cit_no=\\u0026order="," Dürer, Albrecht"]]]'
In [27]: c.at[1, 'PRIMARY_AUTHOR'].encode().decode('unicode-escape')
Out[27]: '["DÃ¼rer, Albrecht",[["http://ustc.ac.uk/index.php/search/cicero?tm_fulltext=&tm_field_allauthr=DÃ¼rer, Albrecht&tm_translator=&tm_editor=&tm_field_short_title=&tm_field_imprint=&tm_field_place=&sm_field_year=&f_sm_field_year=&t_sm_field_year=&sm_field_country=&sm_field_lang=&sm_field_format=&sm_field_digital=&tm_field_class=&tm_field_cit_name=&tm_field_cit_no=&order="," DÃ¼rer, Albrecht"]]]'

1 answer

User

#1 · Posted 2024-09-28 22:40:25

In the end, re-encoding the file while ignoring errors seems to have done the trick.

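import codecs
import os

def encoding():
    # Re-encode every file in the current directory, skipping earlier outputs and logs.
    for fn in os.listdir('.'):
        if '2' not in fn and 'failed' not in fn and 'decode' not in fn:
            try:
                with codecs.open(fn, encoding='utf-8') as fd:
                    text = fd.read()
                # Undo the double encoding: recover the raw bytes via Windows-1252,
                # then decode them as UTF-8, dropping anything that does not fit.
                text = text.encode('Windows-1252', errors='ignore').decode('utf-8', errors='ignore')
                # Write the result next to the source file as <name>2.csv.
                with codecs.open(fn[:fn.rfind('.')] + '2.csv', 'w', encoding='utf-8') as fd:
                    fd.write(text)
            except Exception as ex:
                print(ex)
                print('*' * 50, '\n')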
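
To see why that round trip helps, here is a minimal sketch on a single string rather than a whole file. It assumes the damage is the usual kind where UTF-8 bytes were mis-read as Windows-1252; `repair_mojibake` is a name made up for this illustration, not something from pandas or MongoDB.

```python
def repair_mojibake(text: str) -> str:
    """Hypothetical helper: undo UTF-8 bytes that were mis-decoded as Windows-1252."""
    # Same round trip as in the function above, applied to one string.
    return text.encode('cp1252', errors='ignore').decode('utf-8', errors='ignore')


original = 'Dürer, Albrecht'

# Simulate the damage: the UTF-8 bytes of "ü" (0xC3 0xBC) read as two cp1252 characters.
garbled = original.encode('utf-8').decode('cp1252')
repaired = repair_mojibake(garbled)

print(garbled)   # DÃ¼rer, Albrecht
print(repaired)  # Dürer, Albrecht
```

One thing to watch: because of `errors='ignore'`, running this over text that is already clean silently drops the accented characters (`repair_mojibake('Dürer')` gives `'Drer'`), so it is probably safest to apply it only to files known to be double-encoded.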

I should also note that I came across this post, which was helpful: how to export correctly accented words with mongoexport.
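
As a side note on the `\u0026` escapes in the pandas output from the question: a possible alternative to `unicode-escape`, assuming the cell really holds valid JSON, is to parse it with `json.loads`. The sketch below uses a shortened stand-in string with a placeholder URL, not the real exported value.

```python
import json

# Shortened stand-in for the exported cell (assumption: the cell is valid JSON).
# The URL is a placeholder; "\\u0026" is a literal backslash-u escape, as in the CSV.
cell = '["Dürer, Albrecht",[["http://example.invalid/search?a=\\u0026b=Dürer"," Dürer, Albrecht"]]]'

# Route tried in the question: the UTF-8 bytes of "ü" pass through
# unicode-escape as two Latin-1 characters, which produces the mojibake.
via_escape = cell.encode().decode('unicode-escape')

# json.loads resolves the \uXXXX escapes and leaves real non-ASCII characters alone.
parsed = json.loads(cell)

print(via_escape)
print(parsed[0])
print(parsed[1][0][0])
```

This also turns the string into nested Python lists, which may or may not be what is wanted downstream.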
