Converting an em dash to a hyphen in Python

NoDemande NoUsager Sens IdVehicule NoConduteur HeureDebutTrajet HeureArriveeSurSite HeureEffective'
42192001801 42192002715 — 157Véh 42192000153 ...
42192000003 42192002021 + 157Véh 42192000002 ...
42192001833 42192000485 — 324My3FVéh 42192000157 ...

42191001122 42191002244 ? 181Véh 42191000114 ...
42191001293 42191001203 ? 319M9pVéh 42191000125 ...
42191000700 42191000272 ? 183Véh 42191000072 ...

42191001122 42191002244 â?? 181VÃ©h 42191000114 ...
42191001293 42191001203 â?? 319M9pVÃ©h 42191000125 ...
42191000700 42191000272 â?? 183VÃ©h 42191000072 ...

'"42191002384";"42191000118";"\xe2\x80\x94";"";"42191000182";...
'"42191002464";"42191001671";"+";"";"42191000182";...
'"42191000045";"42191000176";"\xe2\x80\x94";"620M9pV\xc3\xa9h";"42191000003";...
'"42191001305";"42191000823";"\xe2\x80\x94";"310V7pV\xc3\xa9h";"42191000126";...

1 Answer

User

#1 · Posted 2024-10-03 21:29:56

u'\u2014' (EM DASH) cannot be encoded in latin1/iso-8859-1, so that value cannot appear in a correctly encoded latin1 file.

The file may instead be encoded as windows-1252, in which u'\u2014' is encoded as '\x97'.
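The two claims above can be checked directly with Python's built-in codecs. This is a minimal sketch; the helper name try_encode is made up for illustration and is not part of any library. The byte literals in the comments are how Python 3 prints them.

```python
# -*- coding: utf-8 -*-
EM_DASH = u'\u2014'


def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot represent text.

    Hypothetical helper, used only for this illustration.
    """
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


latin1_bytes = try_encode(EM_DASH, 'latin-1')       # no em dash in latin1
cp1252_bytes = try_encode(EM_DASH, 'windows-1252')  # single byte 0x97
utf8_bytes = try_encode(EM_DASH, 'utf-8')           # three bytes

print(latin1_bytes)  # None
print(cp1252_bytes)  # b'\x97'
print(utf8_bytes)    # b'\xe2\x80\x94'
```

The UTF-8 result is the same byte sequence that appears as "\xe2\x80\x94" in the repr() sample in the question.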

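Another problem is that the CSV file apparently uses whitespace as the column separator, while your code uses a semicolon. You can specify whitespace as the delimiter with delim_whitespace=True: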

df = pd.read_csv(file_, delim_whitespace=True)

You can also specify the file's encoding with the encoding parameter. read_csv() will then convert the incoming data to unicode:

df = pd.read_csv(file_, encoding='windows-1252', delim_whitespace=True)

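In Python 2 (which I assume you are using), if you do not specify an encoding, the data stays in its original encoding as byte strings, which is probably why your replacement does not work.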
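Why an undecoded value never matches the unicode em dash can be shown without pandas. The sketch below assumes the field holds the UTF-8 bytes seen in the repr() sample in the question; the variable names are illustrative only. It is written for Python 3, where comparing bytes with text simply yields False.

```python
# -*- coding: utf-8 -*-
raw_field = b'\xe2\x80\x94'  # undecoded bytes, as in the repr() sample
em_dash = u'\u2014'

# Raw bytes are not equal to the unicode character, so a replacement
# that looks for u'\u2014' finds nothing to replace.
matches_before_decoding = (raw_field == em_dash)

# After decoding with the right codec the value is text and matches.
text_field = raw_field.decode('utf-8')
matches_after_decoding = (text_field == em_dash)

replaced = text_field.replace(em_dash, u'-')

print(matches_before_decoding)  # False
print(matches_after_decoding)   # True
print(replaced)                 # -
```

Passing encoding= to read_csv() performs the decoding step for every field, which is why the later replace() call can find the character.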

Once the file is loaded correctly, you can replace the character as before:

df = pd.read_csv(file_, encoding='windows-1252', delim_whitespace=True)
# Assign the result back; replacing in place on a selected column may not update df in newer pandas
df['Sens'] = df['Sens'].replace(u'\u2014', '-')

Edit

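Now that your update shows the repr() output, the file appears to be UTF-8 encoded, not latin1 and not windows-1252. Since you are using Python 2, you need to specify the encoding when loading the CSV file: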

df = pd.read_csv(file_, sep=';', encoding='utf8')
# Assign the result back; replacing in place on a selected column may not update df in newer pandas
df['Sens'] = df['Sens'].replace(u'\u2014', '-')

Because you specify an encoding, read_csv() converts the incoming data to unicode, so replace() should now work as shown above. It should be that easy.
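For readers who want to see the same decode-then-replace idea without pandas, here is a rough sketch using only the standard library csv module on Python 3. It is not the read_csv() approach itself. The rows are the first five visible fields from the repr() sample in the question (the elided remainder is left out), and the column index for Sens is an assumption based on the header order shown there.

```python
# -*- coding: utf-8 -*-
import csv
import io

# First five fields of three sample rows from the question, as raw UTF-8 bytes.
raw = (
    b'"42191002384";"42191000118";"\xe2\x80\x94";"";"42191000182"\n'
    b'"42191002464";"42191001671";"+";"";"42191000182"\n'
    b'"42191000045";"42191000176";"\xe2\x80\x94";"620M9pV\xc3\xa9h";"42191000003"\n'
)

# Decode first, so every field is text rather than bytes.
text = raw.decode('utf-8')
rows = list(csv.reader(io.StringIO(text), delimiter=';'))

SENS = 2  # assumed position of the Sens column (third in the header)
for row in rows:
    row[SENS] = row[SENS].replace(u'\u2014', u'-')

for row in rows:
    print(row)
```

Because the bytes are decoded as UTF-8 before parsing, the accented vehicle identifier comes through as "620M9pVéh" rather than the "VÃ©h" form shown in the question.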
