I am parsing a fixed-width file and have run into a problem with one particular string. The string looks like this:
(Pdb) record.description
'D\xc3\xa9cision financi\xc3\xa8re \xc3\xa0 long trem corrig\xc3\xa9'
The fixed-width file I am parsing looks like this:
LI 41000001009 Décision financière à long trem corrigé 14 00001100 0000000000 0000000000 00080000 000000 00000 00000 00000 00081 N 05062006 00000273 00 00000000 00000001 00000000 00000000 -------- 000005
The code that parses the file and imports it into the database is as follows:
import struct, cStringIO, MySQLdb, glob, os, settings
from django.template.defaultfilters import slugify

cnv_text = lambda s: s.rstrip()

fieldspecs = [
    ('plu_number', 3, 15, cnv_text),
    ('description', 19, 80, cnv_text),
    ('price', 104, 8, cnv_text),
    ('member_price', 113, 8, cnv_text),
]

fieldspecs.sort(key=lambda x: x[1])

unpack_len = 0
unpack_fmt = ""
for fieldspec in fieldspecs:
    start = fieldspec[1] - 1
    end = start + fieldspec[2]
    if start > unpack_len:
        unpack_fmt += str(start - unpack_len) + "x"
    unpack_fmt += str(end - start) + "s"
    unpack_len = end
field_indices = range(len(fieldspecs))
unpacker = struct.Struct(unpack_fmt).unpack_from

class Record(object):
    pass

path = settings.PATH
files_to_delete = settings.GUTTER

for fname in glob.glob(path):
    with open(fname, 'r') as f:
        f = cStringIO.StringIO(f.read())
        for line in f:
            raw_fields = unpacker(line)
            record = Record()
            for x in field_indices:
                setattr(record, fieldspecs[x][0], fieldspecs[x][3](raw_fields[x]))
            db = MySQLdb.connect('localhost', settings.USER, settings.PASS, settings.DBNAME)
            cursor = db.cursor()
            fixed_member_price = int(record.member_price) / 100.0
            real_price = int(record.price) / 100.0
            try:
                cursor.execute(
                    "INSERT INTO catalog_product \
                    (name, slug, price, member_price, plu_number, description, old_price, is_active, is_featured, quantity, meta_description, image) \
                    VALUES \
                    ('%s', '%s', '%s', '%s', '%s', '%s', '00.00', false, false, 1, '', '/media/images/thumbnail-default.jpg')",
                    [record.description, slugify(record.description), str(real_price), str(fixed_member_price), record.plu_number, record.description]
                )
                db.commit()
            except:
                db.rollback()
            db.close()

for the_file in os.listdir(files_to_delete):
    file_path = os.path.join(files_to_delete, the_file)
    try:
        if os.path.isfile(file_path):
            os.unlink(file_path)
    except Exception, e:
        print e
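This code works very well for importing thousands of records at a time when the strings are plain, but as soon as a record contains special characters, it is not imported. I assume this is because the description field starts at column 19 and ends at 80, and the special characters push it past 80 characters, so it breaks because the remaining fields can no longer be mapped. Does anyone know a way to keep the UTF-8 string format so that it does not try to import 'D\xc3\xa9cision financi\xc3\xa8re \xc3\xa0 long trem corrig\xc3\xa9'?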
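The assumption about shifted columns can be checked with a small sketch. It assumes the file is laid out by character column and stored as UTF-8; `build_line` and its sample values are hypothetical stand-ins for a real record, and the format string is the one the loop above produces for the four fieldspecs.

```python
import struct

# Format string the question's loop builds from its four fieldspecs.
UNPACK_FMT = "2x15s1x80s5x8s1x8s"
unpacker = struct.Struct(UNPACK_FMT).unpack_from


def build_line(description):
    """Build a hypothetical record laid out by *character* column."""
    return (u" " * 2 + u"41000001009".ljust(15) + u" "
            + description.ljust(80) + u" " * 5
            + u"00001100" + u" " + u"00080000")


# Same record twice: once with plain ASCII, once with four accented letters.
plain = build_line(u"Decision financiere a long trem corrige").encode("utf-8")
accented = build_line(
    u"D\xe9cision financi\xe8re \xe0 long trem corrig\xe9").encode("utf-8")

plain_fields = unpacker(plain)        # price field is b'00001100'
accented_fields = unpacker(accented)  # price field is b'    0000'
```

Each accented letter takes two bytes in UTF-8, so the second line is 124 bytes instead of 120 and every field after the description is read four bytes too early. Under these assumptions, the failure would come from the byte offsets rather than from the text itself.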
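That is the UTF-8 string. What pdb prints is the repr of the byte string, in which each accented character appears as its two UTF-8 bytes (for example, \xc3\xa9 is é).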
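A minimal sketch of what this means in practice, assuming the file is UTF-8 and its columns are counted in characters: decode each line first, then slice by character position instead of unpacking bytes. The `parse_line` helper and the sample line are hypothetical and are not part of the original code.

```python
raw = b'D\xc3\xa9cision financi\xc3\xa8re \xc3\xa0 long trem corrig\xc3\xa9'
text = raw.decode('utf-8')  # 43 bytes become 39 characters

# (name, 1-based start column, width), as in the question's fieldspecs.
FIELDSPECS = [
    ('plu_number', 3, 15),
    ('description', 19, 80),
    ('price', 104, 8),
    ('member_price', 113, 8),
]


def parse_line(raw_line, fieldspecs=FIELDSPECS, encoding='utf-8'):
    """Decode the line, then cut fields by character column."""
    decoded = raw_line.decode(encoding)
    return dict(
        (name, decoded[start - 1:start - 1 + width].rstrip())
        for name, start, width in fieldspecs
    )


# Hypothetical record laid out by character column, then encoded as UTF-8.
sample = (u' ' * 2 + u'41000001009'.ljust(15) + u' ' + text.ljust(80)
          + u' ' * 5 + u'00001100' + u' ' + u'00080000').encode('utf-8')
record = parse_line(sample)
```

Here `record['price']` comes out as `'00001100'` even though the description contains multi-byte characters. Before inserting, the values can be encoded back to UTF-8 or passed as Unicode, depending on how the database connection is configured.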