Parsing a fixed-width file that contains special characters?

Posted 2024-10-01 02:40:02

I am parsing a fixed-width file and have a problem with one particular string. The string looks like this:

(Pdb) record.description 'D\xc3\xa9cision financi\xc3\xa8re \xc3\xa0 long trem corrig\xc3\xa9'
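
For reference, the escaped value above is the UTF-8 byte encoding of the accented text, and it is longer in bytes than in characters. The snippet below is a small illustration written for Python 3 (the question's code is Python 2); the names `raw` and `text` are chosen here and do not appear in the original code.

```python
# The value printed by pdb, written as a Python 3 bytes literal.
raw = b'D\xc3\xa9cision financi\xc3\xa8re \xc3\xa0 long trem corrig\xc3\xa9'

# Decoding as UTF-8 turns each two-byte sequence back into one character.
text = raw.decode('utf-8')

print(text)
# Byte count versus character count: each accented letter costs one extra byte.
print(len(raw), len(text))
```

Each of the four accented letters occupies two bytes, so any offset counted in bytes drifts by up to four positions after this field.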

The fixed-width file I am parsing looks like this:

LI 41000001009 Décision financière à long trem corrigé 14 00001100 0000000000 0000000000 00080000 000000 00000 00000 00000 00081 N 05062006 00000273 00 00000000 00000001 00000000 00000000 -------- 000005

The code that parses the file and imports it into the database is as follows:

import struct, cStringIO, MySQLdb, glob, os, settings
from django.template.defaultfilters import slugify

cnv_text = lambda s: s.rstrip()

fieldspecs = [
    ('plu_number', 3, 15, cnv_text),
    ('description', 19, 80, cnv_text),
    ('price', 104, 8, cnv_text),
    ('member_price', 113, 8, cnv_text),
]

fieldspecs.sort(key=lambda x: x[1])

unpack_len = 0
unpack_fmt = ""
for fieldspec in fieldspecs:
    start = fieldspec[1] - 1
    end = start + fieldspec[2]
    if start > unpack_len:
        unpack_fmt += str(start - unpack_len) + "x"
    unpack_fmt += str(end - start) + "s"
    unpack_len = end
field_indices = range(len(fieldspecs))
unpacker = struct.Struct(unpack_fmt).unpack_from

class Record(object):
    pass

path = settings.PATH
files_to_delete = settings.GUTTER

for fname in glob.glob(path):
    with open(fname, 'r') as f:
        f = cStringIO.StringIO(f.read())
        for line in f:
            raw_fields = unpacker(line)
            record = Record()
            for x in field_indices:
                setattr(record, fieldspecs[x][0], fieldspecs[x][3](raw_fields[x]))

            db = MySQLdb.connect('localhost', settings.USER, settings.PASS, settings.DBNAME)
            cursor = db.cursor()
            fixed_member_price = int(record.member_price) / 100.0
            real_price = int(record.price) / 100.0
            try:
                cursor.execute(
                    "INSERT INTO catalog_product \
                     (name, slug, price, member_price, plu_number, description, old_price, is_active, is_featured, quantity, meta_description, image) \
                     VALUES \
                     ('%s', '%s', '%s', '%s', '%s', '%s', '00.00', false, false, 1, '', '/media/images/thumbnail-default.jpg')",
                     [record.description, slugify(record.description), str(real_price), str(fixed_member_price), record.plu_number, record.description]
                )
                db.commit()
            except:
                db.rollback()
            db.close()
for the_file in os.listdir(files_to_delete):
    file_path = os.path.join(files_to_delete, the_file)
    try:
        if os.path.isfile(file_path):
            os.unlink(file_path)
    except Exception, e:
        print e

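This code works well for importing thousands of records with ordinary strings at a time, but as soon as a record contains special characters, that record is not imported. I think this is because the description field starts at column 19 and ends at 80: the special characters add extra bytes that push it past 80, and the parse fails because the remaining fields can no longer be mapped. Does anyone know a way to preserve the UTF-8 string so that it does not try to import 'D\xc3\xa9cision financi\xc3\xa8re \xc3\xa0 long trem corrig\xc3\xa9'?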
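
One way to test the hypothesis above is to decode each line to text first and then slice by character position, instead of unpacking byte offsets with `struct`. The following is a minimal sketch, not a drop-in fix. It assumes the file is UTF-8 and that its columns are counted in characters rather than bytes. It is written for Python 3, and `FIELDSPECS`, `parse_line`, and the synthetic `sample_text` line are hypothetical names and data introduced only for this illustration (the start columns and widths are copied from the question's `fieldspecs`).

```python
# (name, 1-based start column, width in characters), taken from the question.
FIELDSPECS = [
    ('plu_number', 3, 15),
    ('description', 19, 80),
    ('price', 104, 8),
    ('member_price', 113, 8),
]


def parse_line(raw_line, encoding='utf-8'):
    """Decode a raw line, then slice each field by character position."""
    text = raw_line.decode(encoding)
    record = {}
    for name, start, width in FIELDSPECS:
        begin = start - 1  # convert the 1-based column to a 0-based index
        record[name] = text[begin:begin + width].rstrip()
    return record


# Synthetic sample line (not the real file layout), padded by character count
# so that every field starts at the column listed in FIELDSPECS.
sample_text = (
    'LI'
    + '41000001009'.ljust(15)
    + ' '
    + 'Décision financière à long trem corrigé'.ljust(80)
    + ' ' * 5
    + '00001100'
    + ' '
    + '00000990'
)
raw_line = sample_text.encode('utf-8')

record = parse_line(raw_line)
print(record['description'])
print(record['price'], record['member_price'])

# Slicing the undecoded bytes at the same offsets lands in the wrong place,
# because the four accented letters take two bytes each.
byte_price = raw_line[103:111]
print(byte_price)
```

In this sketch the description comes back as decoded text rather than an escaped byte string. If it is then passed to MySQLdb, the connection most likely also needs to be opened with a matching character set (for example the `charset='utf8'` argument to `MySQLdb.connect`), which depends on how the database is configured.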

