How to iterate over a .txt file to create XML

Posted 2024-10-01 02:26:59


I want to create an XML file from a .txt file that looks like this:

1header
2client_1
3total_1
4promo_1
5promo_1_data
2client_2
3total_1
3total_2
4promo_1
4promo_1
4promo_1
5promo_1_data
4promo_2
5promo_2_data

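Each line begins with a specific digit, which is the id of the record type. For example, the digit 2 is the id of a new client, so every record that follows it belongs to that Summary until the iteration reaches another id 2 and creates another Summary with its own data.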

That is, the desired XML output is:

<SummaryList>
    <Summary>
        <Client_1></Client_1>
        <Total_1></Total_1>
        <Client_Promotions>
            <Promotion>
                <Promo_1></Promo_1>
                <Promo_1_data></Promo_1_data>
            </Promotion>
        </Client_Promotions>
    </Summary>
    <Summary>
        <Client_2></Client_2>
        <Total_1></Total_1>
        <Total_2></Total_2>
        <Client_Promotions>
            <Promotion>
                <Promo_1></Promo_1>
                <Promo_1></Promo_1>
                <Promo_1></Promo_1>
                <Promo_1_data></Promo_1_data>
            </Promotion>
            <Promotion>
                <Promo_2></Promo_2>
                <Promo_2_data></Promo_2_data>
            </Promotion>
        </Client_Promotions>
    </Summary>
</SummaryList>

This is what I have been trying:

filepath = 'data.txt'
lst = []
with open(filepath) as fp:
   line = fp.readline()
   while line:
       line = fp.readline()
       lst.append(line)

That builds a list with one item per line, which I can then iterate over like this:

from lxml import etree as ET
root = ET.Element('SummaryList')
root.text = '\n'
max_lines = len(lst)
line_number = 0
while line_number <= max_lines:
    if lst[line_number][0] == '2':
        a = lst[line_number]
        summary = ET.Element('Summary')
        summary.text = '\n'
        root.append(summary)
        dc = ET.Element("Client")
        dc.text = '\n'
        summary.append(dc)
        e = ET.SubElement(dc, "someClientData")
        e.text = a[1:19].strip()
        e.tail = '\n'
        dc.tail = '\n'
        totalDescount = ET.Element("Total")  # id record type 3
        totalDescount.text = '\n'
        summary.append(totalDescount)
        promoDetail  = ET.Element("ClientPromotions")  # id record type 4
        promoDetail.text = '\n'
        summary.append(promoDetail)
        summary.tail = '\n'
        
    if lst[line_number][0] == '3':
        a = lst[line_number]
        subtotal = ET.Element("SubTotal")
        subtotal.text = '\n'
        totalDescount.append(subtotal)
        e = ET.Element("description")
        e.text = a[1:101].strip()
        e.tail = '\n'
        subtotal.append(e)
        e = ET.Element("someData")
        e.text = format_money(a[101:116])
        e.tail = '\n'
        subtotal.append(e)
        subtotal.tail = '\n'
        totalDescount.tail = '\n'

    if lst[line_number][0] == '4':
        a = lst[line_number]
        promotion = ET.SubElement(promoDetail, "Promotion")  # record type 4 sub-array
        promotion.text = '\n'
        promo = ET.Element("Promo")
        promo.text = '\n'
        promotion.append(promo)
        e = ET.SubElement(promo, "someClientData")
        e.text = a[1:19]
        e.tail = '\n'
        promo.tail = '\n'
        promotion.tail = '\n'
        promoDetail= '\n'

    if lst[line_number][0] == '5':
        a = lst[line_number]
        promoData = ET.Element("Promo_data")
        promoData.text = '\n'
        promotion.append(promoData)
        promoData.tail = '\n'
    line_number += 1

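I create promoDetail in the Summary branch because I need only one of those tags per Summary. But I need to create a Promotion tag for the id record type 4 lines, collecting them until an id record type 5 is found. I have not been able to do that. I get this error: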

Traceback (most recent call last):
  File "C:/Users/tartega/PycharmProjects/LP/test.py", line 211, in <module>
    promotion = ET.SubElement(promoDetail, "Promotion") 
TypeError: Argument '_parent' has incorrect type (expected lxml.etree._Element, got str)

If I add a print here:

if lst[line_number][0] == '4':
       a = lst[line_number]
       print(type(promoDetail))

I get this result:

<class 'lxml.etree._Element'>
<class 'str'>

It looks like the first iteration goes fine, but the second one no longer has the element. Can you help me with this? I'm new to Python and lxml. Thanks.


2 Answers

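The main problem was that I wanted to create a parent tag for the elements starting with 4 every time I found an element starting with 2. To solve it, all I had to do was stop reading line by line once I found an element 2 and instead group the 4s and 5s into two separate arrays. Then I iterate using only the 4s and 5s that match the exact client of that element 2: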

while line_number <= max_lines:
    if lst[line_number][0] == '2':
        a = lst[line_number]
        five = []  # only the type-5 records for this summary
        for i in only_fives:
            if i[168:180] in a[19:32]:
                five.append(i)
        four = []
        for i in only_fours:
            if i[19:32] in a[19:32]:
                four.append(i)
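
The snippet above relies on two lists, only_fours and only_fives, that are not shown. A minimal sketch of how they might be built, assuming they simply hold every line of record type 4 and 5 from the file (the helper name split_by_record_type is made up for illustration):

```python
def split_by_record_type(lines):
    """Return (only_fours, only_fives) from an iterable of raw lines.

    Assumption: the record type is the first character of each line,
    as described in the question.
    """
    only_fours = [line for line in lines if line.startswith("4")]
    only_fives = [line for line in lines if line.startswith("5")]
    return only_fours, only_fives


# Usage with a few lines shaped like the sample in the question.
only_fours, only_fives = split_by_record_type(
    ["1header", "2client_1", "4promo_1", "5promo_1_data", "4promo_2"]
)
```

Both lists keep the original file order, which matters if the matching step later depends on it.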

This is the function that creates the data inside the tags for records 4 and 5:

def fours_and_fives(fours, fives, parent):
    for d2 in fives:
        PROMOTION= ET.SubElement(parent, "Promotion")
        PROMOTION.text = '\n'
        title= ET.SubElement(PROMOTION, "PromotionTitle")
        title.text = d2[1:51].strip()
        title.tail = '\n'
        for d1 in fours:
            if d1[139:189].strip() in d2[1:51].strip():
                dt = ET.SubElement(PROMOTION, "promo")
                dt.text = '\n'
                a = d1
                e = ET.SubElement(dt, "name")
                e.text = a[1:19]
                e.tail = '\n'
                e = ET.SubElement(dt, "idClient")
                e.text = a[19:32].lstrip('0')
                e.tail = '\n'
                e = ET.SubElement(dt, "card")
                e.text = a[32:52].strip()
                e.tail = '\n'
                e = ET.SubElement(dt, "lastDigits")
                e.text = a[52:56]
                e.tail = '\n'
                e = ET.SubElement(dt, "date")
                e.text = format_fecha(a[56:64])
                e.tail = '\n'
                e = ET.SubElement(dt, "nameMarket")
                e.text = a[64:114].strip()
                e.tail = '\n'
                e = ET.SubElement(dt, "coupon")
                e.text = a[114:122]
                e.tail = '\n'
                e = ET.SubElement(dt, "numberOfPayments")
                e.text = a[122:124]
                e.tail = '\n'
                e = ET.SubElement(dt, "spending")
                e.text = format_money(a[124:139])
                e.tail = '\n'
                e = ET.SubElement(dt, "promotion")
                e.text = a[139:189].strip()
                e.tail = '\n'
                e = ET.SubElement(dt, "refund")
                e.text = format_money(a[189:204])
                e.tail = '\n'
                e = ET.SubElement(dt, "thing")
                e.text = format_money(a[204:219])
                e.tail = '\n'
                e = ET.SubElement(dt, "dateRefund")
                e.text = format_fecha(a[219:227])
                e.tail = '\n'
                dt.tail = '\n'
                st = ET.SubElement(PROMOTION, "PromoSubtotal")  # Elements 5
                st.text = '\n'
                e = ET.SubElement(st, "promotion")
                e.text = d2[1:51].strip()
                e.tail = '\n'
                e = ET.SubElement(st, "description")
                e.text = d2[51:151].strip()
                e.tail = '\n'
                e = ET.SubElement(st, "spending")
                e.text = format_money(d2[151:166])
                e.tail = '\n'
                st.tail = '\n'
        PROMOTION.tail = '\n'

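format_money is a function that formats a string of digits as a number with dots as thousands separators and a comma as the decimal separator.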
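
The body of format_money is not shown, so the following is only a hypothetical stand-in. It assumes the fixed-width field is a zero-padded, non-negative integer whose last two digits are the decimals; the real function may differ.

```python
def format_money(raw):
    """Hypothetical stand-in, not the original implementation.

    Assumption: raw is a zero-padded string of digits and the last two
    digits are the decimal part, e.g. "000000123456789" -> 1234567.89.
    """
    value = int(raw.strip() or 0)
    whole, cents = divmod(value, 100)
    # Python groups thousands with commas; swap them for dots.
    grouped = f"{whole:,}".replace(",", ".")
    return f"{grouped},{cents:02d}"


example = format_money("000000123456789")
```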

format_fecha is another function, which formats dates.
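
format_fecha is not shown either. The slices passed to it (a[56:64], a[219:227]) are 8 characters wide, so one plausible reading is a YYYYMMDD field. The sketch below assumes that input layout and a DD/MM/YYYY output; both are guesses, not the original behaviour.

```python
from datetime import datetime


def format_fecha(raw):
    """Hypothetical stand-in, not the original implementation.

    Assumption: raw is an 8-character date in YYYYMMDD form and the
    desired output is DD/MM/YYYY.
    """
    return datetime.strptime(raw.strip(), "%Y%m%d").strftime("%d/%m/%Y")


example_date = format_fecha("20241001")
```

strptime raises ValueError on malformed input, which is usually preferable to silently emitting a bad date into the XML.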

The error you get is because you are assigning a string to promoDetail in this line:

promoDetail= '\n'

whereas in this line you assume it is an ET.Element:

promotion = ET.SubElement(promoDetail, "Promotion")

From the context, you presumably meant promoDetail.tail = '\n'.
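
To make the difference concrete, here is a small sketch. It uses the standard library's xml.etree.ElementTree so it runs without lxml installed; Element, SubElement and .tail behave the same way in lxml.etree for this purpose. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

summary = ET.Element("Summary")
promo_detail = ET.SubElement(summary, "ClientPromotions")

# Intended: set the tail attribute. promo_detail is still an Element,
# so it can keep receiving children.
promo_detail.tail = "\n"
ET.SubElement(promo_detail, "Promotion")

# The bug from the question: plain assignment rebinds the name to a str,
# so the next SubElement call no longer gets an Element as parent.
promo_detail = "\n"
error = None
try:
    ET.SubElement(promo_detail, "Promotion")
except (TypeError, AttributeError) as exc:
    error = exc

print(type(error).__name__, error)
```

That is also why the print in the question shows an _Element the first time and a str the second time: the first record of type 4 replaces the element with a string.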


A few more tips, not directly related to your question:

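  • You can use fp.readlines() to read the lines of fp into a list. If you want to drop the first line (which isn't actually necessary in your snippet), you can use fp.readlines()[1:].
  • Because Python uses zero-based indexing, max_lines - 1 is the largest index of lst in your code. The while loop should therefore only continue while line_number < max_lines (rather than line_number <= max_lines).
  • To iterate over the items of a list while keeping track of each item's index, you can use enumerate (in your example, for line_number, line in enumerate(lst)). That way you don't have to write lst[line_number] again and again.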
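
Putting those tips together, here is a rough sketch of the whole loop for the sample data in the question. It is not a drop-in replacement for your code: it uses the standard library's xml.etree.ElementTree instead of lxml, generic tag names (Client, Total, Promo, PromoData) with the rest of the line as text, and a made-up function name build_summary_list. The idea is to remember the current Summary and the currently open Promotion, and to close the Promotion when a record of type 5 arrives.

```python
import xml.etree.ElementTree as ET

SAMPLE = """1header
2client_1
3total_1
4promo_1
5promo_1_data
2client_2
3total_1
3total_2
4promo_1
4promo_1
4promo_1
5promo_1_data
4promo_2
5promo_2_data
"""


def build_summary_list(lines):
    root = ET.Element("SummaryList")
    summary = promotions = promotion = None
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        record_type, payload = line[0], line[1:]
        if record_type == "2":
            # A new client starts a new Summary and resets the open groups.
            summary = ET.SubElement(root, "Summary")
            ET.SubElement(summary, "Client").text = payload
            promotions = promotion = None
        elif summary is None:
            continue  # header or anything else before the first client
        elif record_type == "3":
            ET.SubElement(summary, "Total").text = payload
        elif record_type == "4":
            if promotions is None:  # only one container per Summary
                promotions = ET.SubElement(summary, "ClientPromotions")
            if promotion is None:  # open a new group of type-4 records
                promotion = ET.SubElement(promotions, "Promotion")
            ET.SubElement(promotion, "Promo").text = payload
        elif record_type == "5":
            if promotion is None:
                raise ValueError(f"line {line_number}: type 5 without type 4")
            ET.SubElement(promotion, "PromoData").text = payload
            promotion = None  # a type-5 record closes the current Promotion
    return root


# With a real file: open(filepath) as fp, then build_summary_list(fp.readlines())
root = build_summary_list(SAMPLE.splitlines())
print(ET.tostring(root, encoding="unicode"))
```

Because the for loop walks the list directly, there is no index to run past the end, and no variable ever gets reassigned to a string.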
