python2.7, xml, beautifulsoup4: return only the matching parent tag

Published 2024-09-28 18:50:39


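I'm trying to parse some XML and have run into a problem: I can't force it to select the requested tag only when that tag is a parent tag. For example, part of my XML is: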

<Messages>
    <Message ChainCode="LI" HotelCode="5501" ConfirmationID="5501">
      <MessageContent>
        <OTA_HotelResNotifRQ TimeStamp="2014-01-24T21:02:43.9318703Z" Version="4" ResStatus="Book">
          <HotelReservations>
            <HotelReservation>
              <RoomStays>
                <RoomStay MarketCode="CC" SourceOfBusiness="CRS">
                  <RoomRates>
                    <RoomRate EffectiveDate="2014-02-04" ExpireDate="2014-02-06" RoomTypeCode="12112" NumberOfUnits="1" RatePlanCode="RAC">
                      <Rates>
                        <Rate EffectiveDate="2014-02-04" ExpireDate="2014-02-06" RateTimeUnit="Day" UnitMultiplier="3">
                          <Base AmountBeforeTax="749.25" CurrencyCode="USD" />
                          <Total AmountBeforeTax="749.25" CurrencyCode="USD" />
                        </Rate>
                      </Rates>
                    </RoomRate>
                  </RoomRates>
                  <Total AmountBeforeTax="2247.75" CurrencyCode="USD">
                    <Taxes Amount="0.00" />
                  </Total>
                </RoomStay>
              </RoomStays>
            </HotelReservation>
          </HotelReservations>
        </OTA_HotelResNotifRQ>
      </MessageContent>
    </Message>
  </Messages>

I have parsed everything I need except the "Total" tag.

The Total tag I want is:

 <Total AmountBeforeTax="2247.75" CurrencyCode="USD">
     <Taxes Amount="0.00" />
 </Total>

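What happens is that it returns the "Total" tag that is a child of RoomRates\RoomRate\Rates\Rate. I'm trying to figure out how to make it return only the RoomStays\RoomStay\Total tag. What I currently have is:

from bs4 import BeautifulSoup as bs

# response holds the raw XML text
soup = bs(response, "xml")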

messages = soup.find_all('Message')

for message in messages:
    hotel_code = message.get('HotelCode')

    reservations = message.find_all('HotelReservation')
    for reservation in reservations:
        uniqueid_id = reservation.UniqueID.get('ID')
        uniqueid_idcontext = reservation.UniqueID.get('ID_Context')

        roomstays = reservation.find_all('RoomStay')
        for roomstay in roomstays:

            total = roomstay.Total

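Any ideas on how to specify the exact tag I want to pull? If anyone is wondering about the for loops: there are normally multiple "Message", "HotelReservation", "RoomStay", etc. tags, but I removed them to show just one. There can also sometimes be multiple Rate\Rates tags, so I can't just ask it to give me the 2nd "Total" tag.

Hopefully I've explained this well.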


1 answer

Answer 1 · posted 2024-09-28 18:50:39

There can also sometimes be multiple Rate\Rates tags, so I can't just ask it to give me the 2nd "Total" tag.

Why not iterate over all the Total tags and skip the ones that don't have a Taxes child?

reservations = message.find_all('HotelReservation')
for reservation in reservations:
    totals = reservation.find_all('Total')
    for total in totals:
        if total.find('Taxes') is not None:
            pass  # do stuff: this Total has a Taxes child
        else:
            pass  # these aren't the totals you're looking for

If you more generally want to eliminate nodes that have no children, you can do either of the following:

if next(total.children, None) is not None:
    pass  # it's a parent of something

if total.contents:
    pass  # it's a parent of something

Or you can use a function instead of a string as your filter:

total = reservation.find(lambda node: node.name == 'Total' and node.contents)

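Or you can look for other ways to locate the tag: it is a direct child of RoomStay rather than just a descendant; it is not a descendant of Rate; it is the last Total descendant under RoomStay; and so on. All of these can be done just as easily.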
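
As a rough sketch of those alternatives, the snippet below works on a trimmed copy of the question's XML (some attributes and wrapper elements are dropped for brevity). It assumes bs4 and lxml are installed, since the "xml" parser depends on lxml. The variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's XML; only the RoomStay subtree is kept.
XML = """\
<RoomStay MarketCode="CC" SourceOfBusiness="CRS">
  <RoomRates>
    <RoomRate RoomTypeCode="12112" RatePlanCode="RAC">
      <Rates>
        <Rate RateTimeUnit="Day" UnitMultiplier="3">
          <Base AmountBeforeTax="749.25" CurrencyCode="USD" />
          <Total AmountBeforeTax="749.25" CurrencyCode="USD" />
        </Rate>
      </Rates>
    </RoomRate>
  </RoomRates>
  <Total AmountBeforeTax="2247.75" CurrencyCode="USD">
    <Taxes Amount="0.00" />
  </Total>
</RoomStay>
"""

soup = BeautifulSoup(XML, "xml")  # the "xml" parser needs lxml
roomstay = soup.find("RoomStay")

# Attribute access returns the first descendant, i.e. the Total under Rate.
first_descendant = roomstay.Total

# recursive=False limits the search to direct children of RoomStay.
direct_child = roomstay.find("Total", recursive=False)

# Alternative: keep only the Total tags that have no Rate ancestor.
not_under_rate = [t for t in roomstay.find_all("Total")
                  if t.find_parent("Rate") is None]

# Alternative: take the last Total descendant in document order.
last_total = roomstay.find_all("Total")[-1]

print(first_descendant["AmountBeforeTax"])  # 749.25
print(direct_child["AmountBeforeTax"])      # 2247.75
```

Of the three, recursive=False is probably the closest match to "only the RoomStay\Total tag", because it does not depend on what the Total contains or on how many Rate elements precede it.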


That said, this seems like a perfect job for XPath, which BeautifulSoup does not support, but ElementTree and lxml do.
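
A minimal sketch of the XPath route, using only the standard library's xml.etree.ElementTree and a trimmed copy of the question's XML (several wrapper elements and attributes are omitted). ElementTree supports a limited XPath subset; the path `.//RoomStay/Total` selects Total elements that are direct children of any RoomStay, so the Total nested under Rate is not matched.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's XML; some wrapper elements are omitted.
XML = """\
<Messages>
  <Message ChainCode="LI" HotelCode="5501" ConfirmationID="5501">
    <HotelReservation>
      <RoomStays>
        <RoomStay MarketCode="CC" SourceOfBusiness="CRS">
          <RoomRates>
            <RoomRate RoomTypeCode="12112" RatePlanCode="RAC">
              <Rates>
                <Rate RateTimeUnit="Day" UnitMultiplier="3">
                  <Base AmountBeforeTax="749.25" CurrencyCode="USD" />
                  <Total AmountBeforeTax="749.25" CurrencyCode="USD" />
                </Rate>
              </Rates>
            </RoomRate>
          </RoomRates>
          <Total AmountBeforeTax="2247.75" CurrencyCode="USD">
            <Taxes Amount="0.00" />
          </Total>
        </RoomStay>
      </RoomStays>
    </HotelReservation>
  </Message>
</Messages>
"""

root = ET.fromstring(XML)

# Total elements that are direct children of a RoomStay, at any depth.
stay_totals = root.findall(".//RoomStay/Total")

amounts = [t.get("AmountBeforeTax") for t in stay_totals]
tax_amounts = [t.find("Taxes").get("Amount") for t in stay_totals]

print(amounts)      # ['2247.75']
print(tax_amounts)  # ['0.00']
```

With lxml the same idea can be written with the element's xpath() method, which accepts full XPath 1.0 expressions rather than the ElementTree subset.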
