从infotable中提取数据,同时填充缺少的值

2024-09-29 23:20:36 发布

您现在位置:Python中文网/ 问答频道 /正文

我有这个.txt文件,它的格式像XML,但问题是,网站检索到它警告我这是一个无效的XML格式。通过一些解析,我成功地用infoTable作为参考,在这些小片段中获得了我的信息。你知道吗

<infoTable> <nameOfIssuer>COMPANYONE</nameOfIssuer> <titleOfClass>SHS CLASS -A -</titleOfClass> <cusip>00000</cusip> <value>21944</value> <shrsOrPrnAmt> <sshPrnamt>3060500</sshPrnamt> <sshPrnamtType>SH</sshPrnamtType> </shrsOrPrnAmt> <investmentDiscretion>SOLE</investmentDiscretion> <votingAuthority> <Sole>3060500</Sole> <Shared>0</Shared> <None>0</None> </votingAuthority> </infoTable> <infoTable> <nameOfIssuer>COMPANYTWO</nameOfIssuer> <titleOfClass>COM</titleOfClass> <cusip>00001</cusip> <value>67822</value> <shrsOrPrnAmt> <sshPrnamt>1898717</sshPrnamt> <sshPrnamtType>SH</sshPrnamtType> </shrsOrPrnAmt> <investmentDiscretion>SOLE</investmentDiscretion> <votingAuthority> <Sole>1898717</Sole> <Shared>0</Shared> <None>0</None> </votingAuthority> </infoTable> <infoTable> <nameOfIssuer>COMPANYTHREE</nameOfIssuer> <titleOfClass>CL B NEW</titleOfClass> <cusip>00002</cusip> <value>10462145</value> <shrsOrPrnAmt> <sshPrnamt>52078974</sshPrnamt> <sshPrnamtType>SH</sshPrnamtType> </shrsOrPrnAmt> <investmentDiscretion>SOLE</investmentDiscretion> <votingAuthority> <Sole>52078974</Sole> <Shared>0</Shared> <None>0</None> </votingAuthority> </infoTable>

我的问题是我不知道如何正确地从标签中提取值。我试过这样的东西

soup = BeautifulSoup("myData") soup = find_all("nameOfIssuer")[0].readContent()

但这让我犯了越界错误。还有一个问题是,虽然this.txt没有显示它,但是我从中获取的数据缺少一些列,我想用NaN来填充这些列。因此,理想情况下,我试图得到我的数据在tsv格式

NameofIssuer TitleofClass cusip value   shrsPrnamt  shrsPrnamtType  putcall  investmentDescrestion  othermanager   vaSole  vaShared   vaNone
COMPANYONE   CL B NEW     00000 21944   3060500     SH              NaN      SOLE                   NaN            3060500 0          0
COMPANYTWO   COM          00001 67822   1898717     SH              NaN      SOLE                   NaN            1898717 0          0

编辑:根据@RomanPerekhrest的建议,我包含了一个额外的XML文件,其中显示了othermanagerputcall标记

<ns1:infoTable> <ns1:nameOfIssuer>COMPANYFOUR</ns1:nameOfIssuer> <ns1:titleOfClass>COM</ns1:titleOfClass> <ns1:cusip>00004</ns1:cusip> <ns1:value>67</ns1:value> <ns1:shrsOrPrnAmt> <ns1:sshPrnamt>36100</ns1:sshPrnamt> <ns1:sshPrnamtType>SH</ns1:sshPrnamtType> </ns1:shrsOrPrnAmt> <ns1:putCall>Call</ns1:putCall> <ns1:investmentDiscretion>DFND</ns1:investmentDiscretion> <ns1:otherManager>01, 02</ns1:otherManager> <ns1:votingAuthority> <ns1:Sole>36100</ns1:Sole> <ns1:Shared>0</ns1:Shared> <ns1:None>0</ns1:None> </ns1:votingAuthority> </ns1:infoTable> <ns1:infoTable> <ns1:nameOfIssuer>COMPANYFIVE</ns1:nameOfIssuer> <ns1:titleOfClass>SPONSORED ADS A</ns1:titleOfClass> <ns1:cusip>00005</ns1:cusip> <ns1:value>2695</ns1:value> <ns1:shrsOrPrnAmt> <ns1:sshPrnamt>339367</ns1:sshPrnamt> <ns1:sshPrnamtType>SH</ns1:sshPrnamtType> </ns1:shrsOrPrnAmt> <ns1:investmentDiscretion>DFND</ns1:investmentDiscretion> <ns1:otherManager>01, 02</ns1:otherManager> <ns1:votingAuthority> <ns1:Sole>339367</ns1:Sole> <ns1:Shared>0</ns1:Shared> <ns1:None>0</ns1:None> </ns1:votingAuthority> </ns1:infoTable>

Tags: nonevalueshsharedns1cusipsoleinfotable
2条回答

具有lxml.etreeOrderdedDictpandas库的扩展解决方案:

我们首先需要修复一个格式错误的XML内容:主要思想是用XML名称空间ns1)添加root标记。出于演示目的,输入xml(取自问题原样)被解析为一个字符串,需要进一步修改。你知道吗

from lxml import etree
import pandas as pd
import sys
from collections import OrderedDict

xml_content = '<root xmlns:ns1="http://base.google.com/ns/1.0">{}</root>'\
    .format(open('base.xml').read())
doc = etree.fromstring(xml_content)
ns = {'ns1': 'http://base.google.com/ns/1.0'}
records = []

for block in doc.findall('ns1:infoTable', namespaces=ns):
    d = OrderedDict()
    for el in block.getchildren():
        el_tag = el.tag.replace("{{{}}}".format(ns['ns1']), '')
        inner_childs = el.getchildren()
        if inner_childs:    # if element has child nodes
            prefix = 'va' if el_tag == 'votingAuthority' else ''
            d.update({prefix + child.tag.replace("{{{}}}".format(ns['ns1']), ''): child.text
                      for child in inner_childs})
        else:
            d[el_tag] = el.text
    records.append(d)

df = pd.DataFrame(records)
print(df.to_string(index=False, justify=True))

输出:

nameOfIssuer     titleOfClass  cusip value sshPrnamt sshPrnamtType putCall investmentDiscretion otherManager  vaSole vaShared vaNone
 COMPANYFOUR              COM  00004    67     36100            SH    Call                 DFND       01, 02   36100        0      0
 COMPANYFIVE  SPONSORED ADS A  00005  2695    339367            SH     NaN                 DFND       01, 02  339367        0      0

要使用所需分隔符将结果保存到csv文件中,请使用df.to_csv()例程:

df.to_csv(path_or_buf='output.csv', sep='\t', index=False)

变量data正在连接有问题的字符串(link-太长,无法粘贴到此处):

import csv
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

cols = ['nameOfIssuer', 'titleOfClass', 'cusip', 'value', 'sshPrnamt', 'sshPrnamtType', 'putCall', 'investmentDiscretion', 'otherManager', 'Sole', 'Shared', 'None']

data = []
for info_table in soup.find_all(['ns1:infotable', 'infotable']):
    row = []
    for col in cols:
        d = info_table.find([col.lower(), 'ns1:' + col.lower()])

        row.append(d.text.strip() if d else 'NaN')
    data.append(row)

headers = ['NameofIssuer', 'TitleofClass', 'cusip', 'value', 'shrsPrnamt', 'shrsPrnamtType', 'putcall', 'investmentDescrestion', 'othermanager', 'vaSole', 'vaShared', 'vaNone']
with open('data.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',',
                            quotechar='"', quoting=csv.QUOTE_MINIMAL)
    csvwriter.writerow(headers)
    csvwriter.writerows(data)

写入data.csv

NameofIssuer,TitleofClass,cusip,value,shrsPrnamt,shrsPrnamtType,putcall,investmentDescrestion,othermanager,vaSole,vaShared,vaNone
COMPANYFOUR,COM,00004,67,36100,SH,Call,DFND,"01, 02",36100,0,0
COMPANYFIVE,SPONSORED ADS A,00005,2695,339367,SH,NaN,DFND,"01, 02",339367,0,0
COMPANYONE,SHS CLASS -A -,00000,21944,3060500,SH,NaN,SOLE,NaN,3060500,0,0
COMPANYTWO,COM,00001,67822,1898717,SH,NaN,SOLE,NaN,1898717,0,0
COMPANYTHREE,CL B NEW,00002,10462145,52078974,SH,NaN,SOLE,NaN,52078974,0,0

在LibreOffice中,它看起来:

enter image description here

相关问题 更多 >

    热门问题