我有这个.txt文件,它的格式像XML,但问题是,网站检索到它警告我这是一个无效的XML格式。通过一些解析,我成功地用infoTable作为参考,在这些小片段中获得了我的信息。你知道吗
<infoTable>
<nameOfIssuer>COMPANYONE</nameOfIssuer>
<titleOfClass>SHS CLASS -A -</titleOfClass>
<cusip>00000</cusip>
<value>21944</value>
<shrsOrPrnAmt>
<sshPrnamt>3060500</sshPrnamt>
<sshPrnamtType>SH</sshPrnamtType>
</shrsOrPrnAmt>
<investmentDiscretion>SOLE</investmentDiscretion>
<votingAuthority>
<Sole>3060500</Sole>
<Shared>0</Shared>
<None>0</None>
</votingAuthority>
</infoTable>
<infoTable>
<nameOfIssuer>COMPANYTWO</nameOfIssuer>
<titleOfClass>COM</titleOfClass>
<cusip>00001</cusip>
<value>67822</value>
<shrsOrPrnAmt>
<sshPrnamt>1898717</sshPrnamt>
<sshPrnamtType>SH</sshPrnamtType>
</shrsOrPrnAmt>
<investmentDiscretion>SOLE</investmentDiscretion>
<votingAuthority>
<Sole>1898717</Sole>
<Shared>0</Shared>
<None>0</None>
</votingAuthority>
</infoTable>
<infoTable>
<nameOfIssuer>COMPANYTHREE</nameOfIssuer>
<titleOfClass>CL B NEW</titleOfClass>
<cusip>00002</cusip>
<value>10462145</value>
<shrsOrPrnAmt>
<sshPrnamt>52078974</sshPrnamt>
<sshPrnamtType>SH</sshPrnamtType>
</shrsOrPrnAmt>
<investmentDiscretion>SOLE</investmentDiscretion>
<votingAuthority>
<Sole>52078974</Sole>
<Shared>0</Shared>
<None>0</None>
</votingAuthority>
</infoTable>
我的问题是我不知道如何正确地从标签中提取值。我试过这样的东西
soup = BeautifulSoup("myData")
soup = find_all("nameOfIssuer")[0].readContent()
但这让我犯了越界错误。还有一个问题是,虽然this.txt没有显示它,但是我从中获取的数据缺少一些列,我想用NaN来填充这些列。因此,理想情况下,我试图得到我的数据在tsv格式
NameofIssuer TitleofClass cusip value shrsPrnamt shrsPrnamtType putcall investmentDescrestion othermanager vaSole vaShared vaNone
COMPANYONE CL B NEW 00000 21944 3060500 SH NaN SOLE NaN 3060500 0 0
COMPANYTWO COM 00001 67822 1898717 SH NaN SOLE NaN 1898717 0 0
编辑:根据@RomanPerekhrest的建议,我包含了一个额外的XML文件,其中显示了othermanager
和putcall
标记
<ns1:infoTable>
<ns1:nameOfIssuer>COMPANYFOUR</ns1:nameOfIssuer>
<ns1:titleOfClass>COM</ns1:titleOfClass>
<ns1:cusip>00004</ns1:cusip>
<ns1:value>67</ns1:value>
<ns1:shrsOrPrnAmt>
<ns1:sshPrnamt>36100</ns1:sshPrnamt>
<ns1:sshPrnamtType>SH</ns1:sshPrnamtType>
</ns1:shrsOrPrnAmt>
<ns1:putCall>Call</ns1:putCall>
<ns1:investmentDiscretion>DFND</ns1:investmentDiscretion>
<ns1:otherManager>01, 02</ns1:otherManager>
<ns1:votingAuthority>
<ns1:Sole>36100</ns1:Sole>
<ns1:Shared>0</ns1:Shared>
<ns1:None>0</ns1:None>
</ns1:votingAuthority>
</ns1:infoTable>
<ns1:infoTable>
<ns1:nameOfIssuer>COMPANYFIVE</ns1:nameOfIssuer>
<ns1:titleOfClass>SPONSORED ADS A</ns1:titleOfClass>
<ns1:cusip>00005</ns1:cusip>
<ns1:value>2695</ns1:value>
<ns1:shrsOrPrnAmt>
<ns1:sshPrnamt>339367</ns1:sshPrnamt>
<ns1:sshPrnamtType>SH</ns1:sshPrnamtType>
</ns1:shrsOrPrnAmt>
<ns1:investmentDiscretion>DFND</ns1:investmentDiscretion>
<ns1:otherManager>01, 02</ns1:otherManager>
<ns1:votingAuthority>
<ns1:Sole>339367</ns1:Sole>
<ns1:Shared>0</ns1:Shared>
<ns1:None>0</ns1:None>
</ns1:votingAuthority>
</ns1:infoTable>
具有
lxml.etree
、OrderdedDict
和pandas
库的扩展解决方案:我们首先需要修复一个格式错误的XML内容:主要思想是用XML名称空间(
ns1
)添加root
标记。出于演示目的,输入xml(取自问题原样)被解析为一个字符串,需要进一步修改。你知道吗输出:
要使用所需分隔符将结果保存到csv文件中,请使用
df.to_csv()
例程:变量
data
正在连接有问题的字符串(link-太长,无法粘贴到此处):写入
data.csv
:在LibreOffice中,它看起来:
相关问题 更多 >
编程相关推荐