Extracting HTML data to an Excel spreadsheet

Posted 2024-09-26 18:14:57


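I am trying to use Python to "read" an HTML document and write the output to an Excel spreadsheet. The HTML file is a table of CUs (cost units, identified by uppercase letters) and their descriptions. I want the CUs in one column and the corresponding descriptions in the next column. I use a global to accumulate text until a CU is reached and then write that text to the right column, but for some reason the code does not get through the full list of CUs, and it does not put the descriptions in the right place (each one ends up one cell down from the CU it belongs to). Can anyone help me figure out what I am doing wrong? Here is my code so far: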

from HTMLParser import HTMLParser
import xlwt

textstore = ""
cucounter = 0
textcounter = 0
wb = xlwt.Workbook()
ws = wb.add_sheet('A Test Sheet')
filename = 'C:\\Python27\\ArcGIS10.3\\Doc\\Page.html'
f = open(filename, "r").read()

class MyHTMLParser(HTMLParser):

    def handle_data(self, data):
        global cucounter, textcounter, textstore
        if data.isupper():
            try:
                ws.write(cucounter, 1, data)
                cucounter = cucounter + 1
                wb.save('ElecTest.xls')
            except UnicodeDecodeError:
                pass
        if data.isspace():
            pass
        else:
            try:
                textstore += str(data)
                if data.isupper():
                    ws.write(textcounter, 2, textstore)
                    textcounter = textcounter + 1
                    textstore = ""
                    wb.save('ElectTest.xls')
            except UnicodeDecodeError:
                pass

parser = MyHTMLParser()
parser.feed(f)
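
Reading the code above: each CU is also appended to textstore, and the accumulated text is only written when the next uppercase node arrives, so every description lands one row below its CU with the following CU appended. The workbook is also saved under two different names (ElecTest.xls and ElectTest.xls). One possible restructuring is sketched below. It is only a sketch under assumptions: it uses Python 3's html.parser rather than the Python 2 HTMLParser module from the question, it assumes every CU is its own all-uppercase text node that comes before its description, and the names CUTableParser, save_rows and the sample markup are made up for illustration. It writes to columns 0 and 1 rather than 1 and 2.

```python
from html.parser import HTMLParser


class CUTableParser(HTMLParser):
    """Collect (CU, description) pairs from the text nodes of an HTML page."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished (cu, description) pairs
        self.current_cu = None  # CU whose description is being collected
        self.parts = []         # description fragments for current_cu

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip whitespace-only nodes
        if text.isupper():
            # A new CU starts: store the previous pair first.
            self._flush()
            self.current_cu = text
        elif self.current_cu is not None:
            self.parts.append(text)

    def _flush(self):
        if self.current_cu is not None:
            self.rows.append((self.current_cu, " ".join(self.parts)))
        self.parts = []

    def close(self):
        super().close()
        self._flush()  # store the last pair
        self.current_cu = None


def save_rows(rows, path):
    """Write the pairs to an .xls file: CU in column 0, description in column 1."""
    import xlwt  # third-party; imported here so parsing works without it
    wb = xlwt.Workbook()
    ws = wb.add_sheet("CU List")
    for r, (cu, desc) in enumerate(rows):
        ws.write(r, 0, cu)
        ws.write(r, 1, desc)
    wb.save(path)  # save once, after all rows are written


# Hypothetical sample markup, not taken from the real Page.html
sample = ("<table><tr><td>ABC</td><td>First <b>item</b></td></tr>"
          "<tr><td>XYZ</td><td>Second item</td></tr></table>")
parser = CUTableParser()
parser.feed(sample)
parser.close()
print(parser.rows)
```

Calling save_rows(parser.rows, 'ElecTest.xls') afterwards would write the sheet in one pass. Note that str.isupper() is true for any string whose cased characters are all uppercase, so a description written entirely in capitals would be treated as a CU; a stricter test may be needed for the real data.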

Unfortunately, I cannot attach the HTML file in its proper format (if I could, the UnicodeDecodeError handling would make sense), but I can copy the following:

Page C/U description: type M-M


<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"> <head> <meta http-equiv=Content-Type content="text/html; charset=windows-1252"> <meta name=ProgId content=Word.Document> <meta name=Generator content="Microsoft Word 15"> <meta name=Originator content="Microsoft Word 15"> <link rel=File-List href="Page_files/filelist.xml"> <!--[if gte mso 9]><xml> <o:DocumentProperties> <o:Author>John Swordy</o:Author> <o:LastAuthor>John Swordy</o:LastAuthor> <o:Revision>1</o:Revision> <o:TotalTime>1</o:TotalTime> <o:Created>2017-02-15T16:44:00Z</o:Created> <o:LastSaved>2017-02-15T16:45:00Z</o:LastSaved> <o:Pages>2</o:Pages> <o:Words>600</o:Words> <o:Characters>3426</o:Characters> <o:Company>En Engineering</o:Company> <o:Lines>28</o:Lines> <o:Paragraphs>8</o:Paragraphs> <o:CharactersWithSpaces>4018</o:CharactersWithSpaces> <o:Version>16.00</o:Version> </o:DocumentProperties> <o:OfficeDocumentSettings> <o:AllowPNG/> </o:OfficeDocumentSettings> </xml><![endif]--> <link rel=themeData href="Page_files/themedata.thmx"> <link rel=colorSchemeMapping href="Page_files/colorschememapping.xml"> <!--[if gte mso 9]><xml> <w:WordDocument> <w:TrackMoves>false</w:TrackMoves> <w:TrackFormatting/> <w:PunctuationKerning/> <w:ValidateAgainstSchemas/> <w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid> <w:IgnoreMixedContent>false</w:IgnoreMixedContent> <w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText> <w:DoNotPromoteQF/> <w:LidThemeOther>EN-US</w:LidThemeOther> <w:LidThemeAsian>X-NONE</w:LidThemeAsian> <w:LidThemeComplexScript>X-NONE</w:LidThemeComplexScript> <w:Compatibility> <w:BreakWrappedTables/> <w:SnapToGridInCell/> <w:WrapTextWithPunct/> <w:UseAsianBreakRules/> <w:DontGrowAutofit/> <w:SplitPgBreakAndParaMark/> <w:EnableOpenTypeKerning/> <w:DontFlipMirrorIndents/> <w:OverrideTableStyleHps/> 
</w:Compatibility> <m:mathPr> <m:mathFont m:val="Cambria Math"/> <m:brkBin m:val="before"/> <m:brkBinSub m:val="&#45;-"/> <m:smallFrac m:val="off"/> <m:dispDef/> <m:lMargin m:val="0"/> <m:rMargin m:val="0"/> <m:defJc m:val="centerGroup"/> <m:wrapIndent m:val="1440"/> <m:intLim m:val="subSup"/> <m:naryLim m:val="undOvr"/> </m:mathPr></w:WordDocument> </xml><![endif]--><!--[if gte mso 9]><xml> <w:LatentStyles DefLockedState="false" DefUnhideWhenUsed="false" DefSemiHidden="false" DefQFormat="false" DefPriority="99" LatentStyleCount="371"> <w:LsdException Locked="false" Priority="0" QFormat="true" Name="Normal"/> <w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 1"/> <w:LsdException Locked="false" Priority="9" SemiHidden="true" UnhideWhenUsed="true" QFormat="true" Name="heading 2"/> <w:LsdException Locked="false" Priority="9" SemiHidden="true" UnhideWhenUsed="true" QFormat="true" Name="heading 3"/> <w:LsdException Locked="false" Priority="9" SemiHidden="true" UnhideWhenUsed="true" QFormat="true" Name="heading 4"/> <w:LsdException Locked="false" Priority="9" SemiHidden="true" UnhideWhenUsed="true" QFormat="true" Name="heading 5"/> <w:LsdException Locked="false" Priority="9" SemiHidden="true" UnhideWhenUsed="true" QFormat="true" Name="heading 6"/> <w:LsdException Locked="false" Priority="9" SemiHidden="true" UnhideWhenUsed="true" QFormat="true" Name="heading 7"/> <w:LsdException Locked="false" Priority="9" SemiHidden="true" UnhideWhenUsed="true" QFormat="true" Name="heading 8"/> <w:LsdException Locked="false" Priority="9" SemiHidden="true" UnhideWhenUsed="true" QFormat="true" Name="heading 9"/> <w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true" Name="index 1"/> <w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true" Name="index 2"/> <w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true" Name="index 3"/>
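
The meta tag in the sample above declares charset=windows-1252. Whether that is what triggers the UnicodeDecodeError in the question is an assumption, but decoding the file explicitly with that encoding before feeding it to the parser is one way to rule it out. The helper name read_html and the temporary demo file below are hypothetical; the sketch is written for Python 3.

```python
import os
import tempfile


def read_html(path, encoding="cp1252"):
    """Read an HTML file as bytes and decode it with an explicit encoding."""
    with open(path, "rb") as fh:
        raw = fh.read()
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising
    return raw.decode(encoding, errors="replace")


# Demo on a temporary file containing byte 0x96, which is an en dash in Windows-1252
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(b"<td>M \x96 M</td>")
    demo_path = tmp.name

text = read_html(demo_path)
os.remove(demo_path)
print(text)
```

The decoded string can then be passed to the parser's feed() method in place of the raw result of open(filename, "r").read().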

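If anyone can help me, I would be very grateful. Thank you for your time! Note: I am self-taught and fairly new to Python, so I apologize in advance for code that may not be pretty.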

