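I'm trying to use Python to "read" an HTML document and write the output to an Excel spreadsheet. The HTML file is a table of CUs (cost units, identified by capital letters) and their descriptions. I want the CUs in one column and the corresponding descriptions in another. I have a global that stores chunks of text until a CU is reached and then puts the text in the correct column, but for some reason the code does not get through the full list of CUs, and it does not put the descriptions in the right place (they end up one cell down from the CU they belong to). Can anyone help me figure out what I'm doing wrong? Here is my code so far: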
from HTMLParser import HTMLParser
import xlwt
global wb
global ws
global cucounter
global textcounter
global tempcu
textstore = ""
cucounter = 0
textcounter = 0
wb = xlwt.Workbook()
ws = wb.add_sheet('A Test Sheet')
filename = 'C:\\Python27\\ArcGIS10.3\\Doc\\Page.html'
f = open(filename, "r").read()
class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        if data.isupper():
            try:
                global cucounter
                ws.write(cucounter, 1, data)
                cucounter = cucounter + 1
                wb.save('ElecTest.xls')
            except UnicodeDecodeError:
                pass
        if data.isspace():
            pass
        else:
            try:
                global textstore
                textstore += str(data)
                if data.isupper():
                    global textstore
                    global textcounter
                    ws.write(textcounter, 2, textstore)
                    textcounter = textcounter + 1
                    textstore = ""
                    wb.save('ElectTest.xls')
            except UnicodeDecodeError:
                pass
parser = MyHTMLParser()
parser.feed(f)
Unfortunately, I can't add the HTML file in its proper format (if I could, the UnicodeDecodeError handling would make sense), but I can copy the following:
Page C/U Description: M-M Type
<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 15">
<meta name=Originator content="Microsoft Word 15">
<link rel=File-List href="Page_files/filelist.xml">
<!--[if gte mso 9]><xml>
<o:DocumentProperties>
<o:Author>John Swordy</o:Author>
<o:LastAuthor>John Swordy</o:LastAuthor>
<o:Revision>1</o:Revision>
<o:TotalTime>1</o:TotalTime>
<o:Created>2017-02-15T16:44:00Z</o:Created>
<o:LastSaved>2017-02-15T16:45:00Z</o:LastSaved>
<o:Pages>2</o:Pages>
<o:Words>600</o:Words>
<o:Characters>3426</o:Characters>
<o:Company>En Engineering</o:Company>
<o:Lines>28</o:Lines>
<o:Paragraphs>8</o:Paragraphs>
<o:CharactersWithSpaces>4018</o:CharactersWithSpaces>
<o:Version>16.00</o:Version>
</o:DocumentProperties>
<o:OfficeDocumentSettings>
<o:AllowPNG/>
</o:OfficeDocumentSettings>
</xml><![endif]-->
<link rel=themeData href="Page_files/themedata.thmx">
<link rel=colorSchemeMapping href="Page_files/colorschememapping.xml">
<!--[if gte mso 9]><xml>
<w:WordDocument>
<w:TrackMoves>false</w:TrackMoves>
<w:TrackFormatting/>
<w:PunctuationKerning/>
<w:ValidateAgainstSchemas/>
<w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid>
<w:IgnoreMixedContent>false</w:IgnoreMixedContent>
<w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText>
<w:DoNotPromoteQF/>
<w:LidThemeOther>EN-US</w:LidThemeOther>
<w:LidThemeAsian>X-NONE</w:LidThemeAsian>
<w:LidThemeComplexScript>X-NONE</w:LidThemeComplexScript>
<w:Compatibility>
<w:BreakWrappedTables/>
<w:SnapToGridInCell/>
<w:WrapTextWithPunct/>
<w:UseAsianBreakRules/>
<w:DontGrowAutofit/>
<w:SplitPgBreakAndParaMark/>
<w:EnableOpenTypeKerning/>
<w:DontFlipMirrorIndents/>
<w:OverrideTableStyleHps/>
</w:Compatibility>
<m:mathPr>
<m:mathFont m:val="Cambria Math"/>
<m:brkBin m:val="before"/>
<m:brkBinSub m:val="--"/>
<m:smallFrac m:val="off"/>
<m:dispDef/>
<m:lMargin m:val="0"/>
<m:rMargin m:val="0"/>
<m:defJc m:val="centerGroup"/>
<m:wrapIndent m:val="1440"/>
<m:intLim m:val="subSup"/>
<m:naryLim m:val="undOvr"/>
</m:mathPr></w:WordDocument>
</xml><![endif]--><!--[if gte mso 9]><xml>
<w:LatentStyles DefLockedState="false" DefUnhideWhenUsed="false"
DefSemiHidden="false" DefQFormat="false" DefPriority="99"
LatentStyleCount="371">
<w:LsdException Locked="false" Priority="0" QFormat="true" Name="Normal"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 1"/>
<w:LsdException Locked="false" Priority="9" SemiHidden="true"
UnhideWhenUsed="true" QFormat="true" Name="heading 2"/>
<w:LsdException Locked="false" Priority="9" SemiHidden="true"
UnhideWhenUsed="true" QFormat="true" Name="heading 3"/>
<w:LsdException Locked="false" Priority="9" SemiHidden="true"
UnhideWhenUsed="true" QFormat="true" Name="heading 4"/>
<w:LsdException Locked="false" Priority="9" SemiHidden="true"
UnhideWhenUsed="true" QFormat="true" Name="heading 5"/>
<w:LsdException Locked="false" Priority="9" SemiHidden="true"
UnhideWhenUsed="true" QFormat="true" Name="heading 6"/>
<w:LsdException Locked="false" Priority="9" SemiHidden="true"
UnhideWhenUsed="true" QFormat="true" Name="heading 7"/>
<w:LsdException Locked="false" Priority="9" SemiHidden="true"
UnhideWhenUsed="true" QFormat="true" Name="heading 8"/>
<w:LsdException Locked="false" Priority="9" SemiHidden="true"
UnhideWhenUsed="true" QFormat="true" Name="heading 9"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="index 1"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="index 2"/>
<w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
Name="index 3"/>
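I'd really appreciate it if anyone could help. Thanks for your time! Note: I'm self-taught and still fairly new to Python, so I apologize in advance for code that may not be pretty.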
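For illustration, the pairing logic described in the question (buffer description text until the next CU shows up, then emit the CU together with its own description) could be sketched roughly as below. This is a minimal sketch under stated assumptions, not a confirmed fix for the code above: it uses Python 3's `html.parser` rather than the Python 2 `HTMLParser` module, it assumes a CU is a text node whose letters are all uppercase, and the names `CUTableParser`, `rows`, and `parse_cus` are hypothetical. It collects the pairs in a list instead of writing to Excel.

```python
from html.parser import HTMLParser


class CUTableParser(HTMLParser):
    """Collect (CU, description) pairs from the text nodes of a document.

    Assumption: a CU is a text node whose letters are all uppercase, and
    every text node up to the next CU belongs to that CU's description.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # finished (cu, description) pairs
        self._cu = None     # CU currently being described
        self._parts = []    # description fragments for that CU

    def _flush(self):
        # Emit the pending CU together with its own description.
        if self._cu is not None:
            self.rows.append((self._cu, " ".join(self._parts)))
        self._cu, self._parts = None, []

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return          # skip whitespace-only nodes
        if text.isupper():
            self._flush()   # close the previous CU before starting a new one
            self._cu = text
        elif self._cu is not None:
            self._parts.append(text)

    def close(self):
        super().close()
        self._flush()       # make sure the last CU is not lost


def parse_cus(html_text):
    """Return a list of (cu, description) tuples found in html_text."""
    parser = CUTableParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows
```

Because each pair is only emitted once its description is complete, the CU and its description share one index, so a loop such as `for i, (cu, desc) in enumerate(rows)` could write both with `ws.write(i, 0, cu)` and `ws.write(i, 1, desc)` and save the workbook once at the end. A description written entirely in capitals would be mistaken for a CU under this assumption.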