Parsing a website table with empty entries in Python

class MyParser(HTMLParser):
    def __init__(self, *args, **kwargs):
        # There are only 2 tables in the source code. Outer one is useless to me
        self.outerloop = True
        # Set to true when we are in the table, and we want to collect data
        self.capture_data = False
        # Array to store the captured data
        self.dataArray = []
        HTMLParser.__init__(self, *args, **kwargs)

    def handle_starttag(self, tag, attrs):
        if tag == 'table' and self.outerloop:
            self.outerloop = False
        elif tag == 'td' and not self.outerloop:
            self.capture_data = True
        elif tag == 'th':
            self.capture_data = False

    def handle_endtag(self, tag):
        if tag == 'table':
            self.capture_data = False

    def handle_data(self, data):
        if self.capture_data:
            self.dataArray.append(data)

# Function to call the parser
def getData(self):
    self.p = MyParser()
    url = 'http://www.mysite.com/get.php'
    content = urllib.urlopen(url).read()
    self.p.feed(content)

    val = 0
    resultString = ""
    while val < len(self.p.dataArray):
        resultString += self.p.dataArray[val] + ","
        val += 1

    return HttpResponse(resultString[:-1])

2 Answers

Answer 1 · edited 2024-09-28 01:27:44

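OK, I know answering your own question is frowned upon, but in case someone runs into this problem in the future, I'll post my solution.
I fixed it with two integer counters, both starting at 0. When I encounter the start tag in question (<td>), I increment the first number. When data is handled, I update the second number to match the first. When I reach the end tag for that same tag, I check whether the two numbers are equal; if data was handled for the cell, they should be.
If the numbers are not equal, the parser handled no data for that cell, i.e. the tag was empty. In that case I simply append a filler value ("NA" in the code) to the array, and that made it work.
Here is the code: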
class MyHTMLParser(HTMLParser):
    def __init__(self, *args, **kwargs):
        self.outerloop = True
        self.capture_data = False
        self.dataArray = []
        self.celldata="NA"
        self.firstnum=0
        self.secondnum=0
        HTMLParser.__init__(self, *args, **kwargs)

    def handle_starttag(self, tag, attrs):
        if tag == 'table' and self.outerloop:
            self.outerloop=False
        elif tag=='td' and not self.outerloop:
            self.capture_data=True # bool to indicate we want to capture data
            self.firstnum+=1    # increment first num to say we have encountered the tag in question
        elif tag=='th':
            self.capture_data=False

    def handle_endtag(self, tag):
        if tag == 'table':
            self.capture_data=False
        elif tag == 'td' and not self.firstnum == self.secondnum:   #check if they are not equal
            self.dataArray.append(self.celldata)    # append filler data
            self.secondnum=self.firstnum    # make them equal for next tag

    def handle_data(self, data):
        if self.capture_data:
            self.dataArray.append(data)
            self.secondnum=self.firstnum

def getTides(self):
    self.p = MyHTMLParser()

    url = 'http://www.mysite.com/page.php'
    content = urllib.urlopen(url).read()
    self.p.feed(content)

    val=0
    resultString=""

    while val < len(self.p.dataArray):
        resultString+=self.p.dataArray[val]+","
        val+=1

    return HttpResponse(resultString[:-1])
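
The `while` loop in `getTides` builds the comma-separated string by hand and then trims the trailing comma. A minimal sketch of the same step using `str.join`, assuming every collected entry is already a string; `to_csv_line` is a hypothetical helper name, not part of the original answer:

```python
def to_csv_line(values):
    """Join already-collected cell strings with commas.

    Hypothetical helper: gives the same result as appending "," after
    each item and then slicing off the last character.
    """
    return ",".join(values)
```

With this, the end of `getTides` could read `return HttpResponse(to_csv_line(self.p.dataArray))`.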

Answer 2 · edited 2024-09-28 01:27:44

One possible solution is to set cell_data = "" when a <td> is found, update it with cell_data += data in handle_data, and append cell_data to the data array on </td>.
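
A rough sketch of that idea follows. It is written for Python 3 (`html.parser`), whereas the code in the question appears to target Python 2, and it leaves out the outer-table skipping from the question, so it assumes a single, non-nested table. The names `CellBufferParser`, `parse_cells`, and the `placeholder` parameter are illustrative assumptions, not taken from the original posts.

```python
from html.parser import HTMLParser  # Python 3 module name (assumption)


class CellBufferParser(HTMLParser):
    """Collect exactly one entry per <td>, even when the cell is empty."""

    def __init__(self, placeholder="N/A"):
        super().__init__()
        self.placeholder = placeholder  # value stored for empty cells
        self.in_cell = False            # True between <td> and </td>
        self.cell_data = ""             # text gathered for the current cell
        self.data_array = []            # one entry per closed cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # Start a fresh buffer for every cell.
            self.in_cell = True
            self.cell_data = ""

    def handle_data(self, data):
        if self.in_cell:
            # Data may arrive in several chunks; accumulate them.
            self.cell_data += data

    def handle_endtag(self, tag):
        if tag == "td" and self.in_cell:
            text = self.cell_data.strip()
            # An empty (or whitespace-only) cell still yields an entry.
            self.data_array.append(text or self.placeholder)
            self.in_cell = False


def parse_cells(html, placeholder="N/A"):
    """Feed an HTML string to the parser and return the collected cells."""
    parser = CellBufferParser(placeholder)
    parser.feed(html)
    parser.close()
    return parser.data_array
```

Because the entry is appended only on `</td>`, a cell whose text arrives in several `handle_data` calls still produces a single entry, and text between cells is ignored.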
