Python: BeautifulSoup returns the wrong decoding for tis-620 when the charset is windows-874

Source:

http-equiv="Content-Type" content="text/html; charset=windows-874"

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-874" />
<meta name="description" content="Discussion Forum" />
</head>
<body>
hello
<dl>
<dt>no English Thai abbrev. phonemic</dt>
<dt>01 January มกราคม ม.ค. mohkH gaL raaM khohmM</dt>
<dt>02 February กุมภาพันธ์ ก.พ. goomM phaaM phanM</dt>
<dt>03 March มีนาคม มี.ค. meeM naaM khohmM</dt>
<dt>04 April เมษายน เม.ย. maehM saaR yohnM</dt>
<dt>05 May พฤษภาคม พ.ค. phreutH saL phaaM khohmM</dt>
<dt>06 June มิถุนายน มิ.ย. miH thooL naaM yohnM</dt>
<dt>07 July กรกฎาคม ก.ค. gaL raH gaL daaM khohmM</dt>
<dt>08 August สิงหาคม ส.ค. singR haaR khohmM</dt>
<dt>09 September กันยายน ก.ย. ganM yaaM yohnM</dt>
<dt>10 October ตุลาคม ต.ค. dtooL laaM khohmM</dt>
<dt>11 November พฤศจิกายน พ.ย. phreutH saL jiL gaaM yohnM</dt>
<dt>12 December ธันวาคม ธ.ค. thanM waaM khohmM</dt>
</dl>
</body>
</html>
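For reference, here is a minimal Python 3 sketch of decoding such a page by hand before giving it to any parser. It assumes the bytes really are windows-874 as the meta tag says. The helper names `declared_charset` and `decode_html` and the `CODEC_ALIASES` table are made up for this illustration, and the alias entry is a precaution in case the Python build does not recognise the name "windows-874" directly.

```python
import codecs
import re

# Assumption: map charset names Python may not recognise onto a codec it ships with.
CODEC_ALIASES = {"windows-874": "cp874"}


def declared_charset(raw, default="utf-8"):
    """Return the first charset=... value found in the raw bytes, lowercased."""
    match = re.search(rb'charset\s*=\s*["\']?([\w-]+)', raw, re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else default


def decode_html(raw):
    """Decode raw HTML bytes using the charset the document itself declares."""
    name = declared_charset(raw)
    codec = codecs.lookup(CODEC_ALIASES.get(name, name))
    return raw.decode(codec.name)


# A small stand-in for the page above, encoded the way the server would send it.
sample = (
    '<meta http-equiv="Content-Type" content="text/html; charset=windows-874" />'
    "<dt>03 March มีนาคม มี.ค.</dt>"
).encode("cp874")

print(declared_charset(sample))
print(decode_html(sample))
```

The decoded string can then be passed to a parser as text, so the parser no longer has to guess the encoding.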

Python program:

2. Example without BeautifulSoup

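import urllib2
import sys

# reload(sys) has to come first: site.py removes sys.setdefaultencoding at startup.
reload(sys)
sys.setdefaultencoding("tis-620")

fthai = open('translation_thai.html', 'r')
array = []
x = -1
for line in fthai:
    x = x + 1
    array.append(line)
    print array[x], x

# Line 17 is the "03 March" row; its fields are tab-separated.
sp = array[17].split('\t')
sp1 = sp[2].encode('utf8')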

sp returns ['03', 'March', '\xc1\xd5\xb9......

sp1 returns '\xe0\xb8\xa1\xe0\xb8\xb5

Correct!

According to the UTF-8 table:

Decimal  Hex  Char  UTF-8 (hex)  UTF-8 (binary)
3617     e21  ม     E0B8A1       1110 0000 1011 1000 1010 0001
3637     e35  ี     E0B8B5       1110 0000 1011 1000 1011 0101
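The same check can be reproduced in a few lines of Python 3. This is only a sketch for verification: it takes the three bytes shown in sp, decodes them as tis-620, and prints each character's code point together with its UTF-8 bytes so they can be compared with the table rows.

```python
# The raw bytes reported in sp[2] above (first three bytes only).
raw = b"\xc1\xd5\xb9"

# Decode the TIS-620 bytes to text, then encode that text as UTF-8.
text = raw.decode("tis-620")
utf8 = text.encode("utf-8")

for ch in text:
    encoded = ch.encode("utf-8")
    bits = " ".join(f"{byte:08b}" for byte in encoded)
    # decimal code point, U+hex, character, UTF-8 hex, UTF-8 binary
    print(ord(ch), f"U+{ord(ch):04X}", ch, encoded.hex().upper(), bits)
```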

Can anyone tell me how to correct the wrong behavior?

1 answer

User

#1 · Posted 2024-06-02 02:18:31

My solution:

I now call convert instead of encode('utf-8'). It is a bit slow, but it works.

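def convert(content):
    # Python 2: rebuild UTF-8 from text that was mis-decoded one byte per character.
    result = ''
    for char in content:
        # An ASCII character encodes to a single byte, so the slice is empty.
        # A non-ASCII character becomes '\xNN' and the slice keeps the hex digits.
        asciichar = char.encode('ascii', errors="backslashreplace")[2:]
        if asciichar == '':
            utf8char = char.encode('utf8')
        else:
            try:
                # Turn the hex digits back into the original raw byte(s).
                hexchar = asciichar.decode('hex')
                # Encoding a byte string first decodes it with the default
                # encoding (presumably tis-620, as set in the question).
                utf8char = hexchar.encode('utf-8')
            except Exception:
                # Anything that cannot be converted becomes a space.
                utf8char = ' '
        result = result + utf8char
    return result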
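A possibly faster alternative, sketched here in Python 3 rather than the Python 2 used above. It assumes the text was mis-decoded as Latin-1, i.e. one character per original byte, which is the case convert handles with its '\xNN' escapes. The name `repair_mojibake` and its parameters are hypothetical, and I have not tested it against the original page.

```python
def repair_mojibake(text, wrong="latin-1", right="cp874"):
    """Undo a wrong decode by recovering the bytes and decoding them again.

    Assumes `text` was produced by decoding Thai bytes with `wrong`.
    """
    try:
        # Recover the original bytes; works only if every character fits `wrong`.
        raw = text.encode(wrong)
    except UnicodeEncodeError:
        # Characters outside `wrong` suggest the text was decoded correctly already.
        return text
    # Bytes that cp874 does not define are replaced instead of raising.
    return raw.decode(right, errors="replace")


print(repair_mojibake("\xc1\xd5\xb9"))  # the mis-decoded start of the March row
print(repair_mojibake("March"))         # plain ASCII passes through unchanged
```

This converts the whole string in two calls instead of looping over every character, which is where most of the slowness in convert comes from.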

