'ascii' codec can't encode character

Posted 2024-09-29 17:14:32

I am trying to fetch an HTML link in code and get its source as a list of strings. Because I need to extract some relevant data from it, I am decoding everything as UTF-8.

I am also using BeautifulSoup4, which extracts the text in decoded form.

This is the code I used.

def do_underline(line,mistakes):
    last = u'</u></font>'
    first = u"<u><font color='red'>"
    a = [i.decode(encoding='UTF-8', errors='ignore') for i in line]
    lenm = len(mistakes)
    for i in range(lenm):
        a.insert(mistakes[lenm-i-1][2],last)
        a.insert(mistakes[lenm-i-1][1],first)
    b = u''
    return b.join(a)

def readURL(u):
    """
    URL -> List

    Opens a webpage's source code and extracts its text
    along with blank and new lines.
    Enumerates all lines (including blank and new lines).

    """
    global line_dict,q
    line_dict = {}
    p = opener.open(u)
    p1 = p.readlines()
    q = [i.decode(encoding = 'UTF-8',errors='ignore') for i in p1]
    q1 = [BeautifulSoup(i).get_text() for i in q]
    q2 = list(enumerate(q1))
    line_dict = {i:j for (i,j) in enumerate(q)}
    return q2

def process_file(f):
    """
    (.html file) -> List of Spelling Mistakes
    """
    global line_dict
    re = readURL(f)
    de = del_blankempty(re)
    fd = form_dict(de)

    fflist = []
    chklst = []

    for i in fd:
        chklst = chklst + list_braces(i,line_dict)
        fflist = fflist + find_index_mistakes(i,fd)

    final_list = list(set(is_inside_braces_or_not(chklst,fflist)))

    final_dict = {i:sorted(list(set([final_list[j] for j in range(len(final_list)) if final_list[j][0] == i])),key=lambda student: student[1]) for i in fd}

    for i in line_dict:
        if i in fd:
            line_dict[i] = do_underline(line_dict[i],final_dict[i])
        else:
            line_dict[i] = line_dict[i]

    create_html_file(line_dict)
    print "Your Task is completed"

def create_html_file(a):
    import io
    fl = io.open('Spellcheck1.html','w', encoding='UTF-8')
    for i in a:
        fl.write(a[i])
    print "Your HTML text file is created"

Every time I run the script, I get the following error.

^{pr2}$

Any suggestions on how to get rid of this error? If there is a way to decode everything coming from a given link as UTF-8, I think that would solve the problem.
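
The traceback is not reproduced above, so the following is a guess at the cause rather than a confirmed diagnosis. In Python 2, calling `.decode()` on a value that is already a `unicode` object makes the interpreter first encode it with the default `ascii` codec, and that step fails on any non-ASCII character with an "'ascii' codec can't encode character" error. `readURL` already decodes every line, and `do_underline` then calls `.decode()` again on each character of those lines, which matches that pattern. A minimal sketch of one way around it is to decode only when the value is still bytes. The helper `to_text` and the sample data are hypothetical and not part of the original script; the `(line_no, start, end)` layout of each mistake is inferred from how the script indexes the tuples.

```python
def to_text(value, encoding="utf-8"):
    """Return text, decoding only when `value` is still raw bytes.

    Hypothetical helper, not part of the original script.
    """
    if isinstance(value, bytes):
        return value.decode(encoding, "ignore")
    # Already text: decoding it again is what would trigger the
    # implicit ASCII encode in Python 2.
    return value


def do_underline(line, mistakes):
    """Wrap each (line_no, start, end) span of `line` in underline markup."""
    first = u"<u><font color='red'>"
    last = u"</u></font>"
    chars = list(to_text(line))
    # Work from the rightmost mistake backwards so that earlier
    # indices are not shifted by the inserted tags.
    for _, start, end in reversed(mistakes):
        chars.insert(end, last)
        chars.insert(start, first)
    return u"".join(chars)


# Made-up sample: one line as raw bytes, one mistake covering "wrld".
raw_line = u"caf\u00e9 wrld\n".encode("utf-8")
result = do_underline(raw_line, [(0, 5, 9)])
```

Here `result` holds the line with the tags around characters 5 to 9, and passing an already decoded line gives the same output because `to_text` leaves text untouched. The general idea is to decode once, at the point where the bytes come off the network, and keep everything as text from there until it is written out.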

