encode() doesn't work in all cases

Posted 2024-05-09 23:00:59


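I'm using BeautifulSoup4 to scan an HTML file and extract certain features. Specifically, I'm using it to find football players' names, clubs, leagues, stats, and so on. Many player and club names contain accented characters, so I was looking for a way to print those accents instead of seeing output like "Kak\xe1". I got it working by using: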

# open html page
fut_page = open('futhead1.html','r')
# read content from html page
fut_read = fut_page.read()
# html parsed page
fut_soup = BeautifulSoup(fut_read, "html.parser")
# grabs all players
players = fut_soup.findAll('li',{'class':'list-group-item list-group-table-row player-group-item dark-hover'})
player = players[2]
# name_tag contains tag with player's name
name_tag = player.find("span",{"class":"player-name"})
# extract just the player's name
player_name = name_tag.text
print player_name.encode('utf-8')

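This prints the correct player name, "Kaká". However, I don't get the same result when I use a regular expression to extract the club name, for example: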

regex_club = re.compile(ur'\[.*?</strong>\\n\s+\|\s\\n\s+(.*?)\\n', re.MULTILINE)
# extract club name
player_club = re.match(regex_club, str(pos_clb_lge_tag))
print player_club.group(1).encode('utf-8')

This code does print the right club, e.g. "Atl\xe9tico Madrid", but encode() fails to replace the "\xe9" with "é".

Here is the fragment of the HTML file that I apply the regex to:

<li class="list-group-item list-group-table-row player-group-item dark-hover">
<div class="content player-item font-24">
    <a class="display-block padding-0" href="/fifa-mobile/17/players/33194/jan-oblak/">
        <span class="player-rating stream-col-50 text-center">
            <span class="revision-gradient shadowed font-12 fut elite">100</span>
        </span>
        <span class="player-info">
            <img class="player-image" src="http://futhead.cursecdn.com/static/img/fm/17/players/200389_SASC.png">
            <img class="player-program" src="http://futhead.cursecdn.com/static/img/fm/17/resources/program_17_VSATTACK.png">
            <span class="player-name">Jan Oblak</span>
            <span class="player-club-league-name">
                <strong>GK</strong>
                 | 
                Atlético Madrid
                 | 
                LaLiga Santander
            </span>
        </span>

        <span class="player-right text-center hidden-xs">
            <span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">83</span><span class="hover-label">PAC</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">50</span><span class="hover-label">SHO</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">66</span><span class="hover-label">PAS</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">55</span><span class="hover-label">DRI</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">58</span><span class="hover-label">DEF</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">85</span><span class="hover-label">PHY</span></span><span class="player-stat stream-col-60 font-12 font-medium text-upper">35</span>
        </span>
        <span class="player-right slide hidden-sm hidden-xs" data-direction="right" data-max="-482px">
            <span class="slide-content text-upper">
                <span class="trigger icon icon-dots-three-horizontal"></span>


                <span class="player-stat stream-col-80">
                    <span class="value">+2</span>
                    <span class="hover-label">MRK</span>
                </span>


                <span class="player-stat stream-col-80">
                    <span class="value">+1</span>
                    <span class="hover-label">OVR</span>
                </span>

                <span class="player-stat stream-col-100"><span class="value">right</span><span class="hover-label">Strong Foot</span></span>
                <span class="player-stat stream-col-100"><span class="value">18<span class="icon icon-star gold margin-l-4"></span></span><span class="hover-label">Weak Foot</span></span>
            </span>
        </span>

    </a>
</div>

So basically, why doesn't encode() work when I use a regex in between? Let me know if further clarification is needed. Many thanks.


1 Answer
User
#1 · Posted 2024-05-09 23:00:59

I suspect you haven't shown all of your code (see [mcve]), but calling str on a Unicode object is wrong and should raise:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 40: ordinal not in range(128)

I suspect you have called setdefaultencoding, which is a bad habit.

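What str() is doing here is turning the Unicode string into a byte string that contains the escape codes as literal text, e.g. '\\n' (two characters) rather than '\n' (one character), and the same goes for non-ASCII characters. Encoding that byte string afterwards cannot turn the literal text \xe9 back into é.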
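The difference between a real character and its escape code written out as text can be seen in a small sketch. This is Python 3, not the Python 2 used in the question, and it assumes the value passed to str() was something like a list of Unicode strings. Here ascii() stands in for the escaping step, so it illustrates the effect rather than reproducing the original program.

```python
# Python 3 sketch: ascii() stands in for the escaping that str() produced in the question.
club = "Atl\xe9tico Madrid"      # contains one real character, U+00E9 (é)
escaped = ascii([club])          # escape codes become literal text

print(len(club))                 # 15
print(escaped)                   # ['Atl\xe9tico Madrid']
print(escaped.encode("utf-8"))   # b"['Atl\\xe9tico Madrid']"

# The real character is gone; only backslash, x, e, 9 remain as text.
print("\xe9" in escaped)         # False
print("\\xe9" in escaped)        # True
```

Once the escape has become four ordinary ASCII characters, encode('utf-8') has nothing left to convert, which matches the symptom described in the question.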

If your terminal is configured correctly, you also don't need to encode the final result manually when printing.
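One way to check what "configured correctly" means in practice is to look at the encoding Python reports for standard output. The following is a Python 3 sketch; can_print is a hypothetical helper written for this illustration, not part of any library.

```python
import sys

def can_print(text, encoding):
    """Return True if `text` can be encoded with `encoding` without errors."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

club = "Atl\xe9tico Madrid"

# sys.stdout.encoding can be None when output is redirected in some setups.
stdout_encoding = sys.stdout.encoding or "ascii"
print("stdout encoding:", stdout_encoding)
print("printable on this stdout:", can_print(club, stdout_encoding))
print("printable as UTF-8:", can_print(club, "utf-8"))
print("printable as ASCII:", can_print(club, "ascii"))
```

If the reported encoding can represent the text, printing the Unicode string directly should work without a manual encode() call.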

Here is a working example that uses BeautifulSoup to retrieve the text to parse:

from bs4 import BeautifulSoup
import re

# open html page
fut_page = open('futhead1.html','r')
# read content from html page
fut_read = fut_page.read()
# html parsed page
fut_soup = BeautifulSoup(fut_read, "html.parser")
# grabs all players
players = fut_soup.findAll('li',{'class':'list-group-item list-group-table-row player-group-item dark-hover'})
player = players[0]
# name_tag contains the span with position, club and league
name_tag = player.find("span",{"class":"player-club-league-name"})
# the last child is the text node after <strong>; it is already Unicode, so no str()
pos_clb_lge_tag = name_tag.contents[-1]
# match real newlines in the Unicode text (Python 2 ur'' literal)
regex_club = re.compile(ur'\n\s+\|\s\n\s+(.*?)\n')
# extract club name
player_club = regex_club.match(pos_clb_lge_tag)
# print the Unicode string directly, without a manual encode()
print player_club.group(1)

Atlético Madrid
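The regex step can also be tried without the HTML file. The sketch below is Python 3 and uses only the standard library. The text_node string is an assumption: it is typed in by hand to resemble the text that follows <strong> in the snippet from the question, and extract_club is a hypothetical helper name.

```python
import re

# Hand-written stand-in for the text node after <strong>GK</strong> (an assumption).
text_node = (
    "\n                 | \n"
    "                Atl\xe9tico Madrid\n"
    "                 | \n"
    "                LaLiga Santander\n"
    "            "
)

# Same pattern as in the answer, written as a Python 3 raw string.
regex_club = re.compile(r"\n\s+\|\s\n\s+(.*?)\n")

def extract_club(text):
    """Return the club name captured by the pattern, or None if it does not match."""
    match = regex_club.match(text)
    return match.group(1) if match else None

club = extract_club(text_node)
print(ascii(club))   # 'Atl\xe9tico Madrid'
print(len(club))     # 15
```

Because the pattern runs on the text itself rather than on an escaped rendering of it, the captured group still holds the real é character.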
