Python: searching HTML for a Unicode string with index/find returns the wrong position

from urllib2 import Request, urlopen

url = 'http://guiasamarillas.com.mx/buscador/?actividad=Chedraui&localidad=&id_page=1'
post = None
headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)'}
req = Request(url, post, headers)
conn = urlopen(req)

html = conn.read()

str_start = html.index(u'Aproxim\xe1damente ')
str_end = html.find(' resultados', str_start + 16)
print html[str_start+16:str_end]

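2 answers

User

#1 · edited 2024-09-24 22:31:55

It's not entirely clear what you are trying to do, but if I have guessed right and you want to pull the approximate number of results out of the HTML, you are better off using regular expressions via the re module.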

import re
# s is assumed to hold the page text already decoded to unicode
re.search(ur'(?<=Aproxim\xe1damente )\d+', s).group(0)

# returns:
#   u'37'

Ultimately, your best bet is a package such as lxml or {}, but without more context I can't give you more specific help.

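User

#2 · edited 2024-09-24 22:31:55

Your problem ultimately boils down to the fact that in Python 2.x the str type represents a sequence of bytes, while the unicode type represents a sequence of characters. Because a single character can be encoded as several bytes, the unicode representation of a string can have a different length from the str representation of the same string. In the same way, an index into the unicode representation can point to a different part of the text than the same index into the str representation.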
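To make the byte/character distinction concrete, here is a minimal sketch. The sample text is made up for illustration (it is not the real page content), and the code is written so that it should run on both Python 2.7 and Python 3, where `bytes` plays the role of Python 2's `str` and the `u'...'` literal plays the role of `unicode`.

```python
# -*- coding: utf-8 -*-
# Made-up sample text (an assumption, not the real page content).
text = u'M\xe9xico: Aproxim\xe1damente 37 resultados'   # sequence of characters
raw = text.encode('utf-8')                               # sequence of bytes

needle = u'Aproxim\xe1damente '

char_pos = text.index(needle)                   # offset counted in characters
byte_pos = raw.index(needle.encode('utf-8'))    # offset counted in bytes

# The accented characters take two bytes each in UTF-8, so both the
# lengths and the offsets of the two representations differ.
print('%d characters, %d bytes' % (len(text), len(raw)))
print('character offset %d, byte offset %d' % (char_pos, byte_pos))
```

Each accented character that comes before the search string pushes the byte offset one position further than the character offset.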

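What happens is that when you execute str_start = html.index(u'Aproxim\xe1damente '), Python implicitly decodes the html variable using the interpreter's default encoding, which on your setup is presumably utf-8. (The stock default in Python 2 is ASCII, which is why on my own PC that line simply raises a UnicodeDecodeError.) So if str_start is n, the search string starts at the n-th character of the HTML. But when you later use n as a slice index to get whatever follows the (n+16)-th character, you actually get whatever follows the (n+16)-th byte. In this case the two are not equivalent, because the earlier part of the page contains accented characters that take 2 bytes each when encoded in utf-8.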
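The mismatch described above can be reproduced offline with a small sketch. The sample string is invented for illustration, and the character offset is computed explicitly on the decoded text to stand in for the implicit decoding; it is only meant to show what goes wrong when a character offset is used to slice bytes.

```python
# -*- coding: utf-8 -*-
# Invented sample page (an assumption); three accented characters in total.
page = u'Gu\xeda de M\xe9xico. Aproxim\xe1damente 37 resultados'
raw = page.encode('utf-8')            # what conn.read() would hand back: bytes

needle = u'Aproxim\xe1damente '
n = page.index(needle)                # offset in CHARACTERS
start = n + len(needle)               # len(needle) is 16, as in the question

# Consistent: character offset applied to the character sequence.
right = page[start:page.find(u' resultados', start)]

# Inconsistent: the same character offset applied to the byte sequence.
# The slice starts too early, because the multi-byte characters before
# this point shift every later byte position.
wrong = raw[start:raw.find(b' resultados', start)]

print(repr(right))
print(repr(wrong))
```

With this sample the byte slice begins inside the search string itself, so the extracted value carries leftover letters in front of the number.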

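The best solution is to convert the html to unicode as soon as you receive it. This small modification of your example code does what you want, with no errors or strange behavior: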

from urllib2 import Request, urlopen

url = 'http://guiasamarillas.com.mx/buscador/?actividad=Chedraui&localidad=&id_page=1'
post = None
headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)'}
req = Request(url, post, headers)
conn = urlopen(req)

# decode the raw bytes to unicode right away (assumes the page is served as utf-8)
html = conn.read().decode('utf-8')

str_start = html.index(u'Aproxim\xe1damente ')
str_end = html.find(' resultados', str_start + 16)
print html[str_start+16:str_end]
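The same decode-first idea can be wrapped in a small function so that it can be tried without a network connection. This is only a sketch: `extract_result_count` is a hypothetical helper name, the sample markup is invented, and the default encoding of utf-8 is an assumption about the page. It is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
def extract_result_count(raw_html, encoding='utf-8'):
    """Return the text between the marker and ' resultados', or None.

    Hypothetical helper; raw_html is the undecoded response body (bytes).
    """
    html = raw_html.decode(encoding)      # bytes -> text, once, up front
    marker = u'Aproxim\xe1damente '
    start = html.find(marker)
    if start == -1:                       # marker not present
        return None
    start += len(marker)                  # skip past the marker itself
    end = html.find(u' resultados', start)
    if end == -1:                         # closing phrase not present
        return None
    return html[start:end]                # all offsets are character offsets


# Invented sample markup standing in for the downloaded page.
sample = u'<p>Gu\xeda: Aproxim\xe1damente 37 resultados</p>'.encode('utf-8')
print(extract_result_count(sample))
```

Using find and checking for -1 instead of index avoids a ValueError when the page layout changes and the marker is missing.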
