BeautifulSoup TypeError: sequence item 0: expected str instance

# Use BeautifulSoup modules to format web page as text that can
# be parsed and indexed
#
soup = bs4.BeautifulSoup(response, "html.parser")
tok = "".join(soup.findAll("p", text=re.compile(".")))
# pass the text extracted from the web page to the parsetoken routine for indexing
parsetoken(db, tok)
documents += 1

1 answer

User

#1 · Posted 2024-10-02 22:38:15

There are a few problems here:

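First, I'm not sure where you are getting response from, but it needs to be an actual string of HTML. Make sure you are not just picking up the "response" status code that tells you whether the request to the site succeeded.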
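
One way to guard against passing the wrong thing to the parser is a small check before the BeautifulSoup call. This is only a sketch: `html_from_response` is a hypothetical helper name, not part of bs4 or the question's code, and it assumes the response is either markup already or an object with a `read()` method (as returned by `urlopen`).

```python
import io


def html_from_response(response):
    """Return markup (str or bytes) that can be handed to bs4.BeautifulSoup.

    Hypothetical helper for illustration only.
    """
    # Already markup: pass it through unchanged.
    if isinstance(response, (str, bytes)):
        return response
    # File-like response objects expose read(), which returns the page body.
    read = getattr(response, "read", None)
    if callable(read):
        return read()
    # Anything else (for example an integer status code) is not HTML.
    raise TypeError(
        "expected HTML markup or a readable response, got %s"
        % type(response).__name__
    )


# io.BytesIO stands in for a real HTTP response object here.
body = html_from_response(io.BytesIO(b"<p>Hello</p>"))
```

Passing a status code such as `200` to this helper raises a `TypeError` immediately, instead of failing later inside the indexing code.
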
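More importantly, though, note that "findAll" returns a list of BeautifulSoup Tag objects, not a list of strings, so "join" does not know what to do with them. It looks at the first object in the list, sees that it is not a string, and that is why it fails with the complaint "expected str instance". The good news is that you can use .text to pull the actual text out of a given <p> element.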
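
The failure and the fix can be reproduced without any network access. The sketch below assumes Python 3 with bs4 installed and uses a made-up inline HTML snippet; the regular-expression filter from the question is left out for brevity.

```python
import bs4

# Small inline document standing in for a downloaded page.
html = "<html><body><p>First paragraph.</p><p>Second one.</p></body></html>"
soup = bs4.BeautifulSoup(html, "html.parser")

# find_all returns Tag objects, not strings.
tags = soup.find_all("p")

# Joining the Tag objects directly reproduces the error from the question.
try:
    "".join(tags)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Extracting the text of each tag first gives join the strings it expects.
combined = " ".join(tag.get_text() for tag in tags)
```

Here `tag.get_text()` returns the same string as the `.text` attribute mentioned above.
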
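Even if you do use .text to extract the actual text from each <p> object, your join() can still fail if the list is a mix of unicode and str values. So before joining, you may need to do some encoding work to get everything into the same type.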
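
The mixed-type concern is stated for Python 2, where unicode and str are distinct. The closest Python 3 analogue is a list that mixes str and bytes, and a sketch of normalizing before the join might look like this. `to_text` is a hypothetical helper, and UTF-8 is an assumed encoding, not something the page guarantees.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes.

    Hypothetical helper; the default encoding is an assumption.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# A deliberately mixed list: two str values and one bytes value.
pieces = ["caf\u00e9", "na\u00efve".encode("utf-8"), "plain"]

# Normalizing every item to str first lets join succeed.
combined = " ".join(to_text(piece) for piece in pieces)
```
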

Here is an example I put together using this page:

>>> import bs4, re
>>> import urllib2
>>> url = "https://stackoverflow.com/questions/3925614/how-do-you-read-a-file-into-a-list-in-python"
>>> html = urllib2.urlopen(url).read()
>>> soup = bs4.BeautifulSoup(html, "html.parser")
>>> L = soup.findAll("p", text=re.compile("."))
>>> M = [t.text.encode('utf-8') for t in L]
>>> print(" ".join(M))

This prints the combined text of everything found in the "p" tags.

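Edit: this example is for Python 2.7.x. For 3.x, remove the ".encode('utf-8')". Note also that urllib2 does not exist in Python 3; its urlopen lives in urllib.request.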
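
For reference, a Python 3 version of the session above might look like the following sketch. The function names `fetch_html` and `paragraph_text` are made up for illustration, and `string=` is the current bs4 name for the `text=` argument used in the original call.

```python
import re
import urllib.request

import bs4


def fetch_html(url):
    """Download a page and return its body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def paragraph_text(html):
    """Return the text of all matching <p> tags, joined by spaces."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    # Keep only <p> tags whose single string child is non-empty,
    # mirroring the text=re.compile(".") filter in the original call.
    paragraphs = soup.find_all("p", string=re.compile("."))
    # In Python 3, get_text() already returns str, so no encode() is needed.
    return " ".join(p.get_text() for p in paragraphs)
```

With the url from the session above, `print(paragraph_text(fetch_html(url)))` would print the combined paragraph text.
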
