BeautifulSoup: RuntimeError: maximum recursion depth exceeded

Posted 2024-10-01 11:41:13


I can't avoid the Python maximum recursion depth RuntimeError when using BeautifulSoup.

I'm trying to recurse through nested sections of code and extract the contents. The tidied-up HTML looks like this (don't ask why it looks that way :)):

<div><code><code><code><code>Code in here</code></code></code></code></div>

The function I pass the soup object to is:

^{pr2}$

You can see that I've tried increasing the default recursion limit, but that isn't a solution. I've raised it to the point where my computer hits its memory limit, and the function above still never finishes.

Any help in getting this to work and pointing out my mistake would be appreciated.

The stack trace repeats this pattern:

  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1234, in find
    l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1255, in find_all
    return self._find_all(name, attrs, text, limit, generator, **kwargs)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 529, in _find_all
    i = next(generator)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1269, in descendants
    stopNode = self._last_descendant().next_element
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 284, in _last_descendant
    if is_initialized and self.next_sibling:
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 997, in __getattr__
    return self.find(tag)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1234, in find
    l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1255, in find_all
    return self._find_all(name, attrs, text, limit, generator, **kwargs)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 529, in _find_all
    i = next(generator)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1269, in descendants
    stopNode = self._last_descendant().next_element
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 284, in _last_descendant
    if is_initialized and self.next_sibling:
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 997, in __getattr__
    return self.find(tag)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1234, in find
    l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1255, in find_all
    return self._find_all(name, attrs, text, limit, generator, **kwargs)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 512, in _find_all
    strainer = SoupStrainer(name, attrs, text, **kwargs)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1548, in __init__
    self.text = self._normalize_search_value(text)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1553, in _normalize_search_value
    if (isinstance(value, str) or isinstance(value, collections.Callable) or hasattr(value, 'match')
RuntimeError: maximum recursion depth exceeded while calling a Python object

2 Answers

I'm not sure why this works (I haven't inspected the source), but adding .text or .get_text() seems to bypass the error.

For example, changing

lambda x: BeautifulSoup(x, 'html.parser')

to lambda x: BeautifulSoup(x, 'html.parser').get_text() seems to work without raising the recursion depth error.
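A minimal sketch of that workaround on markup like the question's (the 50-level nesting is just an assumed value to make the depth obvious):

```python
from bs4 import BeautifulSoup

# Deeply nested markup modeled on the question's HTML
html = "<div>" + "<code>" * 50 + "Code in here" + "</code>" * 50 + "</div>"

soup = BeautifulSoup(html, "html.parser")
# get_text() flattens the whole tree into plain text in one pass
print(soup.get_text())  # -> Code in here
```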

I ran into this problem and browsed a lot of web pages. I've summarized two ways to solve it.

First, though, we should understand why this happens. Python limits the number of recursive calls (the default is 1000); we can see this number with print(sys.getrecursionlimit()). I guess BeautifulSoup uses recursion to find child elements, and when the recursion exceeds 1000 levels, RuntimeError: maximum recursion depth exceeded appears.
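You can check the current limit and reproduce the error with plain recursion, no BeautifulSoup involved (the depth function here is just an illustration):

```python
import sys

print(sys.getrecursionlimit())  # typically 1000 by default

def depth(n):
    # Each nested call consumes one stack frame; deep enough input hits the limit
    return 1 + depth(n - 1) if n else 0

try:
    depth(sys.getrecursionlimit() + 100)
except RuntimeError:  # RecursionError subclasses RuntimeError in Python 3.5+
    print("maximum recursion depth exceeded")
```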

First method: raise the recursion limit with sys.setrecursionlimit(). You can obviously set it to 1000000, but that may cause a segmentation fault.

Second method: use try-except. If maximum recursion depth exceeded still appears, there may be a problem with our algorithm. Generally, recursion can be replaced with a loop. In your case, we could preprocess the HTML beforehand with replace() or a regular expression.

Finally, here is an example.

from bs4 import BeautifulSoup
import sys
#sys.setrecursionlimit(10000)

try:
    # 1000 sibling <br> tags
    doc = ''.join(['<br>' for x in range(1000)])
    soup = BeautifulSoup(doc, 'html.parser')
    a = soup.find_all('br')  # find_all, so the loop below has tags to print
    for i in a:
        print(i)
except RuntimeError:  # maximum recursion depth exceeded
    print('failed')

If you remove the #, it prints the doc instead of 'failed'.

Hope this helps.
