Struggling with BeautifulSoup and tags

from bs4 import BeautifulSoup
import requests

URL = 'https://en.m.wikipedia.org/wiki/List_of_methods_of_torture'
page = requests.get(URL)

html_soup = BeautifulSoup(page.content, 'html.parser')
type(html_soup)

print(html_soup.find("div", class_="mw-parser-output").find_all(text=True, recursive=False))

3 answers

User

Answer 1 · edited 2024-09-29 23:15:46

Try this. Your expected output is inside the section element.

from bs4 import BeautifulSoup
import requests

URL = 'https://en.m.wikipedia.org/wiki/List_of_methods_of_torture'
page = requests.get(URL)

html_soup = BeautifulSoup(page.content, 'html.parser')
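# Dump the parsed tree once to inspect the page structure.
print(html_soup.prettify())

# Collect the text of every link inside the target section.
print([x.text for x in html_soup.find("section", class_="mf-section-1").find_all('a')])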

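User

Answer 2 · edited 2024-09-29 23:15:46

There are a few problems here. First, the recursive=False argument means you only get text that sits directly inside the selected node; you won't get any text from its child nodes. Since that div element contains no direct text, the method returns an empty list.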
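To make the recursive=False behaviour concrete, here is a minimal offline sketch. The markup is a hypothetical, heavily simplified stand-in for the real page (it only assumes the div wraps a section and has no text of its own), so treat it as an illustration rather than the actual Wikipedia structure.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: the div has no text of its own,
# all the text lives in nested elements.
html = (
    '<div class="mw-parser-output">'
    '<section class="mf-section-1"><ul>'
    '<li><a href="#">Ego-Fragmentation</a></li>'
    '<li><a href="#">Learned Helplessness</a></li>'
    '</ul></section>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="mw-parser-output")

# Only text nodes that are direct children of the div.
direct_text = div.find_all(string=True, recursive=False)

# Text nodes anywhere below the div (the default, recursive search).
all_text = div.find_all(string=True)

print(direct_text)
print(all_text)
```

With this sample, the non-recursive search finds nothing, while the default search reaches the link text inside the list items.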

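Second, the div you selected contains not only the "Psychological torture methods" section but also the other sections of the page and the disclaimer shown at the top of the article. To get the information you want, take only the content of the section node whose class is mf-section-1.

Solution

I just adjusted your code to print the information you need. I had to use the lstrip method to strip the unnecessary leading newlines.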

from bs4 import BeautifulSoup
import requests

URL = 'https://en.m.wikipedia.org/wiki/List_of_methods_of_torture'
page = requests.get(URL)

html_soup = BeautifulSoup(page.content, 'html.parser')
# Collect every text node inside the target section, then trim leading newlines.
section = html_soup.find("section", class_="mf-section-1")
print(''.join(section.find_all(string=True)).lstrip("\n"))

Output

Ego-Fragmentation
Learned Helplessness
Chinese water torture
Welcome parade (torture)
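As a possible alternative to joining the text nodes and calling lstrip, Beautiful Soup's stripped_strings generator skips whitespace-only nodes and trims each string. The sketch below runs on a hypothetical inline snippet, not the live page, so the exact whitespace of the real article is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet that mimics the newlines between list items.
html = """<section class="mf-section-1">
<ul>
<li><a href="#">Ego-Fragmentation</a></li>
<li><a href="#">Learned Helplessness</a></li>
</ul>
</section>"""

soup = BeautifulSoup(html, "html.parser")
section = soup.find("section", class_="mf-section-1")

# stripped_strings yields each non-empty text node with whitespace trimmed.
names = list(section.stripped_strings)
print("\n".join(names))
```

This also gives you a list of names to work with, which is usually handier than one long string.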

User

Answer 3 · edited 2024-09-29 23:15:46

When in doubt, brute-force it and pretend you'll come back to it later.

from bs4 import BeautifulSoup
import requests

URL = 'https://en.m.wikipedia.org/wiki/List_of_methods_of_torture'
page = requests.get(URL)

html_soup = BeautifulSoup(page.content, 'html.parser')

sections = html_soup.find_all("section")
torture_methods = sections[1].find_all("li")
torture_method_names = list(map(lambda x: x.text, torture_methods))
print(torture_method_names)

Prints:

['Ego-Fragmentation', 'Learned Helplessness', 'Chinese water torture', 'Welcome parade (torture)']
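Indexing sections[1] breaks silently if the page layout shifts. A slightly more defensive sketch looks the section up by class and returns an empty list when it is missing. The helper name list_item_texts and the sample markup are made up for illustration; only the mf-section-1 class comes from the answers above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two sections, only the second has the target class.
SAMPLE_HTML = """
<section class="mf-section-0"><p>Intro text</p></section>
<section class="mf-section-1"><ul>
<li><a href="#">Ego-Fragmentation</a></li>
<li><a href="#">Learned Helplessness</a></li>
</ul></section>
"""

def list_item_texts(html, section_class):
    """Return the text of each <li> in the section with the given class."""
    soup = BeautifulSoup(html, "html.parser")
    section = soup.find("section", class_=section_class)
    if section is None:
        # find() returns None when nothing matches, so guard before using it.
        return []
    return [li.get_text(strip=True) for li in section.find_all("li")]

names = list_item_texts(SAMPLE_HTML, "mf-section-1")
missing = list_item_texts(SAMPLE_HTML, "no-such-section")
print(names)
print(missing)
```

For the real page you would pass page.content instead of SAMPLE_HTML, assuming the mobile layout still uses the same section classes.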
