Scrape only the paragraphs that contain certain words

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re

url = "https://www.un.org/africarenewal/news/drought-pushing-food-prices-sharply-east-africa"
req = Request(url, headers={"User-Agent": 'Mozilla/5.0'})
page = urlopen(req, timeout=5)  # Open page within 5 seconds. This line skips 'empty' websites
htmlParse = BeautifulSoup(page.read(), 'lxml')  # html5lib
SearchWords = ["drought", "water", "food"]  # text must contain these words
textP = ""
text = ""
for word in SearchWords:
    print(word)
    for r in re.findall(re.compile('.{0,100}' + word + '.{0,100}'), htmlParse.text):
        textP = textP + r
    text = text + textP
print(text)

1 answer

User

#1 · Posted 2024-09-27 19:27:42

To split a string into paragraphs, you can use Python's re module and split on runs of two or more line breaks:

re.split(r'(?:\r\n?|\n){2,}', htmlParse.text)
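As a quick check of what that pattern does, here is a minimal sketch on a made-up string instead of the scraped page. The sample text and the helper name split_paragraphs are assumptions for illustration only; stripping and dropping empty pieces is an extra step not in the one-liner above.

```python
import re

# Two or more consecutive line breaks (\r\n, \r, or \n) mark a paragraph boundary.
PARAGRAPH_BREAK = re.compile(r'(?:\r\n?|\n){2,}')


def split_paragraphs(text):
    """Split text on blank-line gaps, trimming whitespace and dropping empty pieces."""
    return [p.strip() for p in PARAGRAPH_BREAK.split(text) if p.strip()]


# Made-up sample: a single \n stays inside a paragraph, a blank line ends it.
sample = "First paragraph, line one\nline two\n\nSecond paragraph\r\n\r\n\r\nThird paragraph"
paragraphs = split_paragraphs(sample)
print(paragraphs)
```

A single line break stays inside its paragraph, so how well this works on a real page depends on how htmlParse.text lays out its whitespace.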

Next, you want the unique paragraphs that contain at least one of the predefined strings:

pars = set([p for p in re.split(r'(?:\r\n?|\n){2,}', htmlParse.text) if any(x in p for x in SearchWords)])
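A small sketch of this filtering step on made-up paragraphs. The sample strings and the helper name filter_paragraphs are assumptions for illustration. Unlike set(...), this variant keeps the paragraphs in first-seen order, which is a swapped-in choice and not what the line above does. It also shows that a plain `in` test is case-sensitive and matches inside longer words.

```python
SearchWords = ["drought", "water", "food"]


def filter_paragraphs(paragraphs, words):
    """Keep each paragraph once, in original order, if it contains any word as a substring."""
    seen = set()
    kept = []
    for p in paragraphs:
        if p not in seen and any(w in p for w in words):
            seen.add(p)
            kept.append(p)
    return kept


# Made-up paragraphs standing in for the split page text.
sample_paragraphs = [
    "The drought continues.",
    "Seafood exports rose.",      # kept: "food" occurs inside "Seafood"
    "Water is scarce.",           # dropped: "water" != "Water" (case-sensitive)
    "The drought continues.",     # dropped: duplicate
    "Nothing relevant here.",
]
pars = filter_paragraphs(sample_paragraphs, SearchWords)
print(pars)
```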

Now, if you want a case-insensitive whole-word search, you can use re again:

pars = set([p for p in re.split(r'(?:\r\n?|\n){2,}', htmlParse.text) if re.search(rf'\b(?:{"|".join(SearchWords)})\b', p, re.I)])

Here, the regex \b(?:drought|water|food)\b matches drought, water, or food as whole words only, and re.I makes the search case-insensitive.
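To see the whole-word behaviour in isolation, here is a minimal sketch on made-up sentences (the sample strings are assumptions). It also passes each word through re.escape before joining, which the line above does not do; that is a precaution in case a search word ever contains regex metacharacters.

```python
import re

SearchWords = ["drought", "water", "food"]

# Build \b(?:drought|water|food)\b, escaping each word defensively.
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(w) for w in SearchWords) + r')\b',
    re.I,
)

# Made-up sentences standing in for paragraphs.
samples = [
    "Water is scarce.",        # matches: case is ignored
    "Seafood exports rose.",   # no match: "food" is not a whole word here
    "Food prices are up.",     # matches
    "Droughts are common.",    # no match: the plural is a different word
]
matches = [p for p in samples if pattern.search(p)]
print(matches)
```

Note that plurals such as "droughts" are not matched by a whole-word search; add them to SearchWords if they should count.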
