Beautiful Soup: \u003b appears and messes everything up?

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup

article_url = 'https://apnews.com/article/lifestyle-travel-coronavirus-pandemic-health-education-418fe38201db53d2848e0138a28ff824'

# session that retries failed connections
session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)

user_agent = 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'
request_header = {'User-Agent': user_agent}
source = session.get(article_url, headers=request_header).text
soup = BeautifulSoup(source, 'lxml')

# get all <p> paragraphs from article
paragraphs = soup.find_all('p')

# print each paragraph as a line
for paragraph in paragraphs:
    print(paragraph)

1 Answer

User

#1 · Posted 2024-05-03 12:45:11

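In my case, at least, I had to use a regular expression to extract a JavaScript object that holds the data, parse it with json into a Python object, pull out the value containing the page HTML as it appears in the browser, and then extract the paragraphs from that HTML. I removed the retry logic; you can easily put it back.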

import json
import re

import requests
# from requests.adapters import HTTPAdapter
# from requests.packages.urllib3.util.retry import Retry
from bs4 import BeautifulSoup

article_url = 'https://apnews.com/article/lifestyle-travel-coronavirus-pandemic-health-education-418fe38201db53d2848e0138a28ff824'
user_agent = 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'
request_header = {'User-Agent': user_agent}
source = requests.get(article_url, headers=request_header).text

# The article data is embedded in the page as a JavaScript object
# assigned to window['titanium-state']; capture it and parse it as JSON.
data = json.loads(re.search(r"window\['titanium-state'\] = (.*)", source, re.M).group(1))
content = data['content']['data']
# Take the first entry under content -> data.
content = content[list(content.keys())[0]]
# storyHTML holds the article markup; name the parser explicitly.
soup = BeautifulSoup(content['storyHTML'], 'lxml')

for p in soup.select('p'):
    print(p.text.strip())
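
The retry setup from the question can be put back by building the session in a small helper. This is a minimal sketch, assuming the same Retry settings as the question (connect=3, backoff_factor=0.5); make_session is a hypothetical name chosen for illustration, not part of requests.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(connect_retries=3, backoff_factor=0.5):
    """Hypothetical helper: a requests.Session that retries failed connections."""
    retry = Retry(connect=connect_retries, backoff_factor=backoff_factor)
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Use the retrying adapter for both plain and TLS URLs.
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session


session = make_session()
```

With this in place, requests.get(article_url, headers=request_header) in the answer becomes session.get(article_url, headers=request_header); the rest of the code stays the same.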

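Regex: window\['titanium-state'\] = (.*) matches the literal text window['titanium-state'] = and captures the rest of that line as group 1, which is the JSON text passed to json.loads.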
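
To see what the pattern captures without requesting the page, here is a small offline sketch. The sample_source string is made up for illustration (the real page is far larger, and the key "some-id" and the second window[...] line are hypothetical). Because . does not match a newline, (.*) stops at the end of the line; re.M only changes ^ and $, so it has no effect on this pattern. json.loads also turns escapes such as \u003c back into <.

```python
import json
import re

# Made-up stand-in for the page source; only the shape matters here.
sample_source = (
    "<script>\n"
    "window['titanium-state'] = "
    r'{"content": {"data": {"some-id": {"storyHTML": "\u003cp\u003eHello\u003c/p\u003e"}}}}'
    "\n"
    "window['other-state'] = {}\n"
    "</script>\n"
)

pattern = r"window\['titanium-state'\] = (.*)"
match = re.search(pattern, sample_source)
captured = match.group(1)  # everything after "= " up to the end of that line

data = json.loads(captured)  # decodes \u003c and \u003e into < and >
content = data['content']['data']
story = content[list(content.keys())[0]]  # first entry, as in the answer
story_html = story['storyHTML']

print(story_html)  # <p>Hello</p>
```

If the assignment on the real page ended with a trailing character such as a semicolon, json.loads would raise an error and the captured text would need that character stripped first; the sample above assumes there is none.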
