How can I strip the text surrounding an article's paragraphs with BeautifulSoup in Python?

Posted 2024-10-04 07:26:42


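I'm new to Python and I'm building a web scraper that goes through a list of articles on the internet and extracts their text. However, when I call my function get_text(url), I get a lot of unwanted text before and after the actual article content.

I don't know how to classify it other than as unnecessary (sorry for being vague). There is an example below.

Here is my code: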

from bs4 import BeautifulSoup
import requests
from urllib.request import Request, urlopen

def get_text(url):
    request = requests.get(url)
    if request.status_code != 200:
        print('Web site does not exist')
        return "none"

    weburl = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    data = urlopen(weburl).read()

    soup = BeautifulSoup(data, "html.parser")

    for script in soup(["script", "style"]):
        script.extract()

    text = soup.body.get_text()

    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    text = str.join(" ", text.splitlines())

    return text

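The problem is that I only want the actual article, and I'm getting much more than that. Is there a way to get only the text of the actual article?

For example, calling the function on the following article page: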

get_text("https://www.zdnet.com/article/programming-languages-python-rules-while-java-dips/")

Output of the call:

Edition: Asia Australia Europe India United Kingdom United States ZDNet around the globe: ZDNet France ZDNet Germany ZDNet Korea ZDNet Japan Search What are you looking for? Go Videos Windows 10 5G Cloud Best VPNs 2021 Security more AI TR Premium Working from Home Innovation Best Web Hosting ZDNet Recommends Tonya Hall Show Executive Guides ZDNet Academy See All Topics White Papers Downloads Reviews Galleries Videos TechRepublic Forums NewslettersAll WritersPreferencesCommunityNewslettersLog Out What are you looking for? Go Menu Videos Windows 10 5G Cloud Best VPNs 2021 Security AI TR Premium Working from Home Innovation Best Web Hosting ZDNet Recommends Tonya Hall Show Executive Guides ZDNet Academy See All Topics White Papers Downloads Reviews Galleries Videos TechRepublic Forums Preferences Community Newsletters Log Out us Asia Australia Europe India United Kingdom United States ZDNet around the globe: ZDNet France ZDNet Germany ZDNet Korea ZDNet Japan Programming languages: Python rules while Java dips Tiobe also predicts that Julia, a programming language with roots in MIT, will become a top 20 language in 2021. By Liam Tung | January 5, 2021 -- 11:19 GMT (03:19 PST) | Topic: Developer Developer: Rust programming language is being used for bigger projects Watch Now Software-checking business Tiobe has ranked Python as the top programming language of 2020 because it gained more popularity in its index than another language over the year. Tiobe, which uses programming-related queries on search engines to calculate its rankings, saw Python rise 2.01% over 2020 while Java fell 5%. techrepublic cheat sheet How to become a developer: Salaries, skills, and the best languages to learn Despite Python's star status, Tiobe's top 5 January 2021 rankings place C at the top, followed by Java, then Python, C++ and C#. 
SEE: Hiring Kit: Python developer (TechRepublic Premium)    Why is 35-year-old Python so popular today while Java seems to be sliding among enterprise software engineers? According to Tiobe CEO Paul Jansen, it's Python's versatility, how easy it is to learn, and high productivity. But one thing that's keeping Python from the top position is that C still offers better performance. "Python is popping up everywhere. It started as a competitor of Perl to write scripts for system administrators a long time ago," notes Jansen. "Nowadays it is the favorite language in fields such as data science and machine learning, but it is also used for web development and back-end programming and growing into the mobile application domain and even in (larger) embedded systems."Jansen predicts the question of performance will keep it off becoming the top language for some years to come. However, he also predicts that Python will soon permanently steal Java's position as the second most popular language. Despite Java being in second position this month, Python overtook Java in Tiobe's November rankings. It was the first time in the 20 years since Tiobe has tracked language popularity that Java and C weren't the top two languages. Tiobe found that Java fell by almost 5% over the past year.The company expects Julia, a programming language growing in data-science and machine-learning fields, will become a top 20 language in 2021. Julia has only been available since 2012 but made it to Tiobe's top 50 in August 2018. Julia was created for tasks in scientific computing, machine learning, data mining, large-scale linear algebra, distributed, and parallel computing. It aims for the speed of C but to be as useful for general programming as Python. 
SEE: Programming languages: Microsoft TypeScript leaps ahead of C#, PHP and C++ on GitHubBut some software engineers reckon Python doesn't properly serve developers who build browser applications and mobile applications in a way that JavaScript or Microsoft's type safety take on JavaScript, TypeScript, do. JavaScript currently ranks 7th on Tiobe's index, but on developer analyst firm RedMonk's latest popularity rankings it is the top programming language. TypeScript is 42nd on Tiobe, but 9th on RedMonk. Tiobe also noted that specialist statistical language R rose from 18th to 9th position over 2020. Developer Software developers: How plans to automate coding could mean big changes ahead Developer jobs: Demand for programming language Python falls amid pandemic Visual Studio 2019: Now IntelliSense linter for C++ programming language cleans up code Apple Silicon promises more powerful Macs, but developers face growing pains (ZDNet YouTube) The Best Web Hosting Providers (CNET) How to get a developer job (TechRepublic) Related Topics: Enterprise Software Open Source Mobile OS By Liam Tung | January 5, 2021 -- 11:19 GMT (03:19 PST) | Topic: Developer Show Comments LOG IN TO COMMENT My Profile Log Out | Community Guidelines Join Discussion Add Your Comment Add Your Comment More from Liam Tung Windows 10 Adobe Flash: It's finally over (well, almost) Windows 10 Microsoft plans 'sweeping' design changes to show that Windows 'is back' Enterprise Software Rocky Linux: First release is coming in Q2 2021 say developers Microsoft Zoom eyes email and calendar app to take on Google and Microsoft, says report Please review our terms of service to complete your newsletter subscription. By registering, you agree to the Terms of Use and acknowledge the data practices outlined in the Privacy Policy. You will also receive a complimentary subscription to the ZDNet's Tech Update Today and ZDNet Announcement newsletters. You may unsubscribe from these newsletters at any time. 
You agree to receive updates, alerts, and promotions from the CBS family of companies - including ZDNet’s Tech Update Today and ZDNet Announcement newsletters. You may unsubscribe at any time. By signing up, you agree to receive the selected newsletter(s) which you may unsubscribe from at any time. You also agree to the Terms of Use and acknowledge the data collection and usage practices outlined in our Privacy Policy. Continue Newsletters See All See All Related Stories 1 of 3 Linus Torvalds tears into Intel, favors AMD Torvalds, Linux's creator, finds AMD's processors deliver a much bigger bang for the buck than Intel's CPUs. The year ahead in DevOps and agile: Time to instill a sense of urgency For DevOps and agile to move forward in the year ahead, it's important that the business understands what's in it for them. 10 most 'disruptive' information technology jobs in the year ahead Artificial intelligence, DevOps-related skills drawing the highest premium, analysis of almost two million job openings finds. Robots for kids: STEM kits and more tech gifts for hackers of all ages If you want to spark the imagination of your kids while at the same time giving them a leg up with some of the tech skills they'll need as adults, you can't go wrong by looking at these products ... Why Red Hat dumped CentOS for CentOS Stream No, it wasn't IBM calling the shots. This decision was made inside Red Hat for business reasons and it had been a long time coming. Where Fedora fits in the new Red Hat/CentOS Stream Linux world With CentOS Stream now "tracking ahead" of Red Hat Enterprise Linux, where exactly does that leave Fedora, Red Hat's community Linux distro, and long-time RHEL test release? ... Rust programming language: We're using it for bigger projects, say developers Rust's appeal among developers and software engineers is growing as giants like Microsoft and AWS look to the language to help build infrastructure and systems. ... 
Using PyTorch to streamline machine-learning projects A platform that lets surgeons browse videos of past operations has found a way to make its machine learning more effective. This is what happens to your brain when you read computer code Reading software code is different to reading written language, but it also doesn't rely on parts of the brain activated by maths. ZDNet Connect with us © 2021 ZDNET, A RED VENTURES COMPANY. ALL RIGHTS RESERVED. Privacy Policy | Cookie Settings | Advertise | Terms of Use Topics Galleries Videos Sponsored Narratives Do Not Sell My Information About ZDNet Meet The Team All Authors RSS Feeds Site Map Reprint Policy Manage | Log Out Join | Log In Membership Newsletters Site Assistance ZDNet Academy TechRepublic Forums

The expected output would be:

Software-checking business Tiobe has ranked Python as the top programming language of 2020 because it gained more popularity in its index than another language over the year. Tiobe, which uses programming-related queries on search engines to calculate its rankings, saw Python rise 2.01% over 2020 while Java fell 5%. Despite Python's star status, Tiobe's top 5 January 2021 rankings place C at the top, followed by Java, then Python, C++ and C#. Why is 35-year-old Python so popular today while Java seems to be sliding among enterprise software engineers? According to Tiobe CEO Paul Jansen, it's Python's versatility, how easy it is to learn, and high productivity. But one thing that's keeping Python from the top position is that C still offers better performance. "Python is popping up everywhere. It started as a competitor of Perl to write scripts for system administrators a long time ago," notes Jansen. "Nowadays it is the favorite language in fields such as data science and machine learning, but it is also used for web development and back-end programming and growing into the mobile application domain and even in (larger) embedded systems."Jansen predicts the question of performance will keep it off becoming the top language for some years to come. However, he also predicts that Python will soon permanently steal Java's position as the second most popular language. Despite Java being in second position this month, Python overtook Java in Tiobe's November rankings. It was the first time in the 20 years since Tiobe has tracked language popularity that Java and C weren't the top two languages. Tiobe found that Java fell by almost 5% over the past year.The company expects Julia, a programming language growing in data-science and machine-learning fields, will become a top 20 language in 2021. Julia has only been available since 2012 but made it to Tiobe's top 50 in August 2018. 
Julia was created for tasks in scientific computing, machine learning, data mining, large-scale linear algebra, distributed, and parallel computing. It aims for the speed of C but to be as useful for general programming as Python. But some software engineers reckon Python doesn't properly serve developers who build browser applications and mobile applications in a way that JavaScript or Microsoft's type safety take on JavaScript, TypeScript, do. JavaScript currently ranks 7th on Tiobe's index, but on developer analyst firm RedMonk's latest popularity rankings it is the top programming language. TypeScript is 42nd on Tiobe, but 9th on RedMonk. Tiobe also noted that specialist statistical language R rose from 18th to 9th position over 2020.

Is there a way to filter out the unnecessary text?


1 answer
Community user
#1 · Posted 2024-10-04 07:26:42

Instead of grabbing the text of the entire body tag, as in:

text = soup.body.get_text()

be more specific and grab only the article tag, for example:

article = ''.join([p.get_text() for p in soup.select_one('article').select('p')][1:-1])

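What is happening there?

  1. soup.select_one('article') selects the first article tag in the document

  2. select('p') selects all p tags within the result of soup.select_one('article')

  3. [p.get_text() for p in soup.select_one('article').select('p')] loops over all the results of select('p') and builds a list of their text

  4. The last step, ''.join(), concatenates all the elements into one string, after the list slice [1:-1] has dropped the first and last elements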
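The steps above can be sketched as a small function. This is only a minimal sketch under a few assumptions, not a definitive implementation: the name extract_article_text and the skip_first / skip_last parameters are hypothetical (they generalize the [1:-1] slice), the sample HTML is invented for illustration and is not the ZDNet page, and the paragraphs are joined with a space rather than '' so that sentences from neighbouring paragraphs do not run together. It also returns an empty string when no article tag is present, because select_one returns None in that case and the one-liner would raise an AttributeError.

```python
from bs4 import BeautifulSoup


def extract_article_text(html, skip_first=1, skip_last=1):
    """Return the text of the <p> tags inside the first <article> tag.

    skip_first and skip_last are hypothetical parameters: they drop that
    many leading and trailing paragraphs (for example a byline and a
    "related topics" line), like the [1:-1] slice in the answer.
    """
    soup = BeautifulSoup(html, "html.parser")

    # select_one returns the first match, or None if nothing matches.
    article = soup.select_one("article")
    if article is None:
        return ""

    # Collect the stripped text of every <p> inside the article.
    paragraphs = [p.get_text(strip=True) for p in article.select("p")]

    # Clamp the end index at 0 so a large skip_last cannot turn into an
    # unintended negative slice index.
    end = max(len(paragraphs) - skip_last, 0)
    return " ".join(paragraphs[skip_first:end])


# Invented sample page: navigation and footer text live outside <article>.
sample_html = """
<html><body>
  <nav><p>Menu Videos Cloud</p></nav>
  <article>
    <p>By Author | Topic: Developer</p>
    <p>First body paragraph.</p>
    <p>Second body paragraph.</p>
    <p>Related Topics: Open Source</p>
  </article>
  <footer><p>Privacy Policy</p></footer>
</body></html>
"""

print(extract_article_text(sample_html))
```

How many paragraphs to skip at each end depends on the site's markup, so the defaults here are a starting point to adjust after inspecting the page, not a rule that holds for every article.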
