Extracting plain text from a URL with BeautifulSoup and Python, but the result is still not clean

Published 2024-09-30 20:32:36


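I am trying to extract the plain text of a given URL. From my searching, the most relevant tool seems to be BeautifulSoup, so I wrote a simple program to test it. However, I found that it still does not meet my requirement: the result contains a lot of content that is not plain text.

You can run the following Python code to see the result.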

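from urllib.request import urlopen  # Python 3; in Python 2.x use urllib.urlopen
from bs4 import BeautifulSoup

url = "http://www.amfastech.com/2015/07/lenovo-k3-note-brutally-honest-review-specifications-pros-cons.html"
html = urlopen(url).read().decode('utf8')

# get_text() is called on the whole document, with no filtering
raw = BeautifulSoup(html, 'html.parser').get_text()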

When you look at raw, the result contains code like the following:

^{pr2}$

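So my question is: how can I really obtain clean plain text from HTML with Python? I see many web tools support a so-called book view mode, where in most cases you see only the main article, so I reckon extracting clean plain text should not be a problem. Thanks!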


2 Answers

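You need to find the style and script tags and destroy their contents with the decompose() method. From there, simply use get_text() to get the soup's text.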

from urllib.request import urlopen  # import urllib in Python 2.x
from bs4 import BeautifulSoup

url = "http://www.amfastech.com/2015/07/lenovo-k3-note-brutally-honest-review-specifications-pros-cons.html"
html = urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')  # requires the lxml package
# Remove every <script> and <style> tag together with its contents
for tag in soup.find_all(['script', 'style']):
    tag.decompose()
soup.get_text(strip=True)

The result is:

"Lenovo K3 Note Brutally Honest Review: Specifications, Pros and Cons≡HomeAbout UsBlog IndexServicesNewsGuest PostContact UsYou are here:Home»Smartphone Reviews»Lenovo K3 Note Brutally Honest Review: Specifications, Pros and ConsSasidhar Kareti10:40:00 AMLenovo K3 Note Brutally Honest Review: Specifications, Pros and ConsIt seems like Lenovo has finally caught the pulse of smartphone market in countries like India. After the successful launch ofA6000, 6000+ and A7000, the company has come up with something big, both psychically and performance wise, with a name k3 note.The term ‘Note’ itself re.........
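The output above runs neighbouring items together (for example "HomeAbout UsBlog Index") because get_text(strip=True) joins the stripped strings with nothing between them. The following is a minimal sketch of the same approach with a separator added. The inline SAMPLE_HTML page and the clean_text helper are made up for illustration and are not part of the original answer, and html.parser is used only so that the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page (not the real article).
SAMPLE_HTML = """
<html><head><title>Demo page</title>
<style>body { color: red; }</style>
<script>var x = 1;</script></head>
<body><ul><li>Home</li><li>About Us</li></ul>
<p>Main article text.</p>
<script>(adsbygoogle = window.adsbygoogle || []).push({});</script>
</body></html>
"""


def clean_text(html):
    """Return the page text without <script>/<style> contents."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove the tags together with everything inside them.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    # separator=" " puts a space between text fragments;
    # strip=True trims each fragment and drops empty ones.
    return soup.get_text(separator=" ", strip=True)


text = clean_text(SAMPLE_HTML)
print(text)
```

This still returns navigation and other page furniture, since nothing here tells BeautifulSoup which part of the page is the article.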

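Well, you are using BeautifulSoup the wrong way: to extract your text, you should not take the raw text of the whole page. BS is not a magic wand that can guess what you need from a page; you have to tell it what to do. So you should look up the classes and ids of the elements you want to extract (bs below is the parsed BeautifulSoup object):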

>>> bs.find_all('h1')[0].getText()
u'\nLenovo K3 Note Brutally Honest Review: Specifications, Pros and Cons\n'
>>> bs.find_all(attrs={'class': 'entry-content'})[0].getText()
u'\n\n\n\n\n\n(adsbygoogle = window.adsbygoogle || []).push({});\n\n\nIt seems like Lenovo has finally caught the pulse of smartphone market in countries like India. After the successful launch of A6000, 6000+ and A7000, the company has come up with something big, both psychically and performance wise, with a name k3 note.The term \u2018Note\u2019 itself reminds us of the large phones which was actually been started mentioning by Samsung for its phablets. Like all other smartphone manufacturer companies, Lenovo also took up the term for its new boy.In this review, I\u2019ll be discussing the specifications of the K3 Note phablet in the price point of view and will be discussing the pros and cons of this device honestly brutally honestly.Let\u2019s begin! In the boxAlong with the handset, you will get a screen guard (non-tamper proof), 2-pin wall mounted charger, USB cable and removable battery in the box. K3 Note will not be accompanied by the headset in the box. That\u2019s somewhat upsetting to see A7000 coming with one and K3 Note with none. DesignNo actual changes were made to the physical design of Lenovo K3 Note compared to its predecessor, A7000. In fact, you will not see the difference between the two devices physically when kept side-by-side. \xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 \xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 \xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 The screen size, body, camera, flash and speaker, buttons and slots are in the same position as A7000. K3 Note\u2019s physical design looks as good as A7000 but not build that tough. The body has low build quality and it can easily be broken under the appliance of little \u2018more\u2019 pressure. DisplayLenovo K3 Note comes with 5.5 inch Full HD IPS display that can render 401 pixels per inch (PPI) on 1080P resolution display.The screen contributes 72% to the body ratio thus making it a large screen-less body device. 
The best viewing angles of the screen has specified to be 178 degrees and it has 5-point touch sensor that can recognize 5-touch points simultaneously. Processor & RAMLenovo K3 Note comes with 1.7 GHz MediaTek Cortex A53 64-bit processor which is 0.2GHz faster than Lenovo A7000. The 2 GB RAM supports the processor at its best in multi-tasking.The combo is supported with ARM Mali-T760 MP2 GPU which is not so different to A7000\u2019s. You can experience good 3D gaming with this GPU configuration in parallel with the processor and RAM. MemoryK3 Note comes with 16 GB built-in ROM and allows users to expand the memory up to 32 GB through microSD card. This is an upgraded feature when compared to Lenovo A7000\u2019s 8 GB ROM.  Operating SystemK3 Note runs on Android Lollipop v5.0 which is not even 5.0.2. It is sad to see Lenovo\u2019s next product, after A7000 coming with v5.0. It is expected to get Android Lollipop v5.1 in future. CameraLenovo has upgraded the rear camera for K3 Note from 8MP to 13MP. The dual tone LED flash helps to take best shots in both lighting conditions. The camera is added with some new shooting modes compared to A7000. It can record full HD\xa01080P resolution videos with 30 frames per second rate.The front camera can take 5MP sharp photos and it is good enough to take best selfies.K3 Note\u2019s camera specifications are satisfying for its price range. ConnectivityIt supports 4G LTE networks in both the slots and have the same Wi-Fi, Bluetooth and OTG support specifications that A7000 came up with. BatteryLenovo K3 Note has got 2900mAh powered battery which can hold the charging on moderate usage for 24 hours at most. The 1080P screen absorbs the juice quickly and so it cannot last as long as A7000. 
Pros  A bit more fast processor  Upgraded camera  More internal memory  Full HD screen  Full HD recording  Removable battery Cons  Low built quality body  Same design as A7000  No Lollipop v5.0.2 at least  No Gorilla Glass 3 protection  High SAR values 1.590W/KG for head and 0.688W/KG for body Update: Unboxing photos (shared by a fan exclusively for Amfas Tech) \xa0  For more photos: Check out Lenovo K3 Note album on our Facebook page. \xa0 Final VerdictLenovo K3 Note has got some improvements like 16 GB internal storage, 1080P screen and video recording, little faster processor. The rest of the phone is a quite replica of Lenovo A7000. It could have been named as \u2018Lenovo A7000 Plus\u2019 instead of \u2018K3 Note\u2019.After looking at the specifications and advancements, Lenovo K3 Note for such a low price of 9,999 INR is a great deal. If you are planning to buy A7000, dare 1,000 bucks more for K3 Note and you will get a damn good phone for that price (statement made keeping price in mind).Note: If you talk more on phone, think a while choosing this phone as its SAR values are very highly specified.\n\n\n\n\n\n(adsbygoogle = window.adsbygoogle || []).push({});\n\n\n\n\n\n(adsbygoogle = window.adsbygoogle || []).push({});\n\nPlease share this article if you like it! Bless me or curse me in comments! Thank you for reading anyway!\n\n\n\n\n'

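There is still some cleanup to do (mainly because of the JS ads in the text), but most of it is there. You need to look at which tags, classes and ids you want to keep in the body.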
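One way to do that remaining cleanup is sketched below, under stated assumptions: the inline SAMPLE_HTML page and the extract_article helper are hypothetical, and only the class name entry-content is taken from the session above. The sketch picks the container by class, drops the script and style tags inside it, and collapses runs of whitespace (including the \xa0 non-breaking spaces seen in the output) into single spaces.

```python
from bs4 import BeautifulSoup

# Hypothetical page: a sidebar plus an article container with an ad script.
SAMPLE_HTML = """
<html><body>
<div class="sidebar">Home About Us Contact Us</div>
<div class="post-body entry-content">
<script>(adsbygoogle = window.adsbygoogle || []).push({});</script>
<p>First paragraph of the article.</p>
<p>Second&nbsp;&nbsp;&nbsp;paragraph.</p>
</div>
</body></html>
"""


def extract_article(html, css_class):
    """Return the cleaned text of the first element carrying css_class."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(class_=css_class)
    if container is None:
        return ""  # no such element on this page
    # Drop embedded scripts and styles (for example the ad snippets).
    for tag in container.find_all(["script", "style"]):
        tag.decompose()
    # str.split() with no arguments splits on any Unicode whitespace,
    # which includes newlines and non-breaking spaces (\xa0).
    return " ".join(container.get_text(separator=" ").split())


article = extract_article(SAMPLE_HTML, "entry-content")
print(article)
```

The class name has to be found by inspecting each site's markup, so a helper like this works per site rather than for arbitrary pages.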

So my question is, how can I really obtain the clean plain text from html by Python. I see many web tools support a so-called book view mode, where you can see the main article only in most cases, so I reckon it should not be a problem to extract the clean plain text

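That is unrelated: the "raw" text view is simply a different CSS style that displays only the text. It does not make the source of the page any simpler.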
