Removing spaces, whitespace, and newlines from scraped data

Posted 2024-06-28 20:45:08


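I am using BeautifulSoup to scrape data from a URL. After cleaning, the data still contains many spaces, runs of blank characters, and newlines. I tried the .strip() function to remove them, but they are still there.

Code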

from bs4 import BeautifulSoup
import requests
import re
URL="https://www.flagstaffsymphony.org/event/a-flag-on-fourth/"
html_content = requests.get(URL).text
cleantext = BeautifulSoup(html_content, "lxml").text
cleanr = re.compile('<.*?>')
clean_data = re.sub(cleanr, ' ', cleantext)
text = re.sub('([^\x00-\x7F]+)|(\n)|(\t)',' ', clean_data)
with open('read.txt', 'w') as file:
    file.writelines(text)

Output

   America the Beautiful: A Virtual Patriotic Salute   Flagstaff Symphony Orchestra                                                                                           Contact             Hit enter to search or ESC to close                                     About  Our Team Our Conductor Orchestra Members   Concerts & Events  Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs   Support The FSO  Donate to FSO Sponsor a Chair Funding and Impact   Videos Donate Subscription Tickets                  All Events   This event has passed. America the Beautiful: A Virtual Patriotic Salute  July 4, 2020         Violin Virtuoso Beethoven Virtual 5k             In place of our traditional 4th of July concert at the Pepsi Amphitheater, the Flagstaff Symphony Orchestra will present a virtual patriotic salute to be released HERE and our Facebook page at on July 4, 2020 at 11am. The FSO is proud to offer a special rendition of  America the Beautiful  performed by 60 of their professional musicians, coming together virtually, to celebrate our nation s independence. CLICK HERE FOR DETAILS   + Google Calendar+ iCal Export     Details    Date:    July 4, 2020   Event Category: Concerts and Events             Violin Virtuoso Beethoven Virtual 5k                   Concert InfoConcerts Concerts and Events FAQs     FSO InfoAbout FSO Mission and History Our Team Our Conductor Orchestra Members     Support FSOMake a Donation Underwriting a Concert Sponsor a Chair Advertise with FSO Volunteer Leave a Legacy Donor Bill of Rights Code of Ethical Standards  (Used by permission of the Association of Fundraising Professionals)     ResourcesCommunity & Education For Musicians For Board Members             2021 Flagstaff Symphony Orchestra. 
           Copyright 2019 Flagstaff Symphony Association                             About  Our Team Our Conductor Orchestra Members   Concerts & Events  Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs   Support The FSO  Donate to FSO Sponsor a Chair Funding and Impact   Videos Donate Subscription Tickets   Contact  

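In the code above I replace non-ASCII (Unicode) characters with ' ' (a space). If I do not replace them with a space, several words run together. What I am trying to get is a single string with no unnecessary whitespace or newlines.

Follow-up question

I tried strip(), re.sub(), and various other approaches to remove the spaces at the start of some lines in the text, but nothing had any effect on the following data: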

Subscription Tickets
 All Events
This event has passed.
America the Beautiful: A Virtual Patriotic Salute
July 4, 2020
 Violin Virtuoso
Beethoven Virtual 5k 

How can we remove these spaces?


3 answers

You can try:

print(re.sub(r'\s+', ' ', text))
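
For context, .strip() only removes whitespace at the two ends of the whole string, which is why the runs in the middle survive. Here is a minimal, self-contained sketch of the collapse step; the helper name collapse_whitespace and the sample string are made up for illustration and are not from the question.

```python
import re


def collapse_whitespace(raw_text):
    """Replace every run of whitespace with one space and trim both ends."""
    # For str patterns in Python 3, \s also matches Unicode whitespace such
    # as the non-breaking space '\xa0' that scraped pages often contain.
    return re.sub(r'\s+', ' ', raw_text).strip()


# Hypothetical sample, not taken from the scraped page.
sample = "  Contact \n\n\t Hit enter\xa0\xa0to search  \n"

print(collapse_whitespace(sample))  # Contact Hit enter to search
# str.split() with no arguments splits on the same whitespace runs,
# so this gives the same result without a regex:
print(' '.join(sample.split()))     # Contact Hit enter to search
```

Either form returns one line of text, so use it only if you do not need to keep the line breaks.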

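It is not clear whether you want to keep some whitespace for readability. If you do, you can try the following approach.

Update: added code that keeps only alphanumeric characters, plus the characters in an exception list.

Code: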

from bs4 import BeautifulSoup
import requests


def clean_scraped_text(raw_text):

    # strip whitespaces from start and end of raw text
    stripped_text = raw_text.strip()

    processed_text = ''
    for i, char in enumerate(stripped_text):
        # add a single '\n' to processed_text for every sequence of '\n'
        if char == '\n':
            if stripped_text[i - 1] != '\n':
                processed_text += '\n'
        else:
            # any other character is copied to processed_text unchanged
            processed_text += char

    # strip leading/trailing whitespace from each line of processed_text
    cleaned_text = ''
    for line in processed_text.splitlines():
        # keep only alphanumeric characters plus those in exclude_list
        # (despite its name, exclude_list holds the characters exempt from removal)
        exclude_list = [' ', '\xa0', '-']
        line = ''.join(x for x in line if x.isalnum() or (x in exclude_list))
        cleaned_text += line.strip() + '\n'

    return cleaned_text

URL="https://www.flagstaffsymphony.org/event/a-flag-on-fourth/"
html_content = requests.get(URL).text
text = BeautifulSoup(html_content, "lxml").text
print(clean_scraped_text(text))

Output:

America the Beautiful A Virtual Patriotic Salute  Flagstaff Symphony Orchestra

Contact
Hit enter to search or ESC to close


About
Our Team
Our Conductor
Orchestra Members
Concerts  Events
Season 72 Concerts
Subscribe
Venue Parking  Concerts FAQs
Support The FSO
Donate to FSO
Sponsor a Chair
Funding and Impact
Videos
Donate
Subscription Tickets
All Events
This event has passed
America the Beautiful A Virtual Patriotic Salute
July 4 2020
Violin Virtuoso
Beethoven Virtual 5k
In place of our traditional 4th of July concert at the Pepsi Amphitheater the Flagstaff Symphony Orchestra will present a virtual patriotic salute to be released HERE and our Facebook page at on July 4 2020 at 11am The FSO is proud to offer a special rendition of America the Beautiful performed by 60 of their professional musicians coming together virtually to celebrate our nations independence
CLICK HERE FOR DETAILS
Google Calendar iCal Export
Details
Date
July 4 2020
Event Category Concerts and Events

Violin Virtuoso
Beethoven Virtual 5k

Concert InfoConcerts
Concerts and Events FAQs

FSO InfoAbout FSO Mission and History
Our Team
Our Conductor
Orchestra Members
Support FSOMake a Donation
Underwriting a Concert
Sponsor a Chair
Advertise with FSO
Volunteer
Leave a Legacy
Donor Bill of Rights
Code of Ethical Standards  Used by permission of the Association of Fundraising Professionals
ResourcesCommunity  Education
For Musicians
For Board Members
2021 Flagstaff Symphony Orchestra
Copyright 2019 Flagstaff Symphony Association


About
Our Team
Our Conductor
Orchestra Members
Concerts  Events
Season 72 Concerts
Subscribe
Venue Parking  Concerts FAQs
Support The FSO
Donate to FSO
Sponsor a Chair
Funding and Impact
Videos
Donate
Subscription Tickets
Contact
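
If you want to keep the punctuation and also drop the leftover blank lines, a line-by-line variant might look like the sketch below. It is an alternative under stated assumptions, not the function above: the name normalize_lines and the sample text are hypothetical, and it uses re instead of the character loop.

```python
import re


def normalize_lines(raw_text):
    """Clean each line separately and drop lines that end up empty."""
    cleaned = []
    for line in raw_text.splitlines():
        # Collapse inner whitespace runs (including '\xa0'), then trim the ends.
        line = re.sub(r'\s+', ' ', line).strip()
        if line:  # skip lines that contained only whitespace
            cleaned.append(line)
    return '\n'.join(cleaned)


# Hypothetical sample modelled on the lines in the follow-up question.
sample = (
    "Subscription Tickets\n"
    "\xa0All Events\n"
    "\n\n"
    "  This event has passed.  \n"
    "\tJuly 4,   2020\n"
)
print(normalize_lines(sample))
```

Because each line is stripped on its own, leading spaces such as the one before "All Events" are removed while the line structure is preserved.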

Try this:

from bs4 import BeautifulSoup
import requests
import re


URL="https://www.flagstaffsymphony.org/event/a-flag-on-fourth/"
html_content = requests.get(URL).text
cleantext = BeautifulSoup(html_content, "lxml").text
cleanr = re.compile('<.*?>')
clean_data = re.sub(cleanr, ' ', cleantext)
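text = re.sub(r'\s+', ' ', clean_data)  # collapse every run of whitespace into one space
print(text)
# the text contains non-ASCII characters, so set the file encoding explicitly
with open('read.txt', 'w', encoding='utf-8') as file:
    file.write(text)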

Output:

America the Beautiful: A Virtual Patriotic Salute – Flagstaff Symphony Orchestra Contact Hit enter to search or ESC to close About Our Team Our Conductor Orchestra Members Concerts & Events Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs Support The FSO Donate to FSO Sponsor a Chair Funding and Impact Videos Donate Subscription Tickets « All Events This event has passed. America the Beautiful: A Virtual Patriotic Salute July 4, 2020 « Violin Virtuoso Beethoven Virtual 5k » In place of our traditional 4th of July concert at the Pepsi Amphitheater, the Flagstaff Symphony Orchestra will present a virtual patriotic salute to be released HERE and our Facebook page at on July 4, 2020 at 11am. The FSO is proud to offer a special rendition of “America the Beautiful” performed by 60 of their professional musicians, coming together virtually, to celebrate our nation’s independence. CLICK HERE FOR DETAILS + Google Calendar+ iCal Export Details Date: July 4, 2020 Event Category: Concerts and Events « Violin Virtuoso Beethoven Virtual 5k » Concert InfoConcerts Concerts and Events FAQs FSO InfoAbout FSO Mission and History Our Team Our Conductor Orchestra Members Support FSOMake a Donation Underwriting a Concert Sponsor a Chair Advertise with FSO Volunteer Leave a Legacy Donor Bill of Rights Code of Ethical Standards (Used by permission of the Association of Fundraising Professionals) ResourcesCommunity & Education For Musicians For Board Members © 2021 Flagstaff Symphony Orchestra. © Copyright 2019 Flagstaff Symphony Association About Our Team Our Conductor Orchestra Members Concerts & Events Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs Support The FSO Donate to FSO Sponsor a Chair Funding and Impact Videos Donate Subscription Tickets Contact
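
The output above still contains the non-ASCII characters (–, «, », ©) that the question replaced with spaces. If you want those gone as well, one possible combination is to do the replacement from the question first and then collapse the whitespace. This is only a sketch: the function name to_clean_ascii and the sample string are assumptions for illustration.

```python
import re


def to_clean_ascii(raw_text):
    """Drop non-ASCII characters, then collapse whitespace to single spaces."""
    # Replace each run of non-ASCII characters with one space so that
    # neighbouring words do not run together (same idea as in the question).
    ascii_only = re.sub(r'[^\x00-\x7F]+', ' ', raw_text)
    # Collapse the spaces, tabs and newlines that are left, then trim the ends.
    return re.sub(r'\s+', ' ', ascii_only).strip()


# Hypothetical sample built from fragments of the output above.
sample = "Salute – Flagstaff\n\n« All Events\t»  © 2021"
print(to_clean_ascii(sample))  # Salute Flagstaff All Events 2021
```

Note that typographic apostrophes are removed too, so "nation’s" becomes "nation s", just as in the output shown in the question.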
