Web scraping problem with BeautifulSoup

Posted 2024-07-01 08:12:06


Web scraping issue (screen shot attached)

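from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup

def get_text(value):
    # Relies on the global `soup` set in the loop below
    tdlist = []
    for i in soup.findAll(value):  # Reduce data to those with html tag
        if i.text != "":
            text = i.text
            text = text.strip()
            if '\n' not in text:  # Remove unnecessary data
                tdlist.append(text)
    return tdlist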

Master_df = pd.DataFrame()
logs = []

# df_F is defined elsewhere (not shown); its 'Value' column supplies the URL paths
hh = 0
for tag in df_F['Value']:

    print(hh)  # progress counter
    hh = hh + 1

    try:
        url = 'https://www.ayurveda.com' + tag

        # weblink to scrape
        html = urlopen(url)
        y = html.read()

        soup = BeautifulSoup(y, 'html.parser')  # Parse resulting source

        c_list = []
        Title = []

        for value in ['p']:
            c_list = get_text(value)

        for tes in soup.findAll('h1'):
            Title = tes.text

        com_list = c_list
        com_list = '. '.join(com_list)
        com_list = com_list.replace('..', ". ")

        com_list1 = Title

        df_each = pd.DataFrame(columns=["URL", "Title", "Content", "Category", "Website"], index=range(0, 1))

        df_each["URL"] = url
        df_each["Content"] = com_list
        df_each["Title"] = com_list1
        df_each["Category"] = 'Ayurveda'
        df_each["Website"] = 'Ayurveda'

        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        Master_df = pd.concat([Master_df, df_each])
    except Exception as e:
        print("Hey!, check this :", str(e))
        logs.append(str(e))

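I am trying to download content from a website. These are the two important pieces of information to download:

1) The title, in one column (tagged "Title"). This part is clear, and I am getting the right information.
2) The content, in another column (tagged "p"). I am having trouble getting this information.

Here is the information provided on the website:

The line below I am able to scrape (marked in bold and italic):

by Vasant Lad, BAM&S, MASc

The line below I am not able to scrape (marked in italic):

Ayurveda is considered by many scholars to be the oldest healing science. In Sanskrit, Ayurveda means “The Science of Life.” Ayurvedic knowledge originated in India more than 5,000 years ago and is often called the “Mother of All Healing.” It stems from the ancient Vedic culture and was taught for many thousands of years in an oral tradition from accomplished masters to their disciples. Some of this knowledge was set to print a few thousand years ago, but much of it is inaccessible. The principles of many of the natural healing systems now familiar in the West have their roots in Ayurveda, including Homeopathy and Polarity Therapy.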


1 answer
User
Reply 1 · posted 2024-07-01 08:12:06

The reason you are not getting this paragraph is this line:

if '\n' not in text:

The paragraph you want:

'Ayurveda is considered by many scholars to be the oldest healing science. In Sanskrit, Ayurveda means “The Science of Life.” Ayurvedic knowledge originated\n    in India more than 5,000 years ago and is often called the “Mother of All Healing.” It stems from the ancient Vedic culture and was taught for many\n    thousands of years in an oral tradition from accomplished masters to their disciples. Some of this knowledge was set to print a few thousand years\n    ago, but much of it is inaccessible. The principles of many of the natural healing systems now familiar in the West have their roots in Ayurveda, including\n    Homeopathy and Polarity Therapy.'

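This paragraph contains \n, so it is never appended to tdlist. .strip() only removes newlines and spaces at the beginning and end of the string, not the ones in the middle, so you need a different condition.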
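A quick way to confirm this behavior is to strip a small sample string. The text below is a shortened, made-up stand-in for the scraped paragraph, not the real page content:

```python
# Shortened stand-in for the scraped paragraph (hypothetical sample text)
text = "\n  Ayurvedic knowledge originated\n    in India more than 5,000 years ago.  \n"

stripped = text.strip()

# Leading and trailing whitespace is removed...
print(repr(stripped))

# ...but the newline in the middle survives, so a filter of the form
# `if '\n' not in text` still rejects the paragraph
print('\n' in stripped)
```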

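So you can add an extra condition that picks up the specific content following the tag <p class="bitter">.

I am assuming all the links follow this format.

Modified function: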

def get_text(value):
    tdlist = []
    for i in soup.findAll(value):  # Reduce data to those with html tag
        if i.text != "":
            text = i.text
            text = text.strip()
            # find_previous returns None when there is no earlier matching tag
            prev = i.find_previous(value)
            after_bitter = prev is not None and prev.attrs == {'class': ['bitter']}
            if '\n' not in text or after_bitter:  # Remove unnecessary data
                tdlist.append(text)
    return tdlist

Output:

print (c_list)
['by Vasant Lad, BAM&S, MASc', 'Ayurveda is considered by many scholars to be the oldest healing science. In Sanskrit, Ayurveda means “The Science of Life.” Ayurvedic knowledge originated\n    in India more than 5,000 years ago and is often called the “Mother of All Healing.” It stems from the ancient Vedic culture and was taught for many\n    thousands of years in an oral tradition from accomplished masters to their disciples. Some of this knowledge was set to print a few thousand years\n    ago, but much of it is inaccessible. The principles of many of the natural healing systems now familiar in the West have their roots in Ayurveda, including\n    Homeopathy and Polarity Therapy.', 'Copyright © 2006, Vasant Lad, MASc, and The Ayurvedic Institute. All Rights Reserved.', 'Copyright © 2006, Vasant Lad, MASc, and The Ayurvedic Institute. All Rights Reserved.']
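If the embedded newlines are only a formatting nuisance, one alternative (not what the answer above does) is to collapse whitespace instead of filtering on '\n'. The sketch below assumes the paragraph strings have already been pulled out of the page, for example with [p.text for p in soup.findAll('p')]; the function name clean_paragraphs and the sample list are made up for illustration. Note that it keeps every non-empty paragraph, so dropping unwanted blocks would need a separate rule.

```python
def clean_paragraphs(raw_texts):
    """Collapse whitespace runs in each string and drop empty results."""
    cleaned = []
    for raw in raw_texts:
        # str.split() with no arguments splits on any run of whitespace,
        # including the '\n    ' sequences inside the scraped paragraph
        text = " ".join(raw.split())
        if text:
            cleaned.append(text)
    return cleaned


# Hypothetical sample standing in for [p.text for p in soup.findAll('p')]
sample = [
    "by Vasant Lad, BAM&S, MASc",
    "Ayurvedic knowledge originated\n    in India more than 5,000 years ago",
    "   \n  ",
]
result = clean_paragraphs(sample)
print(result)
```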
