Web scraping problem using BeautifulSoup

from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup


def get_text(value):
    tdlist = []
    for i in soup.findAll(value):  # Reduce data to those with html tag
        if i.text != "":
            text = i.text
            text = text.strip()
            if '\n' not in text:  # Remove unnecessary data
                tdlist.append(text)
    return tdlist


Master_df = pd.DataFrame()
logs = []
hh = 0
for tag in df_F['Value']:  # df_F is defined elsewhere; 'Value' holds the URL paths
    print(hh)
    hh = hh + 1
    try:
        url = 'https://www.ayurveda.com' + tag  # weblink to scrape
        html = urlopen(url)
        y = html.read()
        # Page title is: Scraping
        soup = BeautifulSoup(y, 'html.parser')  # Parse resulting source
        c_list = []
        Title = []
        for value in ['p']:
            c_list = get_text(value)
        for tes in soup.findAll('h1'):
            Title = tes.text
        com_list = c_list
        com_list = '. '.join(com_list)
        com_list = com_list.replace('..', ". ")
        com_list1 = Title
        df_each = pd.DataFrame(columns=["URL", "Title", "Content", "Category", "Website"], index=range(0, 1))
        df_each["URL"] = url
        df_each["Content"] = com_list
        df_each["Title"] = com_list1
        df_each["Category"] = 'Ayurveda'
        df_each["Website"] = 'Ayurveda'
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        Master_df = pd.concat([Master_df, df_each])
    except Exception as e:
        print("Hey!, check this :", str(e))
        logs.append(str(e))

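This is the paragraph I can't scrape (marked in italics on the page):

Ayurveda is considered by many scholars to be the oldest healing science. In Sanskrit, Ayurveda means “The Science of Life.” Ayurvedic knowledge originated in India more than 5,000 years ago and is often called the “Mother of All Healing.” It stems from the ancient Vedic culture and was taught for many thousands of years in an oral tradition from accomplished masters to their disciples. Some of this knowledge was set to print a few thousand years ago, but much of it is inaccessible. The principles of many of the natural healing systems now familiar in the West have their roots in Ayurveda, including Homeopathy and Polarity Therapy.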


1 Answer

User

#1 · Posted 2024-07-01 08:12:06

You are not getting that paragraph because of this line:

if '\n' not in text:

The paragraph you want:

'Ayurveda is considered by many scholars to be the oldest healing science. In Sanskrit, Ayurveda means “The Science of Life.” Ayurvedic knowledge originated\n    in India more than 5,000 years ago and is often called the “Mother of All Healing.” It stems from the ancient Vedic culture and was taught for many\n    thousands of years in an oral tradition from accomplished masters to their disciples. Some of this knowledge was set to print a few thousand years\n    ago, but much of it is inaccessible. The principles of many of the natural healing systems now familiar in the West have their roots in Ayurveda, including\n    Homeopathy and Polarity Therapy.'

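contains \n, so the text is never appended to tdlist. .strip() only removes newlines and whitespace at the start and end of the string, not the ones in the middle, so you need a different condition.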
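To see why the paragraph is filtered out, here is a minimal sketch using a made-up two-line string (the sample text is an assumption, not taken from the page). It shows that .strip() leaves the interior newline in place, and that collapsing whitespace with split() and join() is one possible way to remove it.

```python
# Made-up sample standing in for a paragraph that wraps across lines in the HTML
sample = "  first line\n    second line  \n"

# strip() trims only the leading and trailing whitespace
stripped = sample.strip()
still_has_newline = '\n' in stripped

# split() with no arguments splits on any run of whitespace,
# so joining with a single space collapses the interior newline too
collapsed = " ".join(sample.split())

print(repr(stripped))
print(still_has_newline)
print(repr(collapsed))
```

Note that collapsing whitespace before the '\n' test would let every multi-line paragraph through, so it is only suitable if that filter is no longer needed.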

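So you can add an extra condition that picks up the specific content following the <p class="bitter"> tag.

I am assuming that all of the links follow this format.

Change the function to: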

def get_text(value):
    tdlist = []
    for i in soup.findAll(value):  # Reduce data to those with html tag
        if i.text != "":
            text = i.text
            text = text.strip()
            prev = i.find_previous(value)  # None when i is the first such tag
            after_bitter = prev is not None and prev.attrs == {'class': ['bitter']}
            # Keep single-line text, plus the tag right after <p class="bitter">
            if '\n' not in text or after_bitter:
                tdlist.append(text)
    return tdlist

Output:

print (c_list)
['by Vasant Lad, BAM&S, MASc', 'Ayurveda is considered by many scholars to be the oldest healing science. In Sanskrit, Ayurveda means “The Science of Life.” Ayurvedic knowledge originated\n    in India more than 5,000 years ago and is often called the “Mother of All Healing.” It stems from the ancient Vedic culture and was taught for many\n    thousands of years in an oral tradition from accomplished masters to their disciples. Some of this knowledge was set to print a few thousand years\n    ago, but much of it is inaccessible. The principles of many of the natural healing systems now familiar in the West have their roots in Ayurveda, including\n    Homeopathy and Polarity Therapy.', 'Copyright © 2006, Vasant Lad, MASc, and The Ayurvedic Institute. All Rights Reserved.', 'Copyright © 2006, Vasant Lad, MASc, and The Ayurvedic Institute. All Rights Reserved.']
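The same condition can be tried without hitting the website. The sketch below runs it against a small made-up HTML snippet (the markup, the "unrelated" paragraph, and the copyright text are assumptions for illustration, as is passing soup in as a parameter instead of using the global). It assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Made-up page: a byline with class "bitter", the wanted multi-line paragraph,
# an unrelated multi-line paragraph, and a single-line footer
html = """
<html><body>
<p class="bitter">by Vasant Lad, BAM&amp;S, MASc</p>
<p>Ayurveda is considered by many scholars
    to be the oldest healing science.</p>
<p>Some unrelated text
    that also spans two lines</p>
<p>Copyright notice</p>
</body></html>
"""


def get_text(soup, value):
    kept = []
    for tag in soup.find_all(value):
        text = tag.text.strip()
        if not text:
            continue
        prev = tag.find_previous(value)  # None for the first matching tag
        after_bitter = prev is not None and prev.get("class") == ["bitter"]
        # Keep single-line text, or the tag that directly follows the byline
        if "\n" not in text or after_bitter:
            kept.append(text)
    return kept


soup = BeautifulSoup(html, "html.parser")
result = get_text(soup, "p")
for item in result:
    print(repr(item))
```

The unrelated two-line paragraph is still dropped, while the paragraph after the byline is kept even though it contains a newline.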
