Getting the length of the sentence between tags in BeautifulSoup4

Posted 2024-10-02 10:24:39


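I am trying to collect some statistics from a website. What I want to do is extract each word and count the number of adjacent words found in the same tag.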

Input

<div class="col-xs-12">
   <p class="w50">Operating Temperature (Min.)[°C]</p>
   <p class="w50 upperC">-40</p>
</div>

should result in

Tag 1

Operating, 2 i.e. #<Temperature, (Min.)[°C]>
Temperature, 2 i.e. #<Operating, (Min.)[°C]>
(Min.)[°C], 2 i.e. #<Operating, Temperature>

Tag 2

-40, 0
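
The counting rule in the expected output above can be sketched in plain Python: every whitespace-separated word in a tag's text is paired with the number of other words in that same text. The function name `neighbor_counts` is made up for this illustration, and splitting on whitespace is an assumption about what counts as a word.

```python
def neighbor_counts(text):
    """Pair each word with the number of other words in the same text."""
    words = text.split()  # split on any run of whitespace
    return [(word, len(words) - 1) for word in words]


# Text of the two <p> tags from the input above
for word, count in neighbor_counts("Operating Temperature (Min.)[°C]"):
    print(f"{word}, {count}")

for word, count in neighbor_counts("-40"):
    print(f"{word}, {count}")
```

This only covers the counting step. It says nothing yet about how to get the text of one tag at a time.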

This is what I ended up with, but it extracts the entire text

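import urllib.request

from bs4 import BeautifulSoup

url = 'https://www.rohm.com/products/wireless-communication/wireless-lan-modules/bp3580-product#'
with urllib.request.urlopen(url) as response:
    page = response.read()

soup = BeautifulSoup(page, features='lxml')

# [print(tag.name) for tag in soup.find_all()]

# Drop <script> and <style> elements so their contents are not treated as text
for script in soup(["script", "style"]):
    script.decompose()  # rip it out

invalid_tags = ['br']

# Remove the listed tags themselves but keep whatever they contain
for tag in invalid_tags:
    for match in soup.find_all(tag):
        match.unwrap()

# recursive=False returns only the top-level tags, so get_text() yields the whole page text
html = soup.find_all(recursive=False)

for tag in html:
    print(tag.get_text())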

I tried testing with recursive=True, but the results contained a lot of duplicates
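
The duplication with a recursive search can be reproduced on the small input from the question: `find_all()` returns the outer `<div>` as well as each `<p>`, and `get_text()` on the `<div>` already includes the text of its children. This is a minimal sketch, assuming the built-in `html.parser` backend instead of `lxml` (chosen here only to avoid the extra dependency); the variable names are illustrative.

```python
from bs4 import BeautifulSoup

sample = """
<div class="col-xs-12">
   <p class="w50">Operating Temperature (Min.)[°C]</p>
   <p class="w50 upperC">-40</p>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")

# find_all() searches recursively by default, so the <div> and both <p> tags are returned
texts = [(tag.name, tag.get_text(" ", strip=True)) for tag in soup.find_all()]

for name, text in texts:
    print(name, "->", text)
```

The text of both `<p>` tags shows up twice: once under the `<div>` and once under its own tag.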


1 answer
User
#1 · Posted 2024-10-02 10:24:39

It may not be exactly the result you are after, but at least it gives you a hint. I modified your code.

import urllib.request

from bs4 import BeautifulSoup

url = 'https://www.rohm.com/products/wireless-communication/wireless-lan-modules/bp3580-product#'
with urllib.request.urlopen(url) as response:
    page = response.read()

soup = BeautifulSoup(page, features='lxml')

for script in soup(["script", "style"]):
    script.decompose()  # rip it out

invalid_tags = ['br']

for tag in invalid_tags:
    for match in soup.find_all(tag):
        match.unwrap()

html = soup.find_all(recursive=False)

# Split the page text into lines and keep the non-empty ones
textlist = []
for tag in html:
    text = tag.text.replace("\r", "").replace("\t", "").split("\n")
    for t in text:
        if t != '':
            textlist.append(t)

# For each line, print every word with the number of other words on that line
for tt in textlist:
    print(tt)
    words = tt.split()
    for ts in words:
        print("{}, {}".format(ts, len(words) - 1))
    print("              -")
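
The code above groups words by line of text rather than by tag. One possible variation that groups by tag, as the question asks, is to visit only leaf tags (tags with no child tags), so that a container's text is never counted again. This is a sketch under assumptions, not a tested solution for the ROHM page: `counts_per_tag` is a hypothetical helper name, `html.parser` is used in place of `lxml`, and a tag that mixes its own text with child tags would be skipped.

```python
from bs4 import BeautifulSoup


def counts_per_tag(html):
    """Return, for each leaf tag, a list of (word, other_words_in_tag) pairs."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for tag in soup.find_all():
        # A tag that contains another tag is a container, not a leaf: skip it
        if tag.find(True) is not None:
            continue
        words = tag.get_text().split()
        if words:
            results.append([(word, len(words) - 1) for word in words])
    return results


sample = """
<div class="col-xs-12">
   <p class="w50">Operating Temperature (Min.)[°C]</p>
   <p class="w50 upperC">-40</p>
</div>
"""

for number, pairs in enumerate(counts_per_tag(sample), start=1):
    print("Tag", number)
    for word, count in pairs:
        print(f"{word}, {count}")
```

On the sample input this prints the three words of the first `<p>` with a count of 2 each, then `-40, 0` for the second.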
