How do I scrape the text of a paragraph when the paragraph tag contains other tags?

<div class="thecontent"> Here’s the schedule of matches for the weekend.   Saturday, August 17 Achara vs. Buad, <a href="@">ftv</a>, <a href="https://someothertv">HTlive</a>, <a href="http://www.anothertv target="_blank">Se</a> — Have enjoy it and celebrate it pritos vs. baola, <a href="@">ftv</a>, <a href="https://someothertv">HTlive</a>, <a href="http://www.anothertv target="_blank">Se</a> — Have enjoy it and celebrate it timpao vs. quadrsa, <a href="@">ftv</a>, <a href="https://someothertv">HTlive</a>, <a href="http://www.anothertv target="_blank">Se</a> — Have enjoy it and celebrate it Sunday, August 18 Achara vs. timpao, <a href="@">ftv</a>, <a href="https://someothertv">HTlive</a>, <a href="http://www.anothertv target="_blank">Se</a> — Have enjoy it and celebrate it pritos vs. qaudra, <a href="@">ftv</a>, <a href="https://someothertv">HTlive</a>, <a href="http://www.anothertv target="_blank">Se</a> — Have enjoy it and celebrate it timpao vs. Buad, <a href="@">ftv</a>, <a href="https://someothertv">HTlive</a>, <a href="http://www.anothertv target="_blank">Se</a> — Have enjoy it and celebrate it   Monday, August 19 Achara vs. Buad, <a href="@">ftv</a>, <a href="https://someothertv">HTlive</a>, <a href="http://www.anothertv target="_blank">Se</a> — Have enjoy it and celebrate it  </div></body></html>

import bs4, requests

getnwp = requests.get('https://url')
nwpcontent = getnwp.content
sp2 = bs4.BeautifulSoup(nwpcontent, 'html5lib')
pta = sp2.find('div', class_='thecontent').find_all('p')
for i in range(len(pta)):
    if pta[i].get_text().find("vs") != -1:
        print(pta[i].get_text())

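2 Answers

User

#1 · edited 2024-09-26 17:43:19

I don't know what the real source looks like. As one example, you could remove the a tags and then target the paragraphs with :has and :not(:empty). This requires bs4 4.7.1+.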

from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://worldsoccertalk.com/2019/08/16/epl-commentator-assignments-nbc-sports-gameweek-2-3/')
soup = bs(r.content, 'lxml')

for a in soup("a"):
    a.decompose()

for i in soup.select('.thecontent p:has(strong:not(:contains("SEE MORE"))), .thecontent p:has(strong:not(:contains("SEE MORE"))) ~ p:not(:empty)'):
    data = i.text.strip()
    if data:
        if ' vs. ' in data:
            items = data.split(',')
            print(', '.join([items[0], items[-1].split('—')[1]]))
        else:
            print(data)
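
The link-removal step can also be tried offline. The following is a minimal sketch, not the answerer's exact method: it assumes a small hypothetical HTML sample modeled on the question's markup (the real page may differ), uses the built-in html.parser instead of lxml, and skips the :has selector entirely. The names SAMPLE_HTML and paragraphs_without_links are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modeled on the question's markup; the real page may differ.
SAMPLE_HTML = """
<div class="thecontent">
  <p>Saturday, August 17</p>
  <p>Achara vs. Buad, <a href="#">ftv</a>, <a href="#">HTlive</a>, <a href="#">Se</a> — Have enjoy it and celebrate it</p>
  <p></p>
  <p>pritos vs. baola, <a href="#">ftv</a> — Have enjoy it and celebrate it</p>
</div>
"""


def paragraphs_without_links(html):
    """Return the text of each non-blank <p> in .thecontent, with <a> tags removed."""
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a"):
        a.decompose()  # drop the link element and its text from the tree
    texts = []
    for p in soup.find("div", class_="thecontent").find_all("p"):
        text = " ".join(p.get_text().split())  # collapse runs of whitespace
        if text:  # skip empty paragraphs
            texts.append(text)
    return texts


lines = paragraphs_without_links(SAMPLE_HTML)
for line in lines:
    print(line)
```

Note that the commas that separated the links stay behind in the text, which is presumably why the answer above splits on ',' and keeps only the first and last pieces.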

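User

#2 · edited 2024-09-26 17:43:19

It looks like the paragraphs that hold the content also contain the tagline " — Have enjoy it and celebrate it", so it is always appended when you retrieve their text. What you can do is strip the tail of the string with something like: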

text = pta[i].get_text()
if len(text) > 33:  # only trim when the string is longer than the tagline
    text = text[:-33]

This removes the last 33 characters from the resulting string.
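
A fixed character count breaks as soon as the tagline changes length. As a hedged alternative, the sketch below cuts at the last separator instead. It assumes the tagline is always introduced by " — " as in the question's sample; strip_tagline and SEPARATOR are hypothetical names, not part of bs4.

```python
# Assumed tagline separator, taken from the sample text in the question.
SEPARATOR = " — "


def strip_tagline(text, separator=SEPARATOR):
    """Return text up to the last separator; unchanged if the separator is absent."""
    head, sep, _tail = text.rpartition(separator)
    # rpartition returns ('', '', text) when the separator is not found
    return head if sep else text


sample = "Achara vs. Buad, ftv, HTlive, Se — Have enjoy it and celebrate it"
print(strip_tagline(sample))
print(strip_tagline("Saturday, August 17"))
```

Lines without the separator, such as the date headings, pass through untouched, so the function can be applied to every paragraph's get_text() result without a length check.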
