How to extract text without HTML tags using Python

Posted 2024-10-03 21:26:28


How can I extract each sentence without the HTML tags and then add the sentences to lists?

For example, I would like to get:

without_bracket = ['Jomi Jomi, okuroro ni i soni da', 'Joosua, ajooko bi eni wogbe.', ...]

with_bracket = ["Insisting that one's children act like one makes one a wicked person", 'Joshua, a name that sounds like an act of jumping into the bush']

from this HTML:

<div class='post-body entry-content' id='post-body-627561819859082887' itemprop='description articleBody'>
- Jomi Jomi, okuroro ni i soni da.. (Insisting that one's children act like one makes one a wicked person).<br />
- Joosua, ajooko bi eni wogbe. (Joshua, a name that sounds like an act of jumping into the bush).<br />
- Ka gbekun yile, kii se egbe aja laelae ( The fall of a leopard does not mean he can be likened to a dog).<br />
-Kaka ko san fun alajapa, pipa lori igun n pa. (Instead of things to get better for the trader, he is turning bald like a vulture).<br />
- Kini apari wa de iso onigbajamo.( what is a bald man doing in a barber's shop?)<br />
-Ko seye to le dori kodo bi adan, afi eyi ti eje yio t'enu re jade.(Hanging upside down is the unique nature of a bat, any bird that tries to imitate this unique nature will see blood running down its mouth).<br />
-Ko si iru kaun lawujo okuta.( there is no stone like potash, it is matchless.)<br />
-Kosi ohun to kan baalu pelu pe ona moto ko dara.( The aeroplane has no business with a bad road).<br />
<div style='clear: both;'></div>
</div>


2 answers

Try something like this:

from bs4 import BeautifulSoup
import re

html = """
<div class='post-body entry-content' id='post-body-627561819859082887' itemprop='description articleBody'>
- Jomi Jomi, okuroro ni i soni da.. (Insisting that one's children act like one makes one a wicked person).<br />
- Joosua, ajooko bi eni wogbe. (Joshua, a name that sounds like an act of jumping into the bush).<br />
- Ka gbekun yile, kii se egbe aja laelae ( The fall of a leopard does not mean he can be likened to a dog).<br />
-Kaka ko san fun alajapa, pipa lori igun n pa. (Instead of things to get better for the trader, he is turning bald like a vulture).<br />
- Kini apari wa de iso onigbajamo.( what is a bald man doing in a barber's shop?)<br />
-Ko seye to le dori kodo bi adan, afi eyi ti eje yio t'enu re jade.(Hanging upside down is the unique nature of a bat, any bird that tries to imitate this unique nature will see blood running down its mouth).<br />
-Ko si iru kaun lawujo okuta.( there is no stone like potash, it is matchless.)<br />
-Kosi ohun to kan baalu pelu pe ona moto ko dara.( The aeroplane has no business with a bad road).<br />
<div style='clear: both;'></div>
</div> 
       """
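# Parse the HTML and take the text of the first <div>; the <br /> tags are dropped.
soup = BeautifulSoup(html, 'html.parser')
text = soup.find('div').text.rstrip()

# Text inside parentheses: the English translations.
with_bracket = re.findall(r'\(([^)]+)', text)
print(with_bracket)

# Delete the parenthesised parts, then split on the '-' that starts each line.
without_bracket = re.sub(r'\([^)]*\)', '', text)
without_bracket = [s.rstrip() for s in without_bracket.split('-')]
without_bracket.remove('')  # drop the empty piece before the first '-'
print(without_bracket)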

Result:

["Insisting that one's children act like one makes one a wicked person", 'Joshua, a name that sounds like an act of jumping into the bush', ' The fall of a leopard does not mean he can be likened to a dog', 'Instead of things to get better for the trader, he is turning bald like a vulture', " what is a bald man doing in a barber's shop?", 'Hanging upside down is the unique nature of a bat, any bird that tries to imitate this unique nature will see blood running down its mouth', ' there is no stone like potash, it is matchless.', ' The aeroplane has no business with a bad road']
[' Jomi Jomi, okuroro ni i soni da.. .', ' Joosua, ajooko bi eni wogbe. .', ' Ka gbekun yile, kii se egbe aja laelae .', 'Kaka ko san fun alajapa, pipa lori igun n pa. .', ' Kini apari wa de iso onigbajamo.', "Ko seye to le dori kodo bi adan, afi eyi ti eje yio t'enu re jade..", 'Ko si iru kaun lawujo okuta.', 'Kosi ohun to kan baalu pelu pe ona moto ko dara..']
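
The second list above still carries leading spaces and stray " ." endings, and splitting on every '-' would also break a proverb that contains a hyphen. A minimal sketch of a line-by-line alternative follows. It assumes `text` is the plain string returned by `soup.find('div').text`, with one proverb per line; `LINE_RE` and `split_proverbs` are hypothetical names that do not appear in the answer. Unlike the answer, it also drops the full stops around the parentheses, which is a choice rather than a requirement.

```python
import re

# Hypothetical pattern: "- proverb (translation)" on a single line.
# "plain" stops before any dots/spaces that precede the opening parenthesis.
LINE_RE = re.compile(r"^\s*-\s*(?P<plain>[^(]*?)[\s.]*\(\s*(?P<inner>[^)]*?)\s*\)")

def split_proverbs(text):
    """Return (without_bracket, with_bracket) for text with one proverb per line."""
    without_bracket, with_bracket = [], []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # blank line or a line without "- ... ( ... )"
        without_bracket.append(match.group("plain"))
        # Drop a trailing full stop inside the parentheses, if any.
        with_bracket.append(match.group("inner").rstrip(". "))
    return without_bracket, with_bracket

# Two lines in the shape that soup.find('div').text produces for the HTML above.
sample = (
    "\n- Joosua, ajooko bi eni wogbe. (Joshua, a name that sounds like an act of jumping into the bush).\n"
    "-Ko si iru kaun lawujo okuta.( there is no stone like potash, it is matchless.)\n"
)
without_bracket, with_bracket = split_proverbs(sample)
print(without_bracket)
print(with_bracket)
```

Because the pattern is anchored to the start of each line, only the leading '-' is treated as a separator.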

A solution using SimplifiedDoc:

from simplified_scrapy.simplified_doc import SimplifiedDoc
html='''<div class='post-body entry-content' id='post-body-627561819859082887' itemprop='description articleBody'>
- Jomi Jomi, okuroro ni i soni da.. (Insisting that one's children act like one makes one a wicked person).<br />
- Joosua, ajooko bi eni wogbe. (Joshua, a name that sounds like an act of jumping into the bush).<br />
- Ka gbekun yile, kii se egbe aja laelae ( The fall of a leopard does not mean he can be likened to a dog).<br />
-Kaka ko san fun alajapa, pipa lori igun n pa. (Instead of things to get better for the trader, he is turning bald like a vulture).<br />
- Kini apari wa de iso onigbajamo.( what is a bald man doing in a barber's shop?)<br />
-Ko seye to le dori kodo bi adan, afi eyi ti eje yio t'enu re jade.(Hanging upside down is the unique nature of a bat, any bird that tries to imitate this unique nature will see blood running down its mouth).<br />
-Ko si iru kaun lawujo okuta.( there is no stone like potash, it is matchless.)<br />
-Kosi ohun to kan baalu pelu pe ona moto ko dara.( The aeroplane has no business with a bad road).<br />
<div style='clear: both;'></div>
</div>'''
doc = SimplifiedDoc(html)
# Get the text of the first <div> and split it into lines.
lst = doc.div.getText('\n').split('\n')
# Other ways to locate the same div:
# div = doc.getElement('div',attr='id',value='post-body-627561819859082887')
# div = doc.getElement('div',attr='class',value='post-body entry-content')
# div = doc.getElement('div',attr='itemprop',value='description articleBody')
without_bracket = []
with_bracket = []
for line in lst:
    # Split on '(': the proverb comes first, the translation second.
    tmp = line.split('(')
    without_bracket.append(tmp[0].strip('-').strip())
    with_bracket.append(tmp[1].strip('.)').strip())
print(without_bracket)
print(with_bracket)

Result:

['Jomi Jomi, okuroro ni i soni da..', 'Joosua, ajooko bi eni wogbe.', 'Ka gbekun yile, kii se egbe aja laelae', 'Kaka ko san fun alajapa, pipa lori igun n pa.', 'Kini apari wa de iso onigbajamo.', "Ko seye to le dori kodo bi adan, afi eyi ti eje yio t'enu re jade.", 'Ko si iru kaun lawujo okuta.', 'Kosi ohun to kan baalu pelu pe ona moto ko dara.']
["Insisting that one's children act like one makes one a wicked person", 'Joshua, a name that sounds like an act of jumping into the bush', 'The fall of a leopard does not mean he can be likened to a dog', 'Instead of things to get better for the trader, he is turning bald like a vulture', "what is a bald man doing in a barber's shop?", 'Hanging upside down is the unique nature of a bat, any bird that tries to imitate this unique nature will see blood running down its mouth', 'there is no stone like potash, it is matchless', 'The aeroplane has no business with a bad road']
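
If installing a third-party parser is not an option, the same idea can be sketched with only the standard library. The example below assumes that every proverb ends with a `<br />` tag, as in the HTML above; `BrLineCollector` is a hypothetical helper name, and chunks without a parenthesised translation are skipped rather than raising an `IndexError`.

```python
from html.parser import HTMLParser

class BrLineCollector(HTMLParser):
    """Collect text chunks, starting a new chunk at every <br> tag."""

    def __init__(self):
        super().__init__()
        self.lines = [""]

    def handle_starttag(self, tag, attrs):
        # <br /> is also reported here, via the default handle_startendtag.
        if tag == "br":
            self.lines.append("")

    def handle_data(self, data):
        self.lines[-1] += data

html = """<div class='post-body entry-content'>
- Kini apari wa de iso onigbajamo.( what is a bald man doing in a barber's shop?)<br />
-Kosi ohun to kan baalu pelu pe ona moto ko dara.( The aeroplane has no business with a bad road).<br />
<div style='clear: both;'></div>
</div>"""

parser = BrLineCollector()
parser.feed(html)
parser.close()

without_bracket, with_bracket = [], []
for raw in parser.lines:
    plain, sep, rest = raw.partition("(")
    if not sep:
        continue  # no "(" in this chunk, so nothing to pair
    # Remove the leading newline, dash and spaces from the proverb.
    without_bracket.append(plain.strip(" -\n"))
    # Keep everything before the last ")" as the translation.
    with_bracket.append(rest.rpartition(")")[0].strip())

print(without_bracket)
print(with_bracket)
```

Using `partition` and `rpartition` keeps the proverb and its translation paired even when one of them contains extra punctuation.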
