<p>I have an HTML document in the following format:</p>
<pre><code> <html><body><h2>Lorem ipsum <span name="datetime" class="0">dolor <strong>
sit</strong></span> amet, consectetur adipiscing elit.</h2>
<p>Morbi sit amet malesuada nisl. <span name="address" class="1">Phasellus <strong>rhoncus diam</strong> sit amet augue dictum</span>,
porta interdum odio tempus.</p></body></html>
</code></pre>
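<p>My output should be two lists: one containing every word in the text, and one containing, for each word, the <code>name</code> of the enclosing span if there is one, otherwise <code>None</code>.</p>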
<pre><code> word list:
Lorem
ipsum
dolor
sit
amet
consectetur
adipiscing
elit
Morbi
sit
amet
malesuada
nisl
Phasellus
rhoncus
diam
sit
amet
augue
dictum
porta
interdum
odio
tempus
</code></pre>
<pre><code> name list:
None
None
datetime
datetime
None
None
None
None
None
None
None
None
None
address
address
address
None
None
None
None
None
None
None
None
</code></pre>
<p>My code:</p>
<pre><code>from bs4 import BeautifulSoup

input_file = BeautifulSoup(open("ex2.html", 'r'), 'lxml')
tags = input_file.find_all()
word_list = []
name_list = []
# Replace each punctuation character with a space (both arguments must have equal length)
punctuation = ":[];.,#&*\\/"
translator = str.maketrans(punctuation, " " * len(punctuation))
for tag in tags:
    try:
        name = tag.attrs['name']
    except:
        name = None
    words = tag.text.translate(translator)
    words = words.split(" ")
    for word in words:
        if words != '':
            word_list.append(word)
            name_list.append(name)
print(word_list)
print(name_list)
</code></pre>
<p>My output:</p>
<pre><code>['Lorem', 'ipsum', 'dolor', 'sit', 'amet', '', 'consectetur', 'adipiscing', 'elit', 'Morbi', 'sit', 'amet', 'malesuada', 'nisl', '', 'Phasellus', 'rhoncus', 'diam', 'sit', 'amet', 'augue', 'dictum', '', 'porta', 'interdum', 'odio', 'tempus', '\n', 'Lorem', 'ipsum', 'dolor', 'sit', 'amet', '', 'consectetur', 'adipiscing', 'elit', 'Morbi', 'sit', 'amet', 'malesuada', 'nisl', '', 'Phasellus', 'rhoncus', 'diam', 'sit', 'amet', 'augue', 'dictum', '', 'porta', 'interdum', 'odio', 'tempus', '\n', 'Lorem', 'ipsum', '', 'dolor', 'sit', 'dolor', 'sit', 'sit', 'Morbi', 'sit', 'amet', 'malesuada', 'nisl', '', 'Phasellus', 'rhoncus', 'diam', 'sit', 'amet', 'augue', 'dictum', '', 'porta', 'interdum', 'odio', 'tempus', '', 'Phasellus', 'rhoncus', 'diam', 'sit', 'amet', 'augue', 'dictum', 'rhoncus', 'diam']
[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 'datetime', 'datetime', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 'address', 'address', 'address', 'address', 'address', 'address', 'address', None, None]
</code></pre>
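<p>The problems are:<br/>
A. Some of the text shows up multiple times across the tags, and I have no idea how to fix it.<br/>
B. Some words are empty strings (<code>''</code>), but they are still added to the list even though I check for them in the <code>if</code> block.</p>
<p>Any suggestions would be much appreciated :)</p>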
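<p>For what it's worth, both symptoms seem to have simple causes. <code>find_all()</code> with no arguments returns every tag, including nested ones, and <code>tag.text</code> includes the text of all descendants, so the same words are collected once per enclosing tag. The empty strings get through because the check compares <code>words</code> (the list) rather than <code>word</code> to <code>''</code>. The sketch below is one possible way around both, not a definitive fix: it walks the text nodes instead of the tags, so each piece of text is seen once, and looks up the nearest ancestor that has a <code>name</code> attribute. The function name <code>words_with_names</code> is made up for illustration, the HTML is inlined instead of read from <code>ex2.html</code>, and <code>html.parser</code> is swapped in for <code>lxml</code> only to avoid the extra dependency. Splitting on <code>\w+</code> is also an assumption; it breaks hyphenated words and contractions apart.</p>

```python
import re

from bs4 import BeautifulSoup

# Sample document from the question, inlined so the sketch is self-contained.
HTML = """<html><body><h2>Lorem ipsum <span name="datetime" class="0">dolor <strong>
sit</strong></span> amet, consectetur adipiscing elit.</h2>
<p>Morbi sit amet malesuada nisl. <span name="address" class="1">Phasellus <strong>rhoncus diam</strong> sit amet augue dictum</span>,
porta interdum odio tempus.</p></body></html>"""


def words_with_names(html):
    """Return (word_list, name_list), visiting every text node exactly once."""
    soup = BeautifulSoup(html, "html.parser")
    word_list, name_list = [], []
    # string=True yields the text nodes themselves, so nested tags
    # no longer contribute the same text several times.
    for text_node in soup.find_all(string=True):
        # Nearest enclosing tag that carries a `name` attribute, if any.
        # attrs={...} is needed because `name=` would mean the tag name.
        owner = text_node.find_parent(attrs={"name": True})
        name = owner["name"] if owner is not None else None
        # \w+ keeps only runs of word characters, so punctuation and
        # whitespace never produce empty strings.
        for word in re.findall(r"\w+", str(text_node)):
            word_list.append(word)
            name_list.append(name)
    return word_list, name_list


word_list, name_list = words_with_names(HTML)
for word, name in zip(word_list, name_list):
    print(word, name)
```

<p>Note that with this approach every word inside the second span, from "Phasellus" through "dictum", is labelled <code>address</code>, since all of them sit inside that span in the sample document.</p>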