<p>This is part of a sample <code>test.html</code> file:</p>
<pre><code><html>
<body>
<div>
...
...
<table class="width-max">
<tr>
<td style="max-width: 300px; min-width:300px;">
<a href="nowhere.com">
<h2>
<b>
<font size="3">
My College
</font>
</b>
</h2>
</a>
<h4>
<font size="2">
My Name
</font>
<br/>
</h4>
My Address
<br/>
My City, XY 19604
<br/>
My Country
<br/>
<br/>
Email:
<a href="javascript:NewWindow=window.open('nowhere.com;email=example@nowhere.edu','NewWindow','width=600,height=600,menubar=0');NewWindow.focus()">
example@nowhere.edu
</a>
<br/>
Website:
<a href="http://www.nowhere.edu" target="newwindow">
http://www.nowhere.edu
</a>
<br/>
<br/>
<br/>
</td>
...
...
</table>
<hr/>
<table class="width-max">
<tr>
<td style="max-width: 300px; min-width:300px;">
<a href="nowhere.com">
<h2>
<b>
<font size="3">
His College
</font>
</b>
</h2>
</a>
<h4>
<font size="2">
His Name
</font>
<br/>
</h4>
His Address
<br/>
His City, YX 49506
<br/>
His Country
<br/>
<br/>
Phone: XX-YY-ZZ
<br/>
Email:
<a href="javascript:NewWindow=window.open('nowhere.com;email=example@nowhere2.edu','NewWindow','width=600,height=600,menubar=0');NewWindow.focus()">
example@nowhere2.edu
</a>
<br/>
Website:
<a href="http://nowhere2.edu/" target="newwindow">
http://www.nowhere2.edu
</a>
<br/>
<br/>
...
...
</table>
...
...
</div>
</body>
</html>
</code></pre>
<p>My desired output:</p>
<pre><code>My College
My Name
My Address
My City, XY 19604
My Country
Email:
example@nowhere.edu
Website:
http://www.nowhere.edu
His College
His Name
His Address
His City, YX 49506
His Country
Phone: XX-YY-ZZ
Email:
example@nowhere2.edu
Website:
http://www.nowhere2.edu
</code></pre>
<p>At first I tried:</p>
<pre><code>from bs4 import BeautifulSoup

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

# Print the text of every table with class "width-max"
tables = soup.find_all('table', class_='width-max')
for table in tables:
    print(table.get_text())
</code></pre>
<p>It prints each piece of text on its own line, but it also produces a lot of <code>blank lines</code> and <code>white spaces</code>:</p>
<pre><code>
My College
My Name
...
</code></pre>
<p>Then I tried:</p>
<pre><code>from bs4 import BeautifulSoup

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

tables = soup.find_all('table', class_='width-max')
for table in tables:
    # split() breaks on any whitespace, including newlines
    texts = ' '.join(table.text.split())
    print(texts)
</code></pre>
<p>This removes the <code>blank lines</code> and <code>white spaces</code>, but it merges all the text into a single line:</p>
<pre><code>My College My Name My Address ... ... http://www.nowhere2.edu
</code></pre>
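<p>The merge happens because <code>split()</code> with no arguments splits on every run of whitespace, newlines included, so joining with a space throws the line breaks away. Below is a minimal sketch of a line-preserving variant that uses only the standard library. The helper name <code>collapse_blank_lines</code> and the sample string <code>raw</code> are hypothetical, made up for illustration as a stand-in for what <code>table.get_text()</code> returns:</p>

```python
def collapse_blank_lines(text):
    """Strip each line and drop the empty ones, keeping one item per line."""
    stripped = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in stripped if line)


# Hypothetical stand-in for the string returned by table.get_text().
raw = "\n\n  My College \n\n\nMy Name\n   \nMy Address\n"

merged = " ".join(raw.split())       # whitespace split: everything on one line
cleaned = collapse_blank_lines(raw)  # line split: one non-empty line per item
print(cleaned)
```

<p>This only works when each piece of text already sits on its own line in the raw string, which appears to be the case for the sample above.</p>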
<p>Finally, I tried the <code>strip()</code> method, and I also tried using <code>replace_with()</code> to replace <code><br></code> with <code>\n</code>. But I still have not managed to print the exact output.</p>
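<p>One direction that might give the desired output without touching the <code><br></code> tags is Beautiful Soup's <code>stripped_strings</code> generator, which yields each text node with surrounding whitespace removed and skips whitespace-only nodes. The sketch below is not a definitive solution. It assumes the built-in <code>html.parser</code> instead of <code>lxml</code>, and it uses an abbreviated inline sample (<code>SAMPLE_HTML</code>) and a hypothetical helper name (<code>table_lines</code>) rather than the real <code>test.html</code>:</p>

```python
from bs4 import BeautifulSoup

# Hypothetical, abbreviated stand-in for test.html.
SAMPLE_HTML = """
<table class="width-max">
<tr>
<td>
<a href="nowhere.com"><h2><b><font size="3">
My College
</font></b></h2></a>
<h4><font size="2">
My Name
</font><br/></h4>
My Address
<br/>
My City, XY 19604
<br/>
<br/>
Email:
<a href="#">
example@nowhere.edu
</a>
</td>
</tr>
</table>
"""


def table_lines(html, parser="html.parser"):
    """Return the non-empty, stripped text nodes of every matching table."""
    soup = BeautifulSoup(html, parser)
    lines = []
    for table in soup.find_all("table", class_="width-max"):
        # stripped_strings skips whitespace-only nodes and strips the rest
        lines.extend(table.stripped_strings)
    return lines


lines = table_lines(SAMPLE_HTML)
print("\n".join(lines))
```

<p>Calling <code>table.get_text(separator="\n", strip=True)</code> should give the same text already joined into one string. Either way, a single text node that spans several lines would stay as one item, so this relies on the items being separated by tags as they are in the sample.</p>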