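<p>You need to clean up the value returned by <code>table.get_text()</code> so that each row prints on its own line.<br/>
With two regular expressions, you can do it like this:</p>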
<pre><code>from bs4 import BeautifulSoup
import re

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

tables = soup.find_all('table', class_='width-max')
for table in tables:
    # Drop the literal '...' markers, then remove runs of 3+ spaces
    text = re.sub(r" {3,}", "", table.get_text().replace('...', ''))
    # Collapse consecutive newlines into a single newline
    print(re.sub(r"(\n)+", r"\n", text), end="")
</code></pre>
<p>This outputs:</p>
<pre><code>My College
My Name
My Address
My City, XY 19604
My Country
Email:
example@nowhere.edu
Website:
http://www.nowhere.edu
His College
His name
His Address
His City, YX 49506
His Country
Phone: XX-YY-ZZ
Email:
example@nowhere2.edu
Website:
http://nowhere2.edu
</code></pre>
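<p>The first regular expression, <code>r" {3,}"</code>, removes every run of three or more consecutive spaces (not blank lines). The second, <code>re.sub(r"(\n)+", r"\n", ...)</code>, replaces each run of consecutive newlines with a single newline, so that <code>print</code> emits the data one line at a time.<br/>
In addition, to match the expected output, <code>get_text().replace('...', '')</code> strips the literal <code>...</code> from the text.</p>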
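<p>To see what each step does in isolation, here is a minimal sketch that applies the same three operations to a plain string. The <code>raw</code> value is a made-up stand-in for what <code>table.get_text()</code> might return, and <code>clean_text</code> is a hypothetical helper name. Neither comes from the original page.</p>

```python
import re

# Hypothetical stand-in for table.get_text(): indented lines,
# blank lines, and a literal '...' marker.
raw = (
    "\n\n      My College\n\n\n"
    "      My Name...\n"
    "      Email:\n\n"
    "      example@nowhere.edu\n"
)

def clean_text(text):
    """Strip '...' markers and runs of 3+ spaces, then collapse newlines."""
    without_dots = text.replace('...', '')
    without_padding = re.sub(r" {3,}", "", without_dots)
    return re.sub(r"\n+", "\n", without_padding)

cleaned = clean_text(raw)
print(cleaned, end="")
```

<p>Note that the single space inside <code>My College</code> survives, because only runs of three or more spaces are removed. A single leading newline also remains; call <code>.lstrip("\n")</code> on the result if that matters for your output.</p>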