Here is part of the sample test.html file:
<html>
<body>
<div>
...
...
<table class="width-max">
<tr>
<td style="max-width: 300px; min-width:300px;">
<a href="nowhere.com">
<h2>
<b>
<font size="3">
My College
</font>
</b>
</h2>
</a>
<h4>
<font size="2">
My Name
</font>
<br/>
</h4>
My Address
<br/>
My City, XY 19604
<br/>
My Country
<br/>
<br/>
Email:
<a href="javascript:NewWindow=window.open('nowhere.com;email=example@nowhere.edu','NewWindow','width=600,height=600,menubar=0');NewWindow.focus()">
example@nowhere.edu
</a>
<br/>
Website:
<a href="http://www.nowhere.edu" target="newwindow">
http://www.nowhere.edu
</a>
<br/>
<br/>
<br/>
</td>
...
...
</table>
<hr/>
<table class="width-max">
<tr>
<td style="max-width: 300px; min-width:300px;">
<a href="nowhere.com">
<h2>
<b>
<font size="3">
His College
</font>
</b>
</h2>
</a>
<h4>
<font size="2">
His name
</font>
<br/>
</h4>
His Address
<br/>
His City, YX 49506
<br/>
His Country
<br/>
<br/>
Phone: XX-YY-ZZ
<br/>
Email:
<a href="javascript:NewWindow=window.open('nowhere.com;email=example@nowhere2.edu','NewWindow','width=600,height=600,menubar=0');NewWindow.focus()">
example@nowhere2.edu
</a>
<br/>
Website:
<a href="http://nowhere2.edu/" target="newwindow">
http://nowhere2.edu
</a>
<br/>
<br/>
...
...
</table>
...
...
</div>
</body>
</html>
The output I want:
My College
My Name
My Address
My City, XY 19604
My Country
Email:
example@nowhere.edu
Website:
http://www.nowhere.edu
His College
His Name
His Address
His City, YX 49506
His Country
Phone: XX-YY-ZZ
Email:
example@nowhere2.edu
Website:
http://www.nowhere2.edu
At first I tried:
from bs4 import BeautifulSoup

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

tables = soup.find_all('table', class_='width-max')
for table in tables:
    print(table.get_text())
It prints the text on separate lines, but it produces a lot of blank lines and white space:
My College
My Name
...
Then I tried:
from bs4 import BeautifulSoup

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

tables = soup.find_all('table', class_='width-max')
for table in tables:
    # split() with no arguments splits on any run of whitespace
    texts = ' '.join(table.text.split())
    print(texts)
It removes the blank lines and white space, but it merges all the text into a single line:
My College My Name My Address ... ... http://www.nowhere2.edu
Finally, I tried using the strip() and replace_with() methods to replace <br> with \n, but I still haven't managed to print the exact output.
Just change your print statement so that it adds a newline between the pieces of text.
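One possible way to do that (an assumption, not necessarily what this answer had in mind) is to let BeautifulSoup insert the newline itself through get_text(separator='\n', strip=True). In the sketch below, the helper name table_texts and the shortened sample HTML are made up for illustration, and html.parser stands in for lxml only so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical sample, shortened from the question's test.html.
SAMPLE_HTML = """
<table class="width-max">
<tr>
<td>
<a href="nowhere.com"><h2><b>My College</b></h2></a>
<h4>My Name<br/></h4>
My Address
<br/>
My City, XY 19604
<br/>
</td>
</tr>
</table>
"""


def table_texts(html):
    """Return one newline-separated text block per 'width-max' table."""
    soup = BeautifulSoup(html, 'html.parser')
    tables = soup.find_all('table', class_='width-max')
    # separator='\n' joins the text fragments with newlines;
    # strip=True trims each fragment and drops the empty ones.
    return [table.get_text(separator='\n', strip=True) for table in tables]


if __name__ == '__main__':
    for text in table_texts(SAMPLE_HTML):
        print(text)
```

Because each text fragment is stripped before joining, a line such as "My City, XY 19604" stays intact instead of being broken at its spaces.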
You need to clean up the value of table.get_text() so that each line is printed on its own. This can be done with two regular expressions.
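A minimal sketch of such a two-regex cleanup, under stated assumptions: the function name clean_table_text and the sample string are hypothetical, the first pattern is a guess built around a {3,} quantifier, and the final strip() is an addition of this sketch. Only the second substitution, "(\n)+" replaced by "\\n", is taken as written.

```python
import re


def clean_table_text(raw):
    """Collapse the blank lines in the text returned by table.get_text()."""
    # Drop the literal "..." placeholders.
    text = raw.replace('...', '')
    # ASSUMPTION: guessed pattern. Collapse any run of three or more line
    # breaks (optionally preceded by spaces or tabs) into a single one.
    text = re.sub(r'(?:[ \t]*\n){3,}', '\n', text)
    # Collapse whatever runs of newlines remain into a single newline.
    # In the replacement, "\\n" is the regex escape for a newline.
    text = re.sub("(\n)+", "\\n", text)
    # ASSUMPTION: trim the leading and trailing newline as well.
    return text.strip()


if __name__ == '__main__':
    # Made-up stand-in for the value of table.get_text().
    raw = "\n\n\n\nMy College\n\n\n\nMy Name\n\n\nMy Address\n\nMy City, XY 19604\n...\n...\n"
    print(clean_table_text(raw))
```

If the HTML source is indented, the lines keep their leading white space with these patterns, so each line may additionally need a strip().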
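This will output the text one line at a time.
The first regex, {3,}, removes every run of three or more blank lines; the second, "(\n)+", "\\n", replaces a run of several \n with a single one, so that print emits the data line by line. In addition, to match the expected output, get_text().replace('...', '') is used to remove the ... from the text.
Try joining with a newline instead of a space.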
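Applied to the question's second attempt, this suggestion amounts to swapping ' '.join(...) for '\n'.join(...). The sketch below runs it on a made-up string standing in for table.text (the helper name join_with_newlines is hypothetical). Note that str.split() with no arguments splits on every run of white space, not only on line breaks.

```python
def join_with_newlines(raw):
    """Join all whitespace-separated tokens of raw with newlines."""
    # split() with no arguments splits on spaces as well as newlines,
    # so "My College" ends up as two separate lines.
    return '\n'.join(raw.split())


if __name__ == '__main__':
    sample = "\n\nMy College\n\n\nMy Name\n"
    print(join_with_newlines(sample))
```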
Edit: the previous snippet splits lines that contain several words into one word per line, so a different approach is needed.
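One way to keep multi-word lines intact, offered as a sketch rather than as this answer's exact code, is to split on line breaks only, strip each line, and drop the empty ones. The helper name non_blank_lines and the sample string are hypothetical.

```python
def non_blank_lines(raw):
    """Return the non-empty lines of raw, stripped, joined by newlines."""
    # splitlines() breaks only at line boundaries, so spaces inside a
    # line such as "My City, XY 19604" are preserved.
    lines = (line.strip() for line in raw.splitlines())
    return '\n'.join(line for line in lines if line)


if __name__ == '__main__':
    # Made-up stand-in for the value of table.get_text().
    sample = "\n\n  My College \n\n\nMy City, XY 19604\n"
    print(non_blank_lines(sample))
```

In the question's loop this would be used as print(non_blank_lines(table.get_text())).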