How do I print text line by line with BeautifulSoup?

Posted 2024-10-03 23:18:58


This is part of a sample test.html file:

<html>
<body>
<div>
...
...
<table class="width-max">
            <tr>
             <td style="max-width: 300px; min-width:300px;">
              <a href="nowhere.com">
               <h2>
                <b>
                 <font size="3">
                  My College
                 </font>
                </b>
               </h2>
              </a>
              <h4>
               <font size="2">
                My Name
               </font>
               <br/>
              </h4>
              My Address
              <br/>
              My City, XY 19604
              <br/>
              My Country
              <br/>
              <br/>
              Email:
              <a href="javascript:NewWindow=window.open('nowhere.com;email=example@nowhere.edu','NewWindow','width=600,height=600,menubar=0');NewWindow.focus()">
               example@nowhere.edu
              </a>
              <br/>
              Website:
              <a href="http://www.nowhere.edu" target="newwindow">
               http://www.nowhere.edu
              </a>
              <br/>
              <br/>
              <br/>
             </td>
              ...
              ...
</table>
<hr/>
<table class="width-max">
            <tr>
             <td style="max-width: 300px; min-width:300px;">
              <a href="nowhere.com">
               <h2>
                <b>
                 <font size="3">
                  His College
                 </font>
                </b>
               </h2>
              </a>
              <h4>
               <font size="2">
                His name
               </font>
               <br/>
              </h4>
              His Address
              <br/>
              His City, YX 49506
              <br/>
              His Country
              <br/>
              <br/>
              Phone: XX-YY-ZZ
              <br/>
              Email:
              <a href="javascript:NewWindow=window.open('nowhere.com;email=example@nowhere2.edu','NewWindow','width=600,height=600,menubar=0');NewWindow.focus()">
               example@nowhere2.edu
              </a>
              <br/>
              Website:
              <a href="http://nowhere2.edu/" target="newwindow">
               http://nowhere2.edu
              </a>
              <br/>
              <br/>
              ...
              ...
</table>
...
...
</div>
</body>
</html>

The output I want:

My College
My Name
My Address
My City, XY 19604
My Country
Email:
example@nowhere.edu
Website:
http://www.nowhere.edu

His College
His Name
His Address
His City, YX 49506
His Country
Phone: XX-YY-ZZ
Email:
example@nowhere2.edu
Website:
http://www.nowhere2.edu

At first I tried:

from bs4 import BeautifulSoup

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

    tables = soup.find_all('table', class_='width-max')

    for table in tables:
        print(table.get_text())

It prints the text on separate lines, but produces a lot of blank lines and white space:



         My College

      My Name
...

Then I tried:

from bs4 import BeautifulSoup

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')

    for table in tables:
        texts = ' '.join(table.text.split())
        print(texts)

It removes the blank lines and white space, but merges all the text into a single line:

My College My Name My Address ... ... http://www.nowhere2.edu

Finally, I tried the strip() method, and also tried replace_with() to replace <br> with \n. But I still have not managed to print the exact output.


3 answers

Just change your print statement and add a newline there, like this:

print('\n' + texts)

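You need to clean up the value returned by table.get_text() so that each line is printed on its own.
With two regular expressions, you can do it like this: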

from bs4 import BeautifulSoup
import re

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')

    for table in tables:
        print(re.sub(r"(\n)+", r"\n", re.sub(r" {3,}", "", table.get_text().replace('...', ''))) , end="")

This outputs:

My College
My Name
My Address
My City, XY 19604
My Country
Email:
example@nowhere.edu
Website:
http://www.nowhere.edu    

His College
His name
His Address
His City, YX 49506
His Country
Phone: XX-YY-ZZ
Email:
example@nowhere2.edu
Website:
http://nowhere2.edu

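The first regular expression, r" {3,}", removes every run of three or more consecutive spaces, which strips the indentation. The second, r"(\n)+" replaced with "\n", collapses consecutive newlines into a single newline, so print() emits the data line by line.
In addition, to match the expected output, get_text().replace('...', '') was added to remove the ... from the text.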
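To see what each step does on its own, here is a small sketch that applies the same three operations to a hypothetical string standing in for the value of table.get_text(). The string and the function name clean_text are made up for illustration; the real text from test.html will differ.

```python
import re

# Hypothetical stand-in for table.get_text(); not the real file contents
raw = "\n\n\n         My College\n\n\n      My Name\n   \n      My Address\n...\n"


def clean_text(text):
    """Apply the answer's three steps one at a time."""
    # Step 1: drop the literal "..." placeholders
    text = text.replace("...", "")
    # Step 2: remove runs of three or more spaces (the indentation);
    # single spaces inside "My College" are left alone
    text = re.sub(r" {3,}", "", text)
    # Step 3: collapse consecutive newlines into one
    return re.sub(r"(\n)+", "\n", text)


cleaned = clean_text(raw)
print(cleaned, end="")
```

Note that indentation made of fewer than three spaces, or of tabs, would survive step 2, so this approach depends on how the source file is indented.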

Try joining with a newline instead of a space:

from bs4 import BeautifulSoup
with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')
    for table in tables:
        texts = '\n'.join(table.text.split())
        print(texts)

Edit: the previous snippet splits lines that contain several words into one word per line. Try the following instead:

import os
from bs4 import BeautifulSoup

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')
    for table in tables:
        # Skip tables whose text is whitespace only
        if not table.get_text().isspace():
            # Strip each line and drop lines that are empty or whitespace only
            lines = [l.strip() for l in table.get_text().splitlines() if l.strip()]
            print(os.linesep.join(lines))
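Another option, not taken from the answers above, is to let BeautifulSoup do the whitespace cleanup itself. The sketch below uses the stripped_strings generator, which yields each text node with surrounding whitespace removed and skips nodes that are whitespace only. SAMPLE_HTML is a hypothetical, shortened stand-in for test.html, the function name tables_as_text is made up for illustration, and the built-in "html.parser" is used here instead of lxml only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for test.html
SAMPLE_HTML = """
<table class="width-max">
  <tr>
    <td>
      <a href="nowhere.com"><h2><b><font size="3">
        My College
      </font></b></h2></a>
      <h4><font size="2"> My Name </font><br/></h4>
      My Address
      <br/>
      My City, XY 19604
      <br/>
      Email:
      <a href="#"> example@nowhere.edu </a>
    </td>
  </tr>
</table>
<hr/>
<table class="width-max">
  <tr>
    <td>
      <h4><font size="2"> His name </font></h4>
      Phone: XX-YY-ZZ
      <br/>
    </td>
  </tr>
</table>
"""


def tables_as_text(html):
    """Return one newline-joined block of text per matching table."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for table in soup.find_all("table", class_="width-max"):
        # One stripped text node per line; whitespace-only nodes are skipped
        blocks.append("\n".join(table.stripped_strings))
    return blocks


if __name__ == "__main__":
    # Separate the tables with a blank line, as in the desired output
    print("\n\n".join(tables_as_text(SAMPLE_HTML)))
```

Because each text node becomes its own line, a label and the link that follows it (such as "Email:" and the address) land on separate lines, which matches the layout the question asks for. table.get_text("\n", strip=True) should give the same result in a single call.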
