How can I convert HTML files into readable txt files with Python?

Posted 2024-10-01 07:10:49


I have a lot of HTML files that look like this:

<font face="Garmond,Helvetica,Times" size="2" color="#330066">
<b>
 Summary:
</b>
&nbsp;According to the complaint filed January 04, 2011, over a six-week period in December 2007 and January 2008, six healthcare related hedge funds managed by Defendant FrontPoint Partners LLC ("FrontPoint") sold more than six million shares of Human Genome Sciences, Inc. ("HGSI") common stock while their portfolio manager possessed material negative non-public information concerning the HGSI's clinical trial for the drug Albumin Interferon Alfa 2-a.
<br />
<br />
On March 2, 2011, the plaintiffs filed a First Amended Class Action Complaint, amending the named defendants and securities violations.  On March 22, 2011, a motion for appointment as lead plaintiff and for approval of selection of lead counsel was filed.  The defendants responded to the First Amended Complaint by filing a motion to dismiss on March 28, 2011.
<br />
<hr width="50%" align="left" />
INDUSTRY CLASSIFICATION:
<br />
<b>
 SIC Code:
</b>
0000
<br />
<b>
 Sector:
</b>
N/A
<br />
<b>
 Industry:
</b>
N/A
<br />
</font>

What I want to do is take the text from the middle of the file and convert it into a human-readable format. In this example, that is:

 According to the complaint filed January 04, 2011, over a six-week period in December 2007 and January 2008, six healthcare related hedge funds managed by Defendant FrontPoint Partners LLC ("FrontPoint") sold more than six million shares of Human Genome Sciences, Inc. ("HGSI") common stock while their portfolio manager possessed material negative non-public information concerning the HGSI's clinical trial for the drug Albumin Interferon Alfa 2-a.

On March 2, 2011, the plaintiffs filed a First Amended Class Action Complaint, amending the named defendants and securities violations. On March 22, 2011, a motion for appointment as lead plaintiff and for approval of selection of lead counsel was filed. The defendants responded to the First Amended Complaint by filing a motion to dismiss on March 28, 2011.

I know I have to do three things:

  1. Take the text out of the middle of the file
  2. Replace "<br />" with "\n"
  3. Replace "&nbsp;" with " " (a single space)

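I know the last two are easy, just a matter of using the replace method in Python, but I don't know how to achieve the first one.

I know a little about regular expressions and BeautifulSoup, but I don't know how to apply them to this problem.

Can anyone help me?

Thanks, and sorry for my poor English.

@Paul: I only want the Summary section. My teacher (who doesn't know much about computers) gave me a lot of HTML files and asked me to convert them into a format suitable for data mining (my teacher is trying to do this with SAS). I don't know SAS, but I suppose it can process lots of txt files, so I want to convert these HTML files into plain txt files.

@Owen: I need to process a lot of HTML files, and I don't think this problem is too hard, so I want to solve it directly in Python.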


3 answers

You could use Scrapely.

Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.

http://github.com/scrapy/scrapely

The closest approach is to convert the HTML to structured text, which can be tried with an online converter. Its output looks like this:

 **Summary:** According to the complaint filed January 04, 2011, over a
six-week period in December 2007 and January 2008, six healthcare
related hedge funds managed by Defendant FrontPoint Partners LLC
(“FrontPoint”) sold more than six million shares of Human Genome
Sciences, Inc. (“HGSI”) common stock while their portfolio manager
possessed material negative non-public information concerning the HGSI’s
clinical trial for the drug Albumin Interferon Alfa 2-a.
 On March 2, 2011, the plaintiffs filed a First Amended Class Action
Complaint, amending the named defendants and securities violations. On
March 22, 2011, a motion for appointment as lead plaintiff and for
approval of selection of lead counsel was filed. The defendants
responded to the First Amended Complaint by filing a motion to dismiss
on March 28, 2011.

       

INDUSTRY CLASSIFICATION:
 **SIC Code:** 0000
 **Sector:** N/A
 **Industry:** N/A
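
A conversion of this kind can also be approximated locally. The following is a minimal sketch using only the standard library's `html.parser`, not the converter this answer refers to. The names `TextRenderer` and `html_to_text` are hypothetical, and the rules (bold text wrapped in `**`, `<br>` and `<hr>` turned into line breaks, all other tags ignored) are assumptions chosen to resemble the output above.

```python
from html.parser import HTMLParser


class TextRenderer(HTMLParser):
    """Collect page text, marking bold runs and turning <br>/<hr> into newlines."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; before handle_data.
        super().__init__(convert_charrefs=True)
        self.tokens = []
        self.bold_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "strong"):
            self.bold_depth += 1
        elif tag in ("br", "hr"):
            self.tokens.append("\n")

    def handle_endtag(self, tag):
        if tag in ("b", "strong") and self.bold_depth:
            self.bold_depth -= 1

    def handle_data(self, data):
        # str.split() also splits on the non-breaking space produced by &nbsp;.
        text = " ".join(data.split())
        if text:
            self.tokens.append(f"**{text}**" if self.bold_depth else text)

    def render(self):
        lines = " ".join(self.tokens).split("\n")
        return "\n".join(line.strip() for line in lines).strip()


def html_to_text(page):
    """Hypothetical helper: render an HTML string as lightly marked-up text."""
    renderer = TextRenderer()
    renderer.feed(page)
    renderer.close()  # flush any text still buffered by the parser
    return renderer.render()
```

Calling `html_to_text` on the contents of one of the files should give text laid out roughly like the sample above, although line wrapping and quote characters will differ.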

To accomplish this task, you can use a Python library called lxml.

  • First, download and install lxml.

Now try running the following code:

from lxml.html import fromstring

html = '''
<font face="Garmond,Helvetica,Times" size="2" color="#330066">
<b>
 Summary:
</b>
&nbsp;According to the complaint filed January 04, 2011, over a six-week period in December 2007 and January 2008, six healthcare related hedge funds managed by Defendant FrontPoint Partners LLC ("FrontPoint") sold more than six million shares of Human Genome Sciences, Inc. ("HGSI") common stock while their portfolio manager possessed material negative non-public information concerning the HGSI's clinical trial for the drug Albumin Interferon Alfa 2-a.
<br />
<br />
On March 2, 2011, the plaintiffs filed a First Amended Class Action Complaint, amending the named defendants and securities violations.  On March 22, 2011, a motion for appointment as lead plaintiff and for approval of selection of lead counsel was filed.  The defendants responded to the First Amended Complaint by filing a motion to dismiss on March 28, 2011.
<br />
<hr width="50%" align="left" />
INDUSTRY CLASSIFICATION:
<br />
<b>
 SIC Code:
</b>
0000
<br />
<b>
 Sector:
</b>
N/A
<br />
<b>
 Industry:
</b>
N/A
<br />
</font>
'''

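# Parse the fragment and flatten it to plain text (tags dropped, entities decoded).
htmlElement = fromstring(html)
textContent = htmlElement.text_content()
# Keep only the text between the "Summary:" label and "INDUSTRY CLASSIFICATION:".
result = textContent.split('\n\n Summary:\n\n')[1].split('\n\nINDUSTRY CLASSIFICATION:\n\n')[0]

print(result)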

This works as long as the text you want comes after "\n\n Summary:\n\n" and before "\n\nINDUSTRY CLASSIFICATION:\n\n".
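
The three steps listed in the question can also be sketched with the standard library alone. This is a hedged example, not the approach of any answer above. It assumes every file has the same layout as the sample, with the summary sitting between the `</b>` that closes "Summary:" and the first `<hr>`. The names `extract_summary` and `convert_folder` are hypothetical, as is the choice of UTF-8 for reading and writing.

```python
import html
import re
from pathlib import Path


def extract_summary(page: str) -> str:
    """Return the text between the 'Summary:' label and the first <hr> tag."""
    # Step 1: take the middle of the file.
    match = re.search(r"Summary:\s*</b>(.*?)<hr\b", page,
                      flags=re.DOTALL | re.IGNORECASE)
    if match is None:
        raise ValueError("no Summary section found")
    # Source line breaks carry no meaning in HTML, so turn them into spaces.
    body = match.group(1).replace("\n", " ")
    # Step 2: each <br /> becomes a newline.
    body = re.sub(r"<br\s*/?>", "\n", body, flags=re.IGNORECASE)
    # Step 3: decode entities, then turn the non-breaking space into a plain space.
    body = html.unescape(body).replace("\xa0", " ")
    # Tidy up: collapse runs of spaces and trim every line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in body.split("\n")]
    return "\n".join(lines).strip()


def convert_folder(src_dir: str, dst_dir: str) -> int:
    """Write one .txt file per .html file in src_dir; return how many were written."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(Path(src_dir).glob("*.html")):
        page = path.read_text(encoding="utf-8", errors="replace")
        (out / (path.stem + ".txt")).write_text(extract_summary(page),
                                                encoding="utf-8")
        count += 1
    return count
```

A call such as `convert_folder("html_files", "txt_files")` (both folder names are placeholders) would then process a whole batch. A regular expression is enough here only because the files share one fixed layout; for less regular HTML, a real parser such as lxml or BeautifulSoup is the safer choice.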
