Cleaning HTML content with Python

Posted 2024-09-28 19:06:54


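I am using an external API that sends me the text of HTML emails. The text has no enclosing HTML structure (e.g. <html>...</html>). I need to clean this text and post it to Slack. I have tried both BeautifulSoup and Bleach, and neither worked, possibly because the input is only a fragment of HTML.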

A sample of the input text:

&lt;div style=&#39;box-sizing:border-box;margin:0px 0px 24px;background-image:initial;background-position:initial;background-size:initial;background-repeat:initial;background-origin:initial;background-clip:initial;border:0px;padding:0px;vertical-align:baseline;color:rgb(51,51,51);font-family:Georgia,&quot;Bitstream Charter&quot;,serif;font-size:16px&#39;&gt;Bacon ipsum dolor amet cupim meatball ham hock pancetta ball tip ribeye cow brisket bresaola short ribs drumstick short loin. Turkey pastrami boudin andouille fatback tenderloin pork beef jowl rump hamburger buffalo capicola prosciutto. Meatball jerky pig filet mignon cow. Tenderloin flank tongue venison. Spare ribs fatback jerky pig boudin biltong filet mignon pancetta capicola.&lt;/div&gt;
&lt;div style=&#39;box-sizing:border-box;margin:0px 0px 24px;background-image:initial;background-position:initial;background-size:initial;background-repeat:initial;background-origin:initial;background-clip:initial;border:0px;padding:0px;vertical-align:baseline;color:rgb(51,51,51);font-family:Georgia,&quot;Bitstream Charter&quot;,serif;font-size:16px&#39;&gt;Jerky salami brisket, landjaeger beef ribs meatball swine alcatra. Pork chop doner kielbasa jowl biltong tri-tip. Sausage sirloin prosciutto ribeye meatball capicola andouille picanha rump bacon turkey kevin pancetta landjaeger jowl. Spare ribs burgdoggen landjaeger buffalo capicola cow corned beef flank frankfurter boudin salami t-bone doner. Kevin filet mignon ribeye, pork belly andouille chuck pig drumstick. Short ribs tri-tip ball tip rump flank.&lt;/div&gt;
&lt;div style=&#39;box-sizing:border-box;margin:0px 0px 24px;background-image:initial;background-position:initial;background-size:initial;background-repeat:initial;background-origin:initial;background-clip:initial;border:0px;padding:0px;vertical-align:baseline;color:rgb(51,51,51);font-family:Georgia,&quot;Bitstream Charter&quot;,serif;font-size:16px&#39;&gt;Pig biltong doner fatback. Tail hamburger kielbasa pastrami buffalo boudin cupim, pig jerky prosciutto venison pork chop chuck sirloin kevin. Bresaola bacon drumstick ball tip salami ribeye capicola beef ribs. Meatball tenderloin drumstick bresaola rump short ribs. Salami venison chuck burgdoggen.&lt;/div&gt;
&lt;div style=&#39;box-sizing:border-box;margin:0px 0px 24px;background-image:initial;background-position:initial;background-size:initial;background-repeat:initial;background-origin:initial;background-clip:initial;border:0px;padding:0px;vertical-align:baseline;color:rgb(51,51,51);font-family:Georgia,&quot;Bitstream Charter&quot;,serif;font-size:16px&#39;&gt;Strip steak ham prosciutto, biltong meatball kielbasa boudin shankle ground round bacon. Alcatra short loin chuck shankle hamburger shank, buffalo sausage turkey prosciutto tongue kielbasa venison. Shank cow turducken beef ribs meatloaf pork belly. Pastrami leberkas ball tip pancetta short loin sirloin turducken rump hamburger cupim strip steak ground round brisket filet mignon pork. Beef shankle kevin tail picanha bacon beef ribs cow ground round pig ham rump. Bresaola spare ribs tenderloin pastrami, ham jowl short loin hamburger shankle tail venison pig meatloaf.&lt;/div&gt;

For the input above, I would like the following output:

Bacon ipsum dolor amet cupim meatball ham hock pancetta ball tip ribeye cow brisket bresaola short ribs drumstick short loin. Turkey pastrami boudin andouille fatback tenderloin pork beef jowl rump hamburger buffalo capicola prosciutto. Meatball jerky pig filet mignon cow. Tenderloin flank tongue venison. Spare ribs fatback jerky pig boudin biltong filet mignon pancetta capicola.
Jerky salami brisket, landjaeger beef ribs meatball swine alcatra. Pork chop doner kielbasa jowl biltong tri-tip. Sausage sirloin prosciutto ribeye meatball capicola andouille picanha rump bacon turkey kevin pancetta landjaeger jowl. Spare ribs burgdoggen landjaeger buffalo capicola cow corned beef flank frankfurter boudin salami t-bone doner. Kevin filet mignon ribeye, pork belly andouille chuck pig drumstick. Short ribs tri-tip ball tip rump flank.
Pig biltong doner fatback. Tail hamburger kielbasa pastrami buffalo boudin cupim, pig jerky prosciutto venison pork chop chuck sirloin kevin. Bresaola bacon drumstick ball tip salami ribeye capicola beef ribs. Meatball tenderloin drumstick bresaola rump short ribs. Salami venison chuck burgdoggen.
Strip steak ham prosciutto, biltong meatball kielbasa boudin shankle ground round bacon. Alcatra short loin chuck shankle hamburger shank, buffalo sausage turkey prosciutto tongue kielbasa venison. Shank cow turducken beef ribs meatloaf pork belly. Pastrami leberkas ball tip pancetta short loin sirloin turducken rump hamburger cupim strip steak ground round brisket filet mignon pork. Beef shankle kevin tail picanha bacon beef ribs cow ground round pig ham rump. Bresaola spare ribs tenderloin pastrami, ham jowl short loin hamburger shankle tail venison pig meatloaf.

With Bleach, I used the following simple function:

import bleach

def textify(html):
    text = bleach.clean(html)
    return text

With BeautifulSoup, I also used a few regular expressions to clean up the output:

import re
from bs4 import BeautifulSoup

def textify(html):
    html = re.sub('<br>', '\n', html)
    soup = BeautifulSoup(html, 'html.parser')
    text = soup.getText()
    text = re.sub(r'\&lt;', '<', text)
    text = re.sub(r'\&gt\;', '>', text)
    text = re.sub(r'\&\#39\;', "'", text)
    return text

1 answer
User
#1 · Posted 2024-09-28 19:06:54

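Before passing the string to Bleach or BeautifulSoup, you first need to unescape it with the standard library's html module. The input is entity-escaped (&lt;div ...&gt; rather than <div ...>), so both libraries see it as plain text with no tags to remove: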

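from html import unescape

import bleach
from bs4 import BeautifulSoup

html = "&lt;div style=&#39;bo...div&gt;"
# Turn the escaped entities back into real markup first
unescaped_html = unescape(html)

# strip=True removes disallowed tags instead of escaping them again
text = bleach.clean(unescaped_html, strip=True)
soup = BeautifulSoup(unescaped_html, "html.parser")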
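
To illustrate the same two-step idea (unescape first, then strip the tags) without third-party packages, here is a minimal sketch that swaps in the standard library's html.parser.HTMLParser in place of Bleach or BeautifulSoup. The class name TextExtractor, the shortened sample string, and the choice to break lines at <br> and </div> are assumptions made for illustration; they are not part of the original answer.

```python
from html import unescape
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes, breaks lines at <br> and </div>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "div":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def textify(escaped_html):
    # Step 1: turn &lt;div ...&gt; back into real <div ...> markup.
    markup = unescape(escaped_html)
    # Step 2: parse the fragment and keep only the text nodes.
    extractor = TextExtractor()
    extractor.feed(markup)
    extractor.close()
    # Step 3: drop blank lines so each paragraph ends up on its own line.
    lines = (line.strip() for line in "".join(extractor.parts).splitlines())
    return "\n".join(line for line in lines if line)


# Shortened stand-in for the escaped input shown in the question.
sample = (
    "&lt;div style=&#39;color:rgb(51,51,51)&#39;&gt;Bacon ipsum.&lt;/div&gt;\n"
    "&lt;div style=&#39;color:rgb(51,51,51)&#39;&gt;Jerky salami.&lt;/div&gt;"
)
print(textify(sample))
```

With the shortened sample this prints the two paragraphs on separate lines with no markup. If BeautifulSoup is already installed, calling get_text() on the soup built from the unescaped string should serve the same purpose.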
