How do I parse the HTML-encoded text in every row of a CSV column?

Posted 2024-10-02 20:36:46



Here's an image of how the data looks in the 'content' column

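I have loaded a CSV file into pandas. In the "Content" column, every row contains HTML-encoded text of varying length, and some rows run to more than 500 words. My goal is to strip all of the HTML markup from every row of the "content" column.

Could someone help me with the code?

So far this is all I have: dataset = pd.read_csv('NuggetData.csv')

"Content" is the 9th column in the table (counting the first column as 0), and there are about 17,000 rows.

Sample text from the content column (by the way, this is not even the full text of row 1, which is longer still):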

Row 1: <h2>A bold new toy commercial debuted last week, and it's got the internet talking.</h2><div><div data-reactroot="" class="push-wrapper--mobile" data-card="image"><img src="//i.upworthy.com/nugget/57e9536dca7292001f000008/attachments/toygif1-65977b573530a2407626f8a4aad22a4e.gif" class=""><div class="image-caption"><p>GIFs via Smyths Toys.</p></div></div></div><h2>In some ways, it was pretty standard because a boy's love for rocket ships isn't all that unique.</h2><div><div data-reactroot="" class="push-wrapper--mobile" data-card="image"><img src="//i.upworthy.com/nugget/57e953b8e2d8c7001f00002d/attachments/toygif2-6ef9ddacf2a56c63a84d773645450563.gif" class=""></div></div><h2>Neither is his love of Legos.</h2><div><div data-reactroot="" class="push-wrapper--mobile" data-card="image"><img src="//i.upworthy.com/nugget/57e95558e2d8c7002b000025/attachments/toygif4-4f0829dad2602f7dd6ed52813e6791a5.gif" class=""></div></div><h2>Plenty of boys like to (pretend to) drive motorcycles, too.</h2><div><div data-reactroot="" class="push-wrapper--mobile" data-card="image"><img src="//i.upworthy.com/nugget/57e95595ca72920034000029/attachments/toygif5-e1824fae63099796ac2947ba76ea185d.gif" class=""></div></div><h2>But ... playing dress-up as a queen in front of a crowd of cheering supporters?</h2><div><div data-reactroot="" class="push-wrapper--mobile" data-card="image"><img src="//i.upworthy.com/nugget/57e954c0e2d8c7002d00001e/attachments/toygif3-21ea60c5917fd80da817919c655a4c96.gif" class=""></div></div><p><em>That's</em> extraordinary. </p><h2>


1 answer

User

#1 · Posted 2024-10-02 20:36:46

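I suggest using the BeautifulSoup library and a list comprehension to parse your content column.

First, you need to decide what you want to extract from the HTML. I made a few assumptions for the sake of illustration: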

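You are looking for content inside DIV tags (find_all('div'))
You want the text contained in that tag (.text)
You want the text from the third DIV tag ([2])

from bs4 import BeautifulSoup as bs

# Parse each row's HTML and keep the text of its third <div> (index 2).
# Note: a row with fewer than three <div> tags raises IndexError here.
dataset['parsed_content'] = [bs(x, 'lxml').find_all('div')[2].text for x in dataset['content']]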

The code above adds a new column to the DataFrame; the original content column is never modified.
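If the goal is to drop all of the markup rather than pick out a single DIV, one possible variation is to take the text of the whole fragment with get_text(). The following is only a sketch under a few assumptions: the column is named "Content", empty cells should be passed through untouched, and the helper name strip_html and the two-row DataFrame are made up for illustration. It uses Python's built-in "html.parser" backend instead of lxml.

```python
import pandas as pd
from bs4 import BeautifulSoup


def strip_html(value):
    """Return the visible text of an HTML fragment; non-strings pass through."""
    if not isinstance(value, str):
        return value  # e.g. None/NaN for empty cells
    soup = BeautifulSoup(value, "html.parser")
    # separator=" " keeps text from adjacent tags apart;
    # split/join then collapses runs of whitespace.
    return " ".join(soup.get_text(separator=" ").split())


# Tiny stand-in for the real file; the column name "Content" is an assumption.
dataset = pd.DataFrame({
    "Content": [
        "<h2>Neither is his love of Legos.</h2><div><p>GIFs via Smyths Toys.</p></div>",
        None,
    ]
})

# Add a new column and leave the original one as it is.
dataset["clean_content"] = dataset["Content"].map(strip_html)
print(dataset["clean_content"].tolist())
```

One trade-off to be aware of: inserting a space between text nodes also splits words that are broken up by inline tags (for example <b>H</b>ello becomes "H ello"), so the separator may need adjusting for your data.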

You can install the BeautifulSoup and lxml dependencies with pip (pip install beautifulsoup4 lxml).
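If installing extra packages is not an option, a rough alternative is to collect the text nodes with the standard library's html.parser module. This is a different technique from the BeautifulSoup approach above, shown only as a minimal sketch; the names TextExtractor and strip_html_stdlib, the sample string, and its example.invalid URL are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; tags and attributes are ignored.
        self.parts.append(data)


def strip_html_stdlib(html):
    """Return the text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


sample = (
    '<h2>Neither is his love of Legos.</h2>'
    '<div><img src="//example.invalid/toy.gif" class=""></div>'
    "<p><em>That's</em> extraordinary. </p>"
)
print(strip_html_stdlib(sample))
```

With pandas this could be applied per row in the same way as any other function, for instance through Series.map. Note that this simple version also keeps the contents of script and style elements, which BeautifulSoup-based cleaning can handle more selectively.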
