Is it possible to use Beautiful Soup to programmatically combine the contents of certain HTML tags?


I am using a program called Calibre to convert PDF files to EPUB files, but the result is messy and hard to read. An EPUB file is really just a collection of HTML files, and the conversion output is messy because Calibre interprets every line of the PDF as a separate element, which creates many ugly line breaks in the EPUB file.

Since an EPUB is a collection of HTML files, it can be parsed with Beautiful Soup. However, the program I wrote does not work, and I cannot figure out why. It is supposed to find elements with the "calibre1" class (normal paragraphs) and combine them into a single element, so that the ugly line breaks disappear.

Can Beautiful Soup handle what I am trying to do?

<pre><code>import os
from bs4 import BeautifulSoup

path = "C:\\Users\\Eunice\\Desktop\\eBook"

for pathname, directorynames, filenames in os.walk(path):
    # Get all HTML files in the target directory
    for file_name in filenames:
        # Open each HTML file, which is encoded using the "Latin1" encoding scheme
        with open(pathname + "\\" + file_name, 'r', encoding="Latin1") as file:
            # Create a list, which we will write our new HTML tags to later
            html_elem_list: list = []
            # Create a BS4 object
            soup = BeautifulSoup(file, 'html.parser')
            # Create a list of all BS4 elements, which we will traverse in the proceeding loop
            html_elements = [x for x in soup.find_all()]
            for html_element in html_elements:
                try:
                    # Find the element with a class called "calibre1," which is how Calibre designates normal body text in a book
                    if html_element.attrs['class'][0] in 'calibre1':
                        # Combine the next element with the previous element if both elements are part of the same body text
                        if html_elem_list[-1].attrs['class'][0] in 'calibre1':
                            # Remove nonbreaking spaces from this element before adding it to our list of elements
                            html_elem_list[-1].string = html_elem_list[-1].text.replace(
                                '\n', '&nbsp;') + html_element.text
                    # This element must not be of the "calibre1" class, so add it to the list of elements without combining it with the previous element
                    else:
                        html_elem_list.append(html_element)
                # This element must not have any class, so add it to the list of elements without combining it with the previous element
                except KeyError:
                    html_elem_list.append(html_element)

            # Create a string literal, which we will eventually write to our resultant file
            str_htmlfile = ''
            # For each element in the list of HTML elements, append the string representation of
            # that element (which will be a line of HTML code) to the string literal
            for elem in html_elem_list:
                str_htmlfile = str_htmlfile + str(elem)

        # Create a new file with a distinct variation of the name of the original file, then write the resultant HTML code to that file
        with open(pathname + "\\" + '_modified_' + file_name, 'wb') as file:
            file.write(str_htmlfile.encode('Latin1'))
</code></pre>

Here is the input:

<pre><code><?xml version='1.0' encoding='Latin1'?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<body class="calibre">
Note for Tyler
In the California registry, there was a calm breeze blowing through the room. A woman who must have just walked in quietly beckoned for the counterman to approach to store her slip.
642
</body></html>
</code></pre>

Here is what I expected to happen:

<pre><code><?xml version='1.0' encoding='Latin1'?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<body class="calibre">
Note for Tyler
In the California registry, there was a calm breeze blowing through the room. A woman who must have just walked in quietly beckoned for the counterman to approach to store her slip.642
</body></html>
</code></pre>

Here is the actual output:

<pre><code><html lang="" xml:lang="" xmlns="http://www.w3.org/1999/xhtml">
<body class="calibre">
Note for Tyler
In the California registry, there was a calm breeze blowing through the room. A woman who must have just walked in quietly beckoned for the counterman to approach to store her slip.
642
</body></html><body class="calibre">
Note for Tyler
In the California registry, there was a calm breeze blowing through the room. A woman who must have just walked in quietly beckoned for the counterman to approach to store her slip.
642
</body>Note for Tyler
</code></pre>


1 Answer

Anonymous, 1 day ago


This can be done with BeautifulSoup by using <code>extract()</code> to remove the unwanted <code>p</code> elements and then using <code>new_tag()</code> to create a new <code>p</code> tag that contains the text of all the removed elements. For example:

<pre><code>from itertools import groupby
import re

from bs4 import BeautifulSoup

html = """<?xml version='1.0' encoding='Latin1'?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<body class="calibre">
Note for Tyler1
In the California registry, there was a calm breeze blowing through the room. A woman who must have just walked in quietly beckoned for the counterman to approach to store her slip.
642
Note for Tyler2
In the California registry, there was a calm breeze blowing through the room. A woman who must have just walked in quietly beckoned for the counterman to approach to store her slip.
642
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Group consecutive p tags by their first class name (calibre1, calibre2, ...)
for level, group in groupby(soup.find_all("p", class_=re.compile(r"calibre\d")), lambda x: x["class"][0]):
    if level == "calibre1":
        calibre1 = list(group)
        # Build one replacement paragraph holding the text of the whole run
        p_new = soup.new_tag('p', attrs={"class" : "calibre1"})
        p_new.string = ' '.join(p.get_text(strip=True) for p in calibre1)
        calibre1[0].insert_before(p_new)

        # Remove the original paragraphs of the run
        for p in calibre1:
            p.extract()

print(soup.prettify())
</code></pre>

This gives the following HTML:

<pre class="lang-html prettyprint-override"><code><?xml version='1.0' encoding='Latin1'?>
<html lang="" xml:lang="" xmlns="http://www.w3.org/1999/xhtml">
<body class="calibre">
Note for Tyler1
In the California registry, there was a calm breeze blowing through the room. A woman who must have just walked in quietly beckoned for the counterman to approach to store her slip.
642
Note for Tyler2
In the California registry, there was a calm breeze blowing through the room. A woman who must have just walked in quietly beckoned for the counterman to approach to store her slip.
642
</body>
</html>
</code></pre>

It works by looking for runs of consecutive <code>calibre1</code> tags. For each run, it first joins the text of all the tags in the run and inserts a new tag holding that text before the first tag of the run. It then removes all the old tags.

The logic may need to be adapted for more complex scenarios in your EPUB files, but this should help you get started.
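As a rough sketch of the same run-merging idea wrapped in a reusable function: the function name `merge_runs`, its `target_class` parameter, and the `calibre5` / `calibre6` class names in the sample markup are assumptions made for illustration, not something the thread confirms about Calibre's output. It also assumes the paragraphs of a run are adjacent `p` tags; other elements sitting between them are not detected.

```python
from itertools import groupby

from bs4 import BeautifulSoup


def merge_runs(html: str, target_class: str = "calibre1") -> str:
    """Merge each run of consecutive <p> tags carrying target_class into one <p>.

    Hypothetical helper: the name and signature are illustrative only.
    """
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = soup.find_all("p")

    # groupby only groups neighbouring items, so every group is one run of
    # <p> tags that either all have the target class or all lack it.
    for is_target, group in groupby(
        paragraphs, key=lambda p: target_class in p.get("class", [])
    ):
        run = list(group)
        if not is_target or len(run) < 2:
            continue  # nothing to merge

        # Create the replacement paragraph with the joined text of the run.
        merged = soup.new_tag("p", attrs={"class": target_class})
        merged.string = " ".join(p.get_text(strip=True) for p in run)
        run[0].insert_before(merged)

        # Detach the original paragraphs from the tree.
        for p in run:
            p.extract()

    return str(soup)


# Sample markup; the class names calibre5 and calibre6 are assumed.
sample = (
    '<body class="calibre">'
    '<p class="calibre5">Note for Tyler</p>'
    '<p class="calibre1">In the California registry, there was a calm breeze</p>'
    '<p class="calibre1">blowing through the room.</p>'
    '<p class="calibre6">642</p>'
    "</body>"
)

result = merge_runs(sample)
print(result)
```

Unlike looping over `soup.find_all()` with no arguments, which also yields the enclosing `html` and `body` elements and so repeats their whole content in the output, this version only visits `p` tags and edits the tree in place before serializing it once.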
