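I am using a program called Calibre to convert PDF files to EPUB files, but the results are messy and unreadable. An EPUB file is really just a collection of HTML files, and the conversion comes out messy because Calibre interprets each line of the PDF as a separate paragraph (<p>) element, which creates a lot of ugly line breaks in the EPUB file.
Since an EPUB is really a collection of HTML files, it can be parsed with BeautifulSoup. However, the program I wrote does not work, and I cannot figure out why. It is supposed to look for elements with the "calibre1" class (a normal paragraph) and combine them into a single element, so that there are no ugly line breaks.
Can BeautifulSoup handle what I am trying to do?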
import os
from bs4 import BeautifulSoup

path = "C:\\Users\\Eunice\\Desktop\\eBook"

for pathname, directorynames, filenames in os.walk(path):
    # Get all HTML files in the target directory
    for file_name in filenames:
        # Open each HTML file, which is encoded using the "Latin1" encoding scheme
        with open(pathname + "\\" + file_name, 'r', encoding="Latin1") as file:
            # Create a list, which we will write our new HTML tags to later
            html_elem_list: list = []
            # Create a BS4 object
            soup = BeautifulSoup(file, 'html.parser')
            # Create a list of all BS4 elements, which we will traverse in the following loop
            html_elements = [x for x in soup.find_all()]
            for html_element in html_elements:
                try:
                    # Find the element with a class called "calibre1," which is how Calibre designates normal body text in a book
                    if html_element.attrs['class'][0] in 'calibre1':
                        # Combine the next element with the previous element if both elements are part of the same body text
                        if html_elem_list[-1].attrs['class'][0] in 'calibre1':
                            # Replace newlines with spaces in the previous element before appending this element's text
                            html_elem_list[-1].string = html_elem_list[-1].text.replace(
                                '\n', ' ') + html_element.text
                    # This element must not be of the "calibre1" class, so add it to the list of elements without combining it with the previous element
                    else:
                        html_elem_list.append(html_element)
                # This element must not have any class, so add it to the list of elements without combining it with the previous element
                except KeyError:
                    html_elem_list.append(html_element)
        # Create a string, which we will eventually write to our resultant file
        str_htmlfile = ''
        # For each element in the list of HTML elements, append the string representation of that element (which will be a line of HTML code) to the string
        for elem in html_elem_list:
            str_htmlfile = str_htmlfile + str(elem)
        # Create a new file with a distinct variation of the name of the original file, then write the resultant HTML code to that file
        with open(pathname + "\\" + '_modified_' + file_name, 'wb') as file:
            file.write(str_htmlfile.encode('Latin1'))
Here is the input:
<?xml version='1.0' encoding='Latin1'?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<body class="calibre">
<p class="calibre5" id="calibre_pb_62">Note for Tyler</p>
<p class="calibre1">In the California registry, there was</p>
<p class="calibre1">a calm breeze blowing through the room. A woman</p>
<p class="calibre1">who must have just walked in quietly beckoned for the</p>
<p class="calibre1">counterman to approach to store her slip.</p>
<p class="calibre1">642</p>
</body></html>
Here is what I expect to happen:
<?xml version='1.0' encoding='Latin1'?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<body class="calibre">
<p class="calibre5" id="calibre_pb_62">Note for Tyler</p>
<p class="calibre1">In the California registry, there was a calm breeze blowing through the room. A woman who must have just walked in quietly beckoned for the counterman to approach to store her slip.642</p>
</body></html>
Here is the actual output:
<html lang="" xml:lang="" xmlns="http://www.w3.org/1999/xhtml">
<body class="calibre">
<p class="calibre5" id="calibre_pb_62">Note for Tyler</p>
<p class="calibre1">In the California registry, there was</p>
<p class="calibre1">a calm breeze blowing through the room. A woman</p>
<p class="calibre1">who must have just walked in quietly beckoned for the</p>
<p class="calibre1">counterman to approach to store her slip.</p>
<p class="calibre1">642</p>
</body></html><body class="calibre">
<p class="calibre5" id="calibre_pb_62">Note for Tyler</p>
<p class="calibre1">In the California registry, there was</p>
<p class="calibre1">a calm breeze blowing through the room. A woman</p>
<p class="calibre1">who must have just walked in quietly beckoned for the</p>
<p class="calibre1">counterman to approach to store her slip.</p>
<p class="calibre1">642</p>
</body><p class="calibre5" id="calibre_pb_62">Note for Tyler</p>
This can be done with BeautifulSoup by using extract() to remove the unwanted <p> elements, and then using new_tag() to create a new <p> tag containing the text from all of the removed elements.
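The approach works by finding runs of consecutive calibre1 tags. For each run, it first combines the text from all of the tags in the run and inserts a new tag before the first one. It then removes all of the old tags. The logic may need to be modified for more complicated scenarios in your EPUB files, but this should help you get started.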
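The steps described above can be sketched roughly as follows. This is a minimal sketch, not the answer's original code: the function name `merge_runs`, the helper `is_body_p`, and the inline sample (the question's input without its XML declaration) are assumptions for illustration. It also assumes that runs may be separated only by whitespace, that inline markup inside the merged paragraphs can be flattened to plain text, and that lines should be joined with a single space.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Sample taken from the question's input (XML declaration omitted).
html = """<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<body class="calibre">
<p class="calibre5" id="calibre_pb_62">Note for Tyler</p>
<p class="calibre1">In the California registry, there was</p>
<p class="calibre1">a calm breeze blowing through the room. A woman</p>
<p class="calibre1">who must have just walked in quietly beckoned for the</p>
<p class="calibre1">counterman to approach to store her slip.</p>
<p class="calibre1">642</p>
</body></html>"""


def is_body_p(node, cls):
    """True if node is a <p> whose class list contains exactly cls."""
    return isinstance(node, Tag) and node.name == "p" and cls in node.get("class", [])


def merge_runs(soup, cls="calibre1"):
    """Merge each run of consecutive <p class=cls> siblings into one <p>."""
    for first in soup.find_all("p", class_=cls):
        if first.parent is None:
            continue  # already extracted as part of an earlier run
        run = [first]
        node = first.next_sibling
        while node is not None:
            if is_body_p(node, cls):
                run.append(node)
            elif isinstance(node, NavigableString) and not node.strip():
                pass  # whitespace between paragraphs does not end a run
            else:
                break  # any other node ends the run
            node = node.next_sibling
        if len(run) < 2:
            continue  # nothing to merge
        merged = soup.new_tag("p")
        merged["class"] = cls
        merged.string = " ".join(p.get_text().strip() for p in run)
        first.insert_before(merged)  # new tag goes before the first of the run
        for p in run:
            p.extract()  # remove the old tags
    return soup


soup = merge_runs(BeautifulSoup(html, "html.parser"))
print(soup)
```

Note that the class check here is an exact match against the tag's class list. A test such as `'calibre' in 'calibre1'` is a substring check on strings and is also true, so it would treat the body element as body text.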
An alternative approach uses lxml to parse the XHTML file and build a new XHTML tree. This was tested with Python 3.5.
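An lxml-based version of the same idea might look like the sketch below. It is an illustration under stated assumptions rather than the answer's original code: the function name `merge_body_paragraphs` and the inline sample are hypothetical, only direct children of the body element are considered, inline markup inside merged paragraphs is flattened to plain text, and the XML declaration is left out of the sample so that the string can be parsed directly.

```python
from copy import deepcopy
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"
P_TAG = "{%s}p" % XHTML_NS
BODY_TAG = "{%s}body" % XHTML_NS

# Sample taken from the question's input (XML declaration omitted).
xhtml = """<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<body class="calibre">
<p class="calibre5" id="calibre_pb_62">Note for Tyler</p>
<p class="calibre1">In the California registry, there was</p>
<p class="calibre1">a calm breeze blowing through the room. A woman</p>
<p class="calibre1">who must have just walked in quietly beckoned for the</p>
<p class="calibre1">counterman to approach to store her slip.</p>
<p class="calibre1">642</p>
</body></html>"""


def merge_body_paragraphs(root, cls="calibre1"):
    """Rebuild <body>, merging consecutive <p class=cls> children into one."""
    body = root.find(BODY_TAG)
    new_body = etree.Element(body.tag, dict(body.attrib), nsmap=body.nsmap)
    new_body.text = "\n"
    new_body.tail = body.tail
    current = None  # merged <p> currently being extended, if any
    for child in body:
        is_body_text = child.tag == P_TAG and child.get("class") == cls
        if not is_body_text:
            copy = deepcopy(child)  # keep every other element unchanged
            copy.tail = "\n"
            new_body.append(copy)
            current = None  # a different element ends the run
            continue
        text = "".join(child.itertext()).strip()
        if current is None:
            # First paragraph of a run: start a new merged <p>
            current = etree.SubElement(new_body, P_TAG, {"class": cls})
            current.text = text
            current.tail = "\n"
        else:
            # Later paragraph of the run: append its text
            current.text += " " + text
    root.replace(body, new_body)
    return root


root = merge_body_paragraphs(etree.fromstring(xhtml))
print(etree.tostring(root, encoding="unicode"))
```

Building a fresh body element avoids modifying the tree while iterating over it, at the cost of copying the elements that are kept.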