Python and BeautifulSoup: removing empty tags from a soup object

Posted 2024-10-04 11:33:06


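Long-time reader, but I only recently created an account. This is my second attempt at asking a question here. I am fairly new to Python but have programming experience, and I am also very new to web scraping.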

Problem

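I wrote a function that downloads a series of HTML files with very similar formats. I then parse the HTML files with BeautifulSoup and eventually load the data into SQL tables. I am running a gap analysis against columns/tables that already exist to see how much they differ. I am trying to read a certain HTML tag, and in some cases there is an extra, empty set of tags. All I really want to do is remove this extra entry and move on. I have tried the decompose() function, and I have also tried referencing the value by index and deleting it.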

<dt class="dlterm"></dt>

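This throws off my columns later, when I try to store the column name, data type, and description as one record. I cannot figure out how to remove it and continue parsing the file.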

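I can get Python to find <dt class="dlterm"></dt>, and I have tried the decompose() and pop() methods. I even considered using an offset: set a variable to 1 when the empty tag is found, then somehow shift the rest of the code by 1 in the remaining iterations of the loop.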

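One solution I have already found is to open the source file and replace the <dt class="dlterm"></dt> tag before I read it with BeautifulSoup, which sidesteps the problem entirely. To borrow a phrase from an old colleague, that is the "weasel" way out. It would work, but it seems like a lot of code for such a simple problem.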
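For illustration, the pre-parse replacement described above could be sketched roughly as follows. The helper name parse_without_empty_terms and the sample string are hypothetical, and the exact-string match only works if the empty tag always appears with exactly this spelling.

```python
from bs4 import BeautifulSoup

# The literal empty tag quoted in the question.
EMPTY_DT = '<dt class="dlterm"></dt>'


def parse_without_empty_terms(raw_html):
    """Strip the literal empty <dt> from the raw text, then parse it."""
    cleaned = raw_html.replace(EMPTY_DT, "")
    return BeautifulSoup(cleaned, "html.parser")


# Hypothetical sample standing in for one downloaded file's contents.
sample = '<dl><dt class="dlterm"></dt><dt class="dlterm">Term3</dt></dl>'
soup = parse_without_empty_terms(sample)
terms = [dt.text for dt in soup.find_all("dt", {"class": "dlterm"})]
print(terms)  # ['Term3']
```

This stays short, but it is brittle: any variation in whitespace or attribute order inside the empty tag would slip past the string match.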

Question

I thought the soup object was a list, but it does not behave like one. What is the proper term for a soup object?
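As a quick check of that assumption, the sketch below inspects the types involved; the sample HTML is made up for illustration. In bs4 the soup is the root of a parse tree (BeautifulSoup subclasses Tag), while find_all() returns a ResultSet, which is a list subclass holding Tag objects.

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet, Tag

# Made-up sample containing one normal and one empty <dt>.
html = '<dl><dt class="dlterm">Term1</dt><dt class="dlterm"></dt></dl>'
soup = BeautifulSoup(html, "html.parser")

# The soup itself is the root of a parse tree, not a list.
print(isinstance(soup, list))  # False
print(isinstance(soup, Tag))   # True

# find_all() returns a ResultSet, which behaves like a list of Tag objects.
terms = soup.find_all("dt", {"class": "dlterm"})
print(isinstance(terms, ResultSet))  # True
print(isinstance(terms, list))       # True
print(len(terms))                    # 2
```

So list operations such as pop() apply to the result of find_all(), while tree operations such as decompose() apply to individual tags in the soup.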

Python snippet

from bs4 import BeautifulSoup

# Load the cursor/recordset
myrecordset = mycursor.fetchall()

# Outer loop
for y in myrecordset:

    # NOTE: the path was anonymized in the post; as written, the string
    # has no %s placeholder for y[2] to fill
    myfilepath = "myexample.html" % y[2]
    with open(myfilepath) as fh:
        soup = BeautifulSoup(fh, "html.parser")

    PageName = soup.find("h1", {"class": "topictitle1"})

    # print ("PageName: " + PageName.text)
    FieldName = soup.find_all("dt", {"class": "dlterm"})
    FieldDataType = soup.find_all("samp", {"class": "codeph"})
    FieldDesc = soup.find_all("dd", {"class": "ddexpand"})
    # outercounter = -1
    #
    # #Fix the empty value issue early that is offsetting everything
    # for z in FieldName:
    #     outercounter+=1
    #     # FieldName[7].decompose()
    #     if z.text == '': # '<dt class="dlterm"></dt>':
    #         z.decompose()
    #
    #         # FieldName[outercounter-1].pop()

    # How to get the description cleaned up
    # FieldDesc[2].text.replace('\n','').replace('      ', ' ')
    # print(FieldName.text)
    # print(FieldDataType.text)
    # print(FieldDesc.text)

    # inner loop
    innercounter1 = 0
    # zip allows me to iterate through multiple lists at the same time
    for (fn, fdt, fd) in zip(FieldName, FieldDataType, FieldDesc):

        fntemp = ''
        fdttemp = ''
        fdtemp = ''

        fntemp = fn.text
        fdttemp = fdt.text

        # clean the string
        if 'One of:' in fd.text:
            # hold onto the double return while I replace the others.
            fdtemp = fd.text.replace('\n\n', '<<nn>>')
            fdtemp = fdtemp.replace('\n', ', ')
            fdtemp = fdtemp.replace('<<nn>>', '\n')
        else:
            fdtemp = fd.text.replace('\n', ' ')

        fdtemp = fdtemp.strip()

        # remove all redundant spaces from the string
        fdtemp = " ".join(fdtemp.split())
        # have to escape single quotes in text so it will insert correctly
        fdtemp = fdtemp.replace("'", "''")

        # Insert into SQL

        # ... code continued

HTML file snippet showing the problem

<div class="section">
<h2 class="sectiontitle">Title</h2>
<dl>
<dt class="dlterm">Term1</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah blah about term1</dd>
<dt class="dlterm">Term2</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah blah about term2</dd>
<dt class="dlterm"></dt><dt class="dlterm">Term3</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah about term3</dd>
</dl></div>

If anyone can help me solve this, that would be great.


1 answer
Anonymous user
Reply #1 · posted 2024-10-04 11:33:06

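decompose() is enough to solve your problem. Call it on the empty tags before you build the lists that you later zip together.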

from bs4 import BeautifulSoup
html="""
<div class="section">
<h2 class="sectiontitle">Title</h2>
<dl>
<dt class="dlterm">Term1</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah blah about term1</dd>
<dt class="dlterm">Term2</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah blah about term2</dd>
<dt class="dlterm"></dt><dt class="dlterm">Term3</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah about term3</dd>
</dl></div>
"""
soup=BeautifulSoup(html,'html.parser')
for tag in soup.find_all('dt', attrs={"class": "dlterm"}):  # all dt tags with class dlterm
    if not tag.text:  # the tag has no text, i.e. it is empty
        tag.decompose()
print(soup)

Output

<div class="section">
<h2 class="sectiontitle">Title</h2>
<dl>
<dt class="dlterm">Term1</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah blah about term1</dd>
<dt class="dlterm">Term2</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah blah about term2</dd>
<dt class="dlterm">Term3</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah about term3</dd>
</dl></div>
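
One detail may explain why the commented-out decompose() attempt in the question appeared to do nothing: decompose() removes a tag from the tree, but a list already returned by find_all() still holds the removed entry. The sketch below, using a shortened and partly made-up version of the sample HTML, shows the stale list and then re-queries after the cleanup so the three lists line up for zip(). The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<dl>
<dt class="dlterm">Term1</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">About term1</dd>
<dt class="dlterm"></dt><dt class="dlterm">Term3</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">About term3</dd>
</dl>
"""
soup = BeautifulSoup(html, "html.parser")

# A list collected BEFORE the cleanup keeps its reference to the empty tag.
stale_names = soup.find_all("dt", {"class": "dlterm"})

# Remove every <dt class="dlterm"> that contains no text.
for tag in soup.find_all("dt", {"class": "dlterm"}):
    if not tag.get_text(strip=True):
        tag.decompose()  # edits the tree, not any existing list

print(len(stale_names))  # 3: the old list still has three entries

# Re-query AFTER the cleanup so names, types, and descriptions align.
field_names = soup.find_all("dt", {"class": "dlterm"})
field_types = soup.find_all("samp", {"class": "codeph"})
field_descs = soup.find_all("dd", {"class": "ddexpand"})

rows = [
    (fn.get_text(strip=True), fdt.get_text(strip=True), fd.get_text(strip=True))
    for fn, fdt, fd in zip(field_names, field_types, field_descs)
]
print(rows)
# [('Term1', 'nonNegativeInteger', 'About term1'),
#  ('Term3', 'nonNegativeInteger', 'About term3')]
```

Using get_text(strip=True) for the emptiness check also catches tags that contain only whitespace, which a plain comparison against '' would miss.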
