Python: replacing HTML inside a list does not work

Posted 2024-06-29 00:47:27


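I have the following code, which loads all td elements from an md file and searches them for specific partners. Part of the code is shown below.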

def check_td(html):
    html_string = html.replace(' ', '')
    soup = BeautifulSoup(html_string)
    tds = soup.findAll("td", {"class":"confluenceTd"}) # Find all td tags
    temp_holder={} # Get a temp dictionary

    html_as_list = list(html_string)
    html_join = "".join(html_as_list)
    prev_tag = []

 

    for td_name in tds:
...code omitted...
...code omitted...
            #print("TD name: ",td_name)
            tag_index = '<td colspan="1" class="confluenceTd">'
            end_tag = '</td>'
          
            get_tag_value = str(td_name.text.strip())         
            get_text_index = html_string.index(get_tag_value) # Get the index which text start 
            position_f = get_text_index - len(tag_index) # Get the index of the first character of the line "**<**" in our case

            full_tag = tag_index + get_tag_value + end_tag # Will print <td colspan="1" class="confluenceTd">DDD-3103</td>
            #print(html_string.index(str(position_f)))

            result1 = search_j(get_tag_value) # Get the text from p tag and check if it has jira issue
            if result1 == -1:
                continue
 
            az_id = str(query_a(result1))         

            res1 = re.sub(result1, " #"+az_id+" ", full_tag) # Will return <td colspan="1" class="confluenceTd"> #11111 </td>
            #print("Add key: ", position_f, " values: ",res1)

 

            # Check if there is any duplicate value
            # If a duplicate exists, get its positions in the original list
            if get_tag_value in prev_tag:
                dup = list_duplicates_of(html_join, get_tag_value)
                for dupl in dup:
                    position_s = dupl - len(tag_index)
                    temp_holder[position_s] = res1
            else:
                temp_holder[position_f] = res1       
            #print(res1)
            prev_tag.append(get_tag_value)

    for html_keys, html_values in temp_holder.items(): #Replace the old line with the new one
        #print(html_keys + len(html_values))
        #sys.exit()
        html_as_list[html_keys] = html_values
        print("P: ",html_keys, "V: ",html_values)

    html_fin = "".join(html_as_list)
    
    return html_fin


filename = 'PoPs.md'

with open(filename, "r") as f:
    html_string = f.read()
 
result = check_td(html_string)

save_filename="test.md"
#
w = open (save_filename, "a")
w.write(str(result))
w.close()

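I use a temporary dictionary to hold the updated values, as follows:
Key: the position of the first character of the line, "<" in our case
Value: the updated value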

The print statement shows:

P: 2651  V: "<td colspan="1" class="confluenceTd"> #11111 </td>"

The final result, as saved in the file, is below

...output omitted...
<td colspan="1" class="confluenceTd"> #11111 </td>td colspan="1" class="confluenceTd">DDD-3103</td>
...output omitted...

As you can see, only the "<" was replaced, not the whole line.

I expected it to replace

"<td colspan="1" class="confluenceTd">DDD-3103</td>" with "<td colspan="1" class="confluenceTd"> #11111 </td>"

What am I missing to make this work?

Any ideas?


1 answer
Answer 1 · posted 2024-06-29 00:47:27

The whole problem is that you expect too much from a single list assignment.

Using html_as_list = list(html_string) creates a list of single characters like this:

[..., "<", "t", "d", " ", "c", "o", "l", "s", "p", "a", "n", ...]

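When you use html_as_list[html_keys] = ..., you replace a single element of this list, i.e. a single character. Only the "<" at that index is swapped for your new string; the rest of the old tag stays in the list, which is why it still appears in the output.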
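A minimal sketch of this effect, using a shortened, made-up tag instead of the real file. The names `old`, `new`, `chars` and `start` are only for illustration and are not from the question:

```python
# Hypothetical, shortened input; not the asker's real data.
old = '<td>DDD-3103</td>'
new = '<td> #11111 </td>'

chars = list(old)   # one list element per character
start = 0           # index of the "<" that opens the tag

# Assigning to a single index swaps out exactly one element (the "<").
# All other characters of the old tag remain in the list.
chars[start] = new

result = "".join(chars)
print(result)
```

The list keeps its original number of elements; one of them is now a long string, so the joined result contains the new tag followed by the old tag minus its first character.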

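You could try slice assignment, html_as_list[start:end] = list(text), with start at the first character of the old tag and end just past its last character. But if the new text is longer or shorter than the old one, the list changes size and every later element moves to a different position.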
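A small sketch of slice assignment and the shift it causes, again on made-up data. The variable names and the sample row are assumptions for illustration only:

```python
# Hypothetical sample row with two cells.
old_tag = '<td>DDD-3103</td>'
new_tag = '<td> #1103 </td>'
html = '<tr>' + old_tag + '<td>DDD-3104</td></tr>'

chars = list(html)
start = html.index(old_tag)
end = start + len(old_tag)           # slice end is exclusive

# Position of the next cell's text before the replacement.
next_before = html.index('DDD-3104')

# Replace the whole old tag, character by character.
chars[start:end] = list(new_tag)
updated = "".join(chars)

# The same text now sits at a different index.
next_after = updated.index('DDD-3104')
shift = len(new_tag) - len(old_tag)
print(updated)
print(next_before, next_after, shift)
```

Any index computed before the replacement (such as the keys collected in temp_holder) is off by `shift` for everything located after the replaced tag.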

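You could also use plain text.replace(), but it changes the length of the string too, so later elements end up at different positions. That means you have to replace each element before you search for the next one.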
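One way to follow that advice is to search, replace immediately, and then continue searching from the end of the replacement, so no stored index ever goes stale. This sketch splices the string by hand rather than calling str.replace(); `ISSUE_IDS` is a hypothetical lookup table standing in for the asker's query_a():

```python
import re

ISSUE_RE = re.compile(r"DDD-\d+")

# Hypothetical lookup table standing in for query_a().
ISSUE_IDS = {"DDD-3102": 1102, "DDD-3103": 1103}

html = '<td>DDD-3102</td><td>DDD-3103</td><td>DDD-3102</td>'

pos = 0
while True:
    # Search the current (already updated) string from `pos` onwards.
    match = ISSUE_RE.search(html, pos)
    if match is None:
        break
    issue_id = ISSUE_IDS.get(match.group(0))
    if issue_id is None:
        pos = match.end()            # unknown key: leave it and move on
        continue
    replacement = " #%d " % issue_id
    # Replace right away, then resume after the inserted text.
    html = html[:match.start()] + replacement + html[match.end():]
    pos = match.start() + len(replacement)

print(html)
```

Duplicate keys need no special handling here, because each occurrence is found and replaced in turn.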


If you want to replace text or tags in HTML, simply use BeautifulSoup:

  item.string = "new text"

find_all (like the other search functions) gives you references to the elements in the HTML tree, so you can change values in the original HTML.

BTW: it has to be .string, not .text
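A short sketch of the difference, assuming a bs4 release in which .text is a read-only property (true of the releases I have checked, but treat it as an assumption). The one-cell sample is made up:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<td class="confluenceTd">B</td>', "html.parser")
td = soup.find("td")

# Reading: for a tag with a single text child, both give the same text.
text_before = td.text
string_before = td.string

# Writing: assigning to .string replaces the tag's contents in the tree.
td.string = "Hello World"
html_after = str(soup)
print(html_after)

# Assumption: .text has no setter, so this assignment is expected to
# raise AttributeError instead of changing the tree.
try:
    td.text = "ignored"
    text_is_writable = True
except AttributeError:
    text_is_writable = False
print("text writable:", text_is_writable)
```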


Minimal working example

from bs4 import BeautifulSoup as BS

html = """
<td colspan="1" class="confluenceTd">A</td>
<td colspan="1" class="confluenceTd">B</td>
<td colspan="1" class="confluenceTd">C</td>
"""

soup = BS(html, 'html.parser')

tds = soup.find_all('td', class_='confluenceTd')

for item in tds:
    if item.string == 'B':
        item.string = 'Hello World'
        
html = str(soup)
print(html)

Result (B is replaced with Hello World):

<td class="confluenceTd" colspan="1">A</td>
<td class="confluenceTd" colspan="1">Hello World</td>
<td class="confluenceTd" colspan="1">C</td>

EDIT:

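This version takes each td (an Element.Tag, i.e. bs4.element.Tag) from tds, converts it to a string, replaces the text in that string, converts the new string back to an Element.Tag, and replaces the td in soup. This way it does not need the full HTML as a string.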

from bs4 import BeautifulSoup as BS
import re

html = """
<tr>
<td colspan="1" class="confluenceTd">DDD-3102</td>
<td colspan="1" class="confluenceTd">Special: <a href="">DDD-3103</a></td>
<td colspan="1" class="confluenceTd">DDD-3104<a href="">see more</a></td>
</tr>
"""

def search_j(text):
    """Simulate function."""
    # raw string so that `\d` reaches the regex engine unchanged
    result = re.search(r'DDD-\d+', text)
    print('result:', result)
    if result:
        return result[0]
    return -1

def query_a(item):
    """Simulate function."""
    data = {'DDD-3102': 1102, 'DDD-3103': 1103, 'DDD-3104': 1104}
    return data[item]

def test_bs(html):
    soup = BS(html, 'html.parser')
    
    tds = soup.find_all('td', class_='confluenceTd')
    
    for item in tds:
        # convert `Element.Tag` to `string`
        item_html = str(item)
        #print('item_html:', item_html)

        # search `jira`        
        result = search_j(item_html)
        if result == -1:
            continue
        
        # get `id` for `jira`
        az_id = str(query_a(result))
        
        # replace it
        new_item_html = item_html.replace(result, " #"+az_id+" ")
        #print('new_item_html:', new_item_html)
        
        # convert `string` back to `Element.Tag`
        new_item = BS(new_item_html, 'html.parser')
        
        # replace `item` in `soup`
        item.replace_with(new_item)
            
    print(soup)

test_bs(html)        

Before:

<tr>
<td colspan="1" class="confluenceTd">DDD-3102</td>
<td colspan="1" class="confluenceTd">Special: <a href="">DDD-3103</a></td>
<td colspan="1" class="confluenceTd">DDD-3104<a href="">see more</a></td>
</tr>

After:

<tr>
<td class="confluenceTd" colspan="1"> #1102 </td>
<td class="confluenceTd" colspan="1">Special: <a href=""> #1103 </a></td>
<td class="confluenceTd" colspan="1"> #1104 <a href="">see more</a></td>
</tr>
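As a possible variation (not the method used above), the text nodes can be edited directly instead of round-tripping each td through a string. This sketch assumes a bs4 version whose find_all accepts the string= argument; `ISSUE_IDS`, `to_id` and `replace_issue_keys` are hypothetical names, with `ISSUE_IDS` standing in for query_a():

```python
import re
from bs4 import BeautifulSoup

ISSUE_RE = re.compile(r"DDD-\d+")

# Hypothetical lookup table standing in for query_a().
ISSUE_IDS = {"DDD-3102": 1102, "DDD-3103": 1103, "DDD-3104": 1104}


def to_id(match):
    """Return ' #<id> ' for a known issue key, else the key unchanged."""
    key = match.group(0)
    if key in ISSUE_IDS:
        return " #%d " % ISSUE_IDS[key]
    return key


def replace_issue_keys(html):
    soup = BeautifulSoup(html, "html.parser")
    # find_all(string=...) returns the matching text nodes, not the tags.
    for node in soup.find_all(string=ISSUE_RE):
        node.replace_with(ISSUE_RE.sub(to_id, str(node)))
    return str(soup)


sample = '<td class="confluenceTd">Special: <a href="">DDD-3103</a></td>'
result = replace_issue_keys(sample)
print(result)
```

Because only text nodes are touched, an issue key that happens to appear inside an attribute value (for example in an href) is left alone, whereas the string-based version above would rewrite it as well.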
