Python: replacing HTML in a list does not work

def check_td(html):
    html_string = html.replace(' ', '')
    soup = BeautifulSoup(html_string)
    tds = soup.findAll("td", {"class":"confluenceTd"}) # Find all td tags
    temp_holder = {} # Temporary dictionary: position -> new tag
    html_as_list = list(html_string)
    html_join = "".join(html_as_list)
    prev_tag = []
    for td_name in tds:
        ...code omitted...
        #print("TD name: ",td_name)
        tag_index = '<td colspan="1" class="confluenceTd">'
        end_tag = '</td>'
        get_tag_value = str(td_name.text.strip())
        get_text_index = html_string.index(get_tag_value) # Index where the text starts
        position_f = get_text_index - len(tag_index) # Index of the first character of the line ("<" in our case)
        full_tag = tag_index + get_tag_value + end_tag # e.g. <td colspan="1" class="confluenceTd">DDD-3103</td>
        #print(html_string.index(str(position_f)))
        result1 = search_j(get_tag_value) # Check whether the text contains a Jira issue
        if result1 == -1:
            continue
        az_id = str(query_a(result1))
        res1 = re.sub(result1, " #"+az_id+" ", full_tag) # e.g. <td colspan="1" class="confluenceTd"> #11111 </td>
        #print("Add key: ", position_f, " values: ",res1)
        # Check if there is any duplicate value
        # If a duplicate exists, get its positions in the original list
        if get_tag_value in prev_tag:
            dup = list_duplicates_of(html_join, get_tag_value)
            for dupl in dup:
                position_s = dupl - len(tag_index)
                temp_holder[position_s] = res1
        else:
            temp_holder[position_f] = res1
        #print(res1)
        prev_tag.append(get_tag_value)
    for html_keys, html_values in temp_holder.items():
        # Replace the old line with the new one
        #print(html_keys + len(html_values))
        #sys.exit()
        html_as_list[html_keys] = html_values
        print("P: ",html_keys, "V: ",html_values)
    html_fin = "".join(html_as_list)
    return html_fin

filename = 'PoPs.md'
with open(filename, "r") as f:
    html_string = f.read()
result = check_td(html_string)
save_filename = "test.md"
w = open(save_filename, "a")
w.write(str(result))
w.close()

1 answer

User

#1 · Posted 2024-06-29 00:47:27

The whole problem is that you expect too much from this code.

html_as_list = list(html_string) creates a list of single characters, like this:

[..., "<", "t", "d", " ", "c", "o", "l", "s", "p", "a", "n", ...]

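When you use html_as_list[html_keys] = ..., you replace a single element of this list, which is a single character. The rest of the old tag stays where it was.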
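The effect is easy to see on a tiny input. The following is a minimal sketch with made-up strings (the tag and the ID are illustrative, not taken from the real file):

```python
# Hypothetical miniature of the question's data, for illustration only.
html_string = '<td>DDD-3103</td>'
new_tag = '<td> #1103 </td>'

html_as_list = list(html_string)        # one list element per character
position = html_string.index('<td>')    # position of the "<" of the old tag

# Assigning to a single index swaps out only the "<" at that position;
# every other character of the old tag is still in the list.
html_as_list[position] = new_tag

result = "".join(html_as_list)
print(result)  # <td> #1103 </td>td>DDD-3103</td>
```

The new tag is inserted in full, but the old tag minus its first character follows right after it, which is why the output looks corrupted rather than replaced.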

You could try a slice, html_as_list[html_keys:position_f] = list(text), but if you put in longer or shorter text, the list changes size and every element after it moves to a different position.
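If the index-based approach has to stay, one common workaround (an assumption on my part, not something the answer proposes) is to apply the slice replacements from the end of the text towards the start, so that earlier positions are not shifted by later edits. `replace_spans` and the sample data are hypothetical names for this sketch, and the spans are assumed not to overlap:

```python
def replace_spans(text, spans):
    """Replace (start, end, new_text) spans, working back to front."""
    chars = list(text)
    # Highest start index first: edits never move a span that is still pending.
    for start, end, new_text in sorted(spans, reverse=True):
        chars[start:end] = list(new_text)
    return "".join(chars)

html_string = '<td>A-1</td><td>A-2</td>'

# All positions are computed against the unmodified string.
spans = []
for old, new in [('A-1', ' #100 '), ('A-2', ' #200 ')]:
    start = html_string.index(old)
    spans.append((start, start + len(old), new))

result = replace_spans(html_string, spans)
print(result)  # <td> #100 </td><td> #200 </td>
```

Going front to back with the same precomputed positions would write the second replacement three characters too early, because the first replacement is three characters longer than the text it replaces.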

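You could also try plain text.replace(), but it changes the length of the string as well, so the next element ends up at a different position. You would have to do each replacement before searching for the next element.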
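For plain string processing, a different technique that avoids tracking positions altogether is `re.sub` with a replacement function: the regex engine finds each match and substitutes it in a single pass. This is a sketch only; `ISSUE_IDS` is a hypothetical stand-in for the question's `query_a()`, and the `DDD-` key format is assumed from the sample data:

```python
import re

# Hypothetical mapping from Jira keys to target IDs (stand-in for query_a()).
ISSUE_IDS = {'DDD-3102': 1102, 'DDD-3103': 1103}

def replace_issue(match):
    """Return the replacement text for one matched Jira key."""
    key = match.group(0)
    if key not in ISSUE_IDS:
        return key  # leave unknown keys untouched
    return f" #{ISSUE_IDS[key]} "

html_string = '<td>DDD-3102</td><td>DDD-3103</td><td>DDD-9999</td>'
result = re.sub(r'DDD-\d+', replace_issue, html_string)
print(result)  # <td> #1102 </td><td> #1103 </td><td>DDD-9999</td>
```

Duplicate keys need no special handling here, because every occurrence is matched and replaced independently.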

If you want to replace text or tags in HTML, simply use BeautifulSoup for the replacement itself:

  item.string = "new text"

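find_all (like the other search functions) returns references to elements in the HTML tree, so changing them changes the values in the original document.

BTW: for assignment it has to be .string, not .text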
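The difference between the two attributes matters most when a cell contains nested tags. A small sketch with an invented two-cell input, assuming `bs4` is installed:

```python
from bs4 import BeautifulSoup

# Invented input: one plain cell and one cell with a nested <a> tag.
html = '<td>A</td><td>Special: <a href="">DDD-3103</a></td>'
soup = BeautifulSoup(html, 'html.parser')
simple, nested = soup.find_all('td')

# A tag with exactly one text child exposes it as .string.
simple_string = simple.string

# A tag with several children has no single string, so .string is None,
# while .text joins the text of all descendants.
nested_string = nested.string
nested_text = nested.text

# Assigning to .string replaces the entire contents of the tag,
# including the nested <a>.
nested.string = 'new text'
result = str(soup)
print(result)
```

Because the assignment wipes out nested tags, it is the right tool for cells that hold only text. Cells with links inside need a more targeted replacement.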

Minimal working example

from bs4 import BeautifulSoup as BS

html = """
<td colspan="1" class="confluenceTd">A</td>
<td colspan="1" class="confluenceTd">B</td>
<td colspan="1" class="confluenceTd">C</td>
"""

soup = BS(html, 'html.parser')

tds = soup.find_all('td', class_='confluenceTd')

for item in tds:
    if item.string == 'B':
        item.string = 'Hello World'
        
html = str(soup)
print(html)

Result (B is replaced with Hello World)

<td class="confluenceTd" colspan="1">A</td>
<td class="confluenceTd" colspan="1">Hello World</td>
<td class="confluenceTd" colspan="1">C</td>

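EDIT:

This version takes a single td (a bs4.element.Tag) from tds, converts it to a string, replaces the text in that string, parses the new string back into an element, and replaces the td in soup with it. This way it never needs to handle the full HTML as one string.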

from bs4 import BeautifulSoup as BS
import re

html = """
<tr>
<td colspan="1" class="confluenceTd">DDD-3102</td>
<td colspan="1" class="confluenceTd">Special: <a href="">DDD-3103</a></td>
<td colspan="1" class="confluenceTd">DDD-3104<a href="">see more</a></td>
</tr>
"""

def search_j(text):
    """Simulated function: find the first Jira key in the text."""
    result = re.search(r'DDD-\d+', text)
    print('result:', result)
    if result:
        return result[0]    
    return -1
    
def query_a(item):
    """Simulated function: look up the ID for a Jira key."""
    data = {'DDD-3102': 1102, 'DDD-3103': 1103, 'DDD-3104': 1104}
    return data[item]

def test_bs(html):
    soup = BS(html, 'html.parser')
    
    tds = soup.find_all('td', class_='confluenceTd')
    
    for item in tds:
        # convert the `bs4.element.Tag` to a string
        item_html = str(item)
        #print('item_html:', item_html)

        # search `jira`        
        result = search_j(item_html)
        if result == -1:
            continue
        
        # get `id` for `jira`
        az_id = str(query_a(result))
        
        # replace it
        new_item_html = item_html.replace(result, " #"+az_id+" ")
        #print('new_item_html:', new_item_html)
        
        # parse the string back into a BeautifulSoup element
        new_item = BS(new_item_html, 'html.parser')
        
        # replace `item` in `soup`
        item.replace_with(new_item)
            
    print(soup)

test_bs(html)

Before:

<tr>
<td colspan="1" class="confluenceTd">DDD-3102</td>
<td colspan="1" class="confluenceTd">Special: <a href="">DDD-3103</a></td>
<td colspan="1" class="confluenceTd">DDD-3104<a href="">see more</a></td>
</tr>

After:

<tr>
<td class="confluenceTd" colspan="1"> #1102 </td>
<td class="confluenceTd" colspan="1">Special: <a href=""> #1103 </a></td>
<td class="confluenceTd" colspan="1"> #1104 <a href="">see more</a></td>
</tr>
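As a variation on the version above (my own sketch, not part of the original answer), the text nodes can also be edited directly, without converting each td to a string and re-parsing it. It relies on `find_all(string=...)`, which returns the matching text nodes, and on `replace_with`. `ISSUE_IDS` and `to_id` are hypothetical stand-ins for `query_a()`:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical lookup table standing in for query_a().
ISSUE_IDS = {'DDD-3102': 1102, 'DDD-3103': 1103, 'DDD-3104': 1104}
PATTERN = re.compile(r'DDD-\d+')

def to_id(match):
    """Map one Jira key to ' #<id> '; unknown keys are left unchanged."""
    key = match.group(0)
    return f" #{ISSUE_IDS[key]} " if key in ISSUE_IDS else key

html = """
<tr>
<td colspan="1" class="confluenceTd">DDD-3102</td>
<td colspan="1" class="confluenceTd">Special: <a href="">DDD-3103</a></td>
<td colspan="1" class="confluenceTd">DDD-3104<a href="">see more</a></td>
</tr>
"""

soup = BeautifulSoup(html, 'html.parser')

for td in soup.find_all('td', class_='confluenceTd'):
    # Only text nodes containing a Jira key are returned; tags stay untouched.
    for node in td.find_all(string=PATTERN):
        node.replace_with(PATTERN.sub(to_id, node))

result = str(soup)
print(result)
```

Since only text nodes are rewritten, a key that happens to appear inside an attribute value (for example in an href) would be left alone, which may or may not be what is wanted.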
