How to find img tag URLs in a CSV column that contains many links, and compare them with the same links in another CSV file

import csv

# Read input Topic or Reply file
csvfile = open('rad.csv', newline='')
reader = csv.reader(csvfile)
csvfile1 = open('new.csv', newline='')
reader1 = csv.reader(csvfile1)

# Extract image sources
for row in reader:
    content = row[8]
    imageExists = "<img" in content and "src=\"" in content
    #print(imageExists)
    imageNum = 1
    while (imageExists):
        startPos = content.find("src=\"") + 5
        endPos = content.find("\"", startPos)
        imageSrc = content[startPos:endPos]
        print(imageSrc)
        content = content[endPos + 1:]
        imageExists = "<img" in content and "src=\"" in content
        #print(imageExists)
        for row1 in reader1:
            #print("In For")
            content1 = row1[1]
            content2 = row1[7]
            print(content1)
            #print(imageSrc)
            if content1 == imageSrc:
                row = imageSrc.replace(imageSrc,row1[7])
                print("Done Match Found")
                print(content2)
                break
            else:
                print("No Match")
        #imageExists = "<img" in content and "src=\"" in content
        #print(imageExists)
        imageNum += 1

1 answer

User

#1 · Posted 2024-10-04 09:24:27

I suggest using BeautifulSoup rather than trying to parse the HTML as a string. In the example below, I assume that the HTML entries contain no commas.

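from bs4 import BeautifulSoup

# Read the whole file and split it into lines
with open('example.txt', 'r') as file_handle:
    example_file_content = file_handle.read().split("\n")

list_of_image_sources = []
for line in example_file_content:
    # Naive split: only valid while no field contains a comma
    line_as_list = line.split(",")
    for entry in line_as_list:
        soup = BeautifulSoup(entry.strip(), 'html.parser')
        # find_all is the current name of the older findAll alias
        images = soup.find_all('img')
        for image in images:
            # Skip <img> tags that have no src attribute
            if image.has_attr('src'):
                list_of_image_sources.append(image['src'])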
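If the HTML fields can contain commas (for example inside an alt attribute), splitting each line on "," will break them apart. A minimal sketch of one way around that is to let the csv module do the field splitting. To keep the sketch free of third-party dependencies it swaps BeautifulSoup for the standard library's html.parser.HTMLParser; the names ImgSrcCollector and image_sources, the column index, and the sample data are all hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def image_sources(csv_file, column):
    """Return every <img> src found in one column of an open CSV file."""
    sources = []
    for row in csv.reader(csv_file):
        if len(row) > column:
            # A fresh parser per cell keeps broken HTML in one row
            # from leaking into the next.
            collector = ImgSrcCollector()
            collector.feed(row[column])
            collector.close()
            sources.extend(collector.sources)
    return sources


# Hypothetical sample data: the HTML field contains commas and quotes.
sample = io.StringIO(
    'id,body\n'
    '1,"<p>Hi, there</p><img src=""http://example.com/a.png"" alt=""a, b"">"\n'
    '2,"no image here"\n'
    '3,"<img src=""http://example.com/b.png""><img alt=""no src"">"\n'
)
print(image_sources(sample, column=1))
```

With the real files, the same function could be called as image_sources(open('rad.csv', newline=''), 8), assuming the HTML lives in column 8 as in the question.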

Once you have a list of image sources for each file, you can compare the lists from the two CSV files.
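The comparison step could be sketched as follows. Note that a csv.reader is an iterator, so looping over the second file inside the loop over the first (as the question's code does) only reads it once; loading it into a dictionary up front avoids that. The function names, column indexes, and URLs below are hypothetical; in the question the key would be column 1 and the replacement column 7 of new.csv.

```python
import csv
import io


def build_lookup(csv_file, key_column, value_column):
    """Map each key-column value to the value-column entry of the same row."""
    lookup = {}
    for row in csv.reader(csv_file):
        # Ignore rows that are too short to hold both columns.
        if len(row) > max(key_column, value_column):
            lookup[row[key_column]] = row[value_column]
    return lookup


def match_sources(sources, lookup):
    """Return (matched, missing): a dict of source -> replacement, and unmatched sources."""
    matched = {src: lookup[src] for src in sources if src in lookup}
    missing = [src for src in sources if src not in lookup]
    return matched, missing


# Hypothetical stand-in for the second CSV file: old URL, new URL.
new_csv = io.StringIO(
    "http://example.com/a.png,http://cdn.example.com/a.png\n"
    "http://example.com/c.png,http://cdn.example.com/c.png\n"
)
# Hypothetical image sources extracted from the first CSV file.
sources = ["http://example.com/a.png", "http://example.com/b.png"]

lookup = build_lookup(new_csv, key_column=0, value_column=1)
matched, missing = match_sources(sources, lookup)
print(matched)  # sources found in the second file, with their replacements
print(missing)  # sources with no counterpart in the second file
```

A dictionary lookup also keeps the comparison at one pass over each file instead of rescanning the second file for every image.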
