<p><strong>If you don't care about the div names, here is a one-liner:</strong></p>
<pre><code>import re
with open("data.html", "r") as msg:
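    data = msg.readlines()
# Keep only the lines containing "href", rewrite each one as "link title",
# then split on whitespace to get a (link, title) tuple.
data = [tuple(re.sub(r'.+href = "(.+)",.+title = "(.+)".+',r'\1'+' '+r'\2',v).split()) for v in [v.strip() for v in data if "href" in v]]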
</code></pre>
<p>Output:</p>
<pre><code>[('link1', 'tltle1'), ('link2', 'tltle2'), ('link3', 'tltle3'), ('link1', 'tltle1'), ('link2', 'tltle2'), ('link3', 'tltle3'), ('link1', 'tltle1'), ('link2', 'tltle2'), ('link3', 'tltle3')]
</code></pre>
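<p>For readability, the same idea can be unpacked into a small function. This is only a sketch: the name <code>extract_links</code> and the sample lines are hypothetical, and the line format (<code>href = "...", title = "..."</code>) is assumed from the regex above. Unlike the one-liner, it does not split on whitespace, so values containing spaces stay intact.</p>
<pre><code>import re

# Assumed line format, inferred from the regex above:
#   < a href = "link1", title = "tltle1">
LINK_RE = re.compile(r'href\s*=\s*"([^"]*)".*?title\s*=\s*"([^"]*)"')

def extract_links(lines):
    """Return (href, title) tuples for every line that has both attributes."""
    pairs = []
    for line in lines:
        match = LINK_RE.search(line)
        if match:
            pairs.append(match.groups())
    return pairs

# Hypothetical sample lines standing in for the contents of data.html
sample = [
    '< a href = "link one", title = "first title">',
    'a line without any link',
    '< a href = "link2", title = "tltle2">',
]
print(extract_links(sample))
</code></pre>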
<p>Otherwise:</p>
<pre><code>with open("data.html", "r") as msg:
    data = msg.readlines()

div_write = False   # True while we are inside a div block
wdata = []          # entries collected for the current div
odata = []          # one list of entries per div

for line in data:
    # Opening tag: remember the class name and start collecting
    if '<div class =' in line:
        class_name = line.split("<div class =")[1].split(">")[0].strip()
        div_write = True
    # Closing tag: store the collected entries and reset
    if "</div>" in line and div_write:
        odata.append(wdata)
        wdata = []
        div_write = False
    # Link line inside a div: extract href and title
    if div_write and "< a href" in line:
        href = line.strip().split("< a href =")[1].split(",")[0].strip()
        title = line.strip().split("title =")[1].split(">")[0].strip()
        wdata.append(class_name+" "+href+" "+title)

# Write one entry per line, with a blank line between divs
with open("out.dat", "w") as msg:
    for wdata in odata:
        msg.write("\n".join(wdata)+"\n\n")
</code></pre>
<p>This way, you can save a file that keeps track of both the link information and the section (div class) names.</p>
<p>Output:</p>
<pre><code>"p-list-sec" "link1" "tltle1"
"p-list-sec" "link2" "tltle2"
"p-list-sec" "link3" "tltle3"
"p-list-sec" "link1" "tltle1"
"p-list-sec" "link2" "tltle2"
"p-list-sec" "link3" "tltle3"
"p-list-sec" "link1" "tltle1"
"p-list-sec" "link2" "tltle2"
"p-list-sec" "link3" "tltle3"
</code></pre>
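<p>If you later need to read the saved file back, something like the following might work. It is a minimal sketch under two assumptions: every line holds three double-quoted fields as in the output above, and groups are separated by blank lines. The function name <code>parse_groups</code> and the sample text are hypothetical; <code>shlex.split</code> from the standard library strips the quotes.</p>
<pre><code>import shlex

def parse_groups(text):
    """Split saved text into groups (one per div); each entry is [class, href, title]."""
    groups = []
    current = []
    for line in text.splitlines():
        if line.strip():
            # shlex.split removes the surrounding double quotes from each field
            current.append(shlex.split(line))
        elif current:
            # A blank line closes the current group
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups

# Hypothetical sample in the assumed out.dat layout
sample = '"p-list-sec" "link1" "tltle1"\n"p-list-sec" "link2" "tltle2"\n\n"p-list-sec" "link3" "tltle3"\n\n'
groups = parse_groups(sample)
print(groups)
</code></pre>
<p>To use it on the real file, pass in the result of <code>open("out.dat").read()</code> instead of <code>sample</code>.</p>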