#!/usr/bin/python3
import urllib.request
import quopri
import lxml.html
# actual test fragments are here
raw_url = 'https://gist.github.com/Supermathie/7866658/raw/80e4abd4226b916a54b224677af7fda881d0937f/sample+1'
raw_url_no_sig = 'https://gist.github.com/Supermathie/7866658/raw/df354d6b8f3176c3d8bdb89b2961bb0ccc78520c/sample+2'
def get_divs(url):
    # Fetch the raw message body and undo its quoted-printable encoding.
    email_body_raw = urllib.request.urlopen(url).read()
    email_body = quopri.decodestring(email_body_raw)
    email_xml = lxml.html.document_fromstring(email_body)
    # Select every <div> sibling that precedes the signature placeholder.
    email_divs = email_xml.xpath('//div[@id="_signaturePlaceholder"]/preceding-sibling::div')
    return email_divs
print('\n'.join([str(node.text_content() or "") for node in get_divs(raw_url)]))
print('\n'.join([str(node.text_content() or "") for node in get_divs(raw_url_no_sig)]))
For both test cases, this prints:
Let's remember that the information in the article was filtered through no less than two people who don't fully speak tech. I think I can translate it back:
«The FBI crafted a custom piece of malware targeting Mo, designed to snoop his activities . A link was emailed to Mo in a spear phishing attack in an attempt to get hin to download and install the malware from the FBI's monitored servers.
The attempt failed; the software was downloaded but never executed in a manner enabling the software to send back information to the FBI.»
Nothing too special. I wonder if Mo had the balls to submit the software to Sophos etc. for malware analysis. :)
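The decoding step can be checked in isolation, without fetching anything. The following is a minimal sketch using only the standard library; the `encoded` fragment is a made-up example, not one of the gist samples. It shows that `quopri.decodestring` turns `=3D` back into `=`, restores escaped bytes, and removes soft line breaks (a `=` at the end of a line).

```python
import quopri

# Hypothetical quoted-printable fragment (not taken from the gist samples):
# "=3D" encodes "=", "=C3=A9" encodes the UTF-8 bytes of "é", and a "="
# at the end of a line is a soft line break that decoding removes.
encoded = b'<div id=3D"_signaturePlaceholder">caf=C3=A9, a lo=\nng line</div>'

decoded = quopri.decodestring(encoded)
print(decoded.decode("utf-8"))
```

Without this step the attribute would read `id=3D"_signaturePlaceholder"`, and an XPath test on `@id` would probably not match.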
This extracts the text from the HTML using Python and XPath.
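The XPath step can also be tried offline. This is a small sketch, assuming `lxml` is installed; the HTML below is invented for illustration, and only the `_signaturePlaceholder` id comes from the original code. The `preceding-sibling::div` axis keeps the `<div>` elements that come before the placeholder and share its parent, so anything after it is left out.

```python
import lxml.html

# Hypothetical message body: only the "_signaturePlaceholder" id mirrors
# the original code, the surrounding text is made up for illustration.
html = b"""<html><body>
<div>First line of the reply.</div>
<div>Second line of the reply.</div>
<div id="_signaturePlaceholder"></div>
<div>Sent from my phone</div>
</body></html>"""

doc = lxml.html.document_fromstring(html)

# Same expression as in get_divs(): sibling divs before the placeholder.
divs = doc.xpath('//div[@id="_signaturePlaceholder"]/preceding-sibling::div')
texts = [div.text_content() for div in divs]
print('\n'.join(texts))
```

If a document contains no element with that id, the expression matches nothing and `divs` is an empty list, so callers may want to handle that case separately.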