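I am writing a series of scripts that pull URLs from a database and use the textstat package to calculate each page's readability from a set of predefined calculations. The function below takes a URL (from CouchDB), calculates the defined readability scores, and then saves the scores back to the same CouchDB document.
The problem I am having is with error handling. For example, the Flesch Reading Ease score needs a count of the total number of sentences on the page. If that count comes back as zero, an exception is raised. Is there a way to catch this exception, save a note about it in the database, and move on to the next URL in the list? Can I do this in the function below (preferred), or do I need to edit the package itself?
I know this has been asked before. If you know of a question that answers mine, please point me in that direction. My searching has turned up nothing so far. Thanks in advance.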
# Imports this snippet relies on (write_to_page is the asker's own helper, defined elsewhere)
import json
import sys
import time
import urllib.error
import urllib.request

import textstat
from bs4 import BeautifulSoup
from readability import Document

def get_readability_data(db, url, doc_id, rank, index):
    readability_data = {}
    readability_data['url'] = url
    readability_data['rank'] = rank
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
    headers = { 'User-Agent' : user_agent }
    try:
        # pass the headers so the User-Agent defined above is actually sent
        req = urllib.request.Request(url, headers=headers)
        response = urllib.request.urlopen(req)
        content = response.read()
        readable_article = Document(content).summary()
        soup = BeautifulSoup(readable_article, "lxml")
        text = soup.body.get_text()
        try:
            readability_data['flesch_reading_ease'] = textstat.flesch_reading_ease(text)
            readability_data['smog_index'] = textstat.smog_index(text)
            readability_data['flesch_kincaid_grade'] = textstat.flesch_kincaid_grade(text)
            readability_data['coleman_liau'] = textstat.coleman_liau_index(text)
            readability_data['automated_readability_index'] = textstat.automated_readability_index(text)
            readability_data['dale_chall_score'] = textstat.dale_chall_readability_score(text)
            readability_data['linear_write_formula'] = textstat.linsear_write_formula(text)
            readability_data['gunning_fog'] = textstat.gunning_fog(text)
            readability_data['total_words'] = textstat.lexicon_count(text)
            readability_data['difficult_words'] = textstat.difficult_words(text)
            readability_data['syllables'] = textstat.syllable_count(text)
            readability_data['sentences'] = textstat.sentence_count(text)
            readability_data['readability_consensus'] = textstat.text_standard(text)
            readability_data['readability_scores_date'] = time.strftime("%a %b %d %H:%M:%S %Y")
            # use the doc_id to make sure we're saving this in the appropriate place
            readability = json.dumps(readability_data, sort_keys=True, indent=4 * ' ')
            doc = db.get(doc_id)
            data = json.loads(readability)
            doc['search_details']['search_details'][index]['readability'] = data
            #print(doc['search_details']['search_details'][index])
            db.save(doc)
            time.sleep(.5)
        except: # catch *all* exceptions
            e = sys.exc_info()[0]
            write_to_page( "<p>---ERROR---: %s</p>" % e )
    except urllib.error.HTTPError as err:
        print(err.code)
This is the error I receive:
[error output missing from the source]
This is the code that calls the function:
if __name__ == '__main__':
    db = connect_to_db(parse_args())
    print("~~~~~~~~~~" + " GETTING IDs " + "~~~~~~~~~~")
    ids = get_ids(db)
    for i in ids:
        details = get_urls(db, i)
        for d in details:
            get_readability_data(db, d['url'], d['id'], d['rank'], d['index'])
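In general, it is good practice to keep try: except: blocks as small as possible. I would wrap your textstat functions in some kind of decorator that catches the expected exception and returns both the function output and the exception that was caught.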
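A minimal sketch of that decorator idea follows. It is not the answer's original example: the names `catch_errors`, `words_per_sentence`, and `score_text` are hypothetical, `words_per_sentence` is only a stand-in for a textstat metric, and `ZeroDivisionError` is an assumption about what textstat raises on a page with no sentences.

```python
import functools


def catch_errors(*expected):
    """Build a decorator whose wrapped function returns (result, error)
    instead of raising one of the expected exception types."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs), None
            except expected as exc:
                # Only the listed exception types are swallowed;
                # anything unexpected still propagates.
                return None, exc
        return wrapper
    return decorator


# Hypothetical stand-in for a textstat metric: it divides by the
# sentence count, so empty text triggers ZeroDivisionError.
@catch_errors(ZeroDivisionError)
def words_per_sentence(text):
    sentences = [s for s in text.split(".") if s.strip()]
    return len(text.split()) / len(sentences)


def score_text(text, metrics):
    """Run every metric, collecting scores and error notes separately."""
    scores, errors = {}, {}
    for name, metric in metrics.items():
        value, error = metric(text)
        if error is None:
            scores[name] = value
        else:
            errors[name] = repr(error)  # a note that can be saved to the document
    return scores, errors


if __name__ == "__main__":
    metrics = {"words_per_sentence": words_per_sentence}
    print(score_text("One two three. Four five.", metrics))
    print(score_text("", metrics))
```

A real metric could be wrapped the same way, for instance `catch_errors(ZeroDivisionError)(textstat.flesch_reading_ease)`, provided that is the exception actually raised. The `errors` dictionary could then be stored next to the scores in the CouchDB document, so a single failing metric neither aborts the others nor stops the loop over URLs.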