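I am trying to run the Azure OCR API on more than 6,000 images. Unfortunately, the code stops after about 90 images.
Documentation:
Input: 6,000+ images (.png)
Expected output:
Error message: ConnectionError: HTTPSConnectionPool(host='westcentralus.api.cognitive.microsoft.com', port=443): Max retries exceeded with url: /vision/v2.0/ocr?language=unk&detectOrientation=true (Caused by NewConnectionError(': Failed to establish a new connection: [Errno -2] Name or service not known',))
I added a 60-second delay after every 10 images, which should ideally satisfy the quota of 20 transactions per minute.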
import warnings
warnings.filterwarnings("ignore")
import glob
import os
import requests
import pandas as pd
import time
# Replace the value of subscription_key with your subscription key.
subscription_key = "{key}"
assert subscription_key
# Replace the value of vision_base_url (not necessary for trial version)
vision_base_url="https://westcentralus.api.cognitive.microsoft.com/vision/v2.0/"
analyze_url = vision_base_url + "ocr"
# Initializing Source and Output Directories
source_directory = glob.glob('folder/with/6000/images/*.png')
output_directory_textFiles = 'folder/for/saving/6000/textFiles/'
output_directory_JSONFiles = 'folder/for/saving/6000/JSONFiles/'
if not os.path.exists(output_directory_textFiles):
    os.makedirs(output_directory_textFiles)
if not os.path.exists(output_directory_JSONFiles):
    os.makedirs(output_directory_JSONFiles)
# Define Function for Extracting Text
def extract_text(image_path):
    # Read the image into a byte array
    with open(image_path, "rb") as image_file:
        image_data = image_file.read()
    headers = {'Ocp-Apim-Subscription-Key': subscription_key, 'Content-Type': 'application/octet-stream'}
    params = {'language': 'unk', 'detectOrientation': 'true'}
    response = requests.post(analyze_url, headers=headers, params=params, data=image_data)
    analysis = response.json()
    # Extract the word bounding boxes and text.
    line_infos = [region["lines"] for region in analysis["regions"]]
    word_infos = []
    for line in line_infos:
        for word_metadata in line:
            for word_info in word_metadata["words"]:
                word_infos.append(word_info)
    return word_infos
# Generating Text and JSON Files
counter = 0
for image in sorted(source_directory):
    counter += 1
    print(r'Processing %d %s' %(counter, image))
    word_infos = extract_text(image)
    filename = image.split('/')[-1].replace('.png', '')
    if len(word_infos) != 0:
        bboxOutput = pd.DataFrame(word_infos)
        bboxOutput[['x','y', 'width','height']] = bboxOutput['boundingBox'].str.split(',',expand=True)
        bboxOutput = bboxOutput.drop(['boundingBox'], axis=1)
        textFile = bboxOutput['text']
        textFile = textFile.to_csv(r'{}/{}.txt'.format(output_directory_textFiles, filename), header = False, index = None, sep = ',')
        jsonFile = bboxOutput.to_json(orient = 'records')
        with open(r'{}/{}.txt'.format(output_directory_JSONFiles, filename), 'w') as f:
            f.write(jsonFile)
    else:
        word_infos = pd.DataFrame(word_infos)
        textFile = word_infos.to_csv(r'{}/{}.txt'.format(output_directory_textFiles, filename), header = False, index = None, sep = ',')
        jsonFile = word_infos.to_json(orient = 'records')
        with open(r'{}/{}.txt'.format(output_directory_JSONFiles, filename), 'w') as f:
            f.write(jsonFile)
    # Pause for 60 seconds after every 10 images
    if (counter % 10) == 0:
        time.sleep(60)
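I suggest changing your time.sleep call to wait 3 or 4 seconds after each image, rather than 60 seconds after every 10 images. Although the free tier limit is 20 calls per minute, the paid tier limit is 10 per second, so you may be hitting a limit that confuses the throttling mechanism (your code may be sending 10 images in less than a second).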
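The pacing idea above could be sketched roughly as follows. This is a minimal illustration, not the API's official throttling approach: `process_paced`, `CALLS_PER_MINUTE`, and `MIN_INTERVAL` are hypothetical names introduced here, and the 20-calls-per-minute figure is taken from the question. The helper spaces calls evenly instead of sending bursts of 10.

```python
import time

# Assumption: the free tier allows 20 calls per minute, as stated in the question.
CALLS_PER_MINUTE = 20
MIN_INTERVAL = 60.0 / CALLS_PER_MINUTE  # seconds between the start of two calls


def process_paced(items, handler, min_interval=MIN_INTERVAL,
                  clock=time.monotonic, sleep=time.sleep):
    """Call handler(item) for each item, starting calls at least
    min_interval seconds apart. clock and sleep are injectable so the
    pacing can be tested without actually waiting."""
    results = []
    last_start = None
    for item in items:
        if last_start is not None:
            # Wait only for whatever part of the interval has not yet elapsed.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results.append(handler(item))
    return results
```

With the code from the question, this might be used as `process_paced(sorted(source_directory), extract_text)`, replacing the `counter % 10` check. Passing a slightly larger `min_interval` (for example 4 seconds) leaves some headroom below the quota.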
Hope that helps!