Decoding URL content with Python

Posted 2024-05-17 11:58:29


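There is a website that hosts Greek and Hebrew texts alongside Russian Bible translations. Here is the Python 2 code I use to fetch the content of a URL: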

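# -*- coding: utf-8 -*-
# Python 2: fetch a page, decode it as UTF-8, and save it to url.txt.
import io
import urllib

# link = "http://manuscript-bible.ru/OT/Gen1.htm"
link = "http://manuscript-bible.ru/S/H/psa117.htm"

f2 = urllib.urlopen(link)            # Python 2 API (urllib.request.urlopen in Python 3)
myfile = f2.read().decode("utf-8")   # raw bytes -> unicode text
f2.close()

# Write the decoded text back out as UTF-8.
with io.open('url.txt', 'w', encoding='utf8') as f1:
    f1.write(myfile)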
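For readers on Python 3, where `urllib.urlopen` no longer exists, a roughly equivalent sketch using `urllib.request.urlopen` might look like the following. The function names `fetch_bytes` and `save_decoded` and the 10-second timeout are assumptions made for illustration; the network call is kept in its own function so that the decoding step can be tried on any byte string.

```python
import io
from urllib.request import urlopen


def fetch_bytes(url, timeout=10):
    """Download a URL and return the raw, undecoded response body."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def save_decoded(raw, path, encoding="utf-8"):
    """Decode raw bytes with the given codec and save the text as UTF-8."""
    text = raw.decode(encoding)
    with io.open(path, "w", encoding="utf-8") as out:
        out.write(text)
    return text
```

Calling `save_decoded(fetch_bytes(link), 'url.txt')` would then mirror the Python 2 script above.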

So for Hebrew and Russian, f2.read().decode("utf-8") lets me get the content of http://manuscript-bible.ru/S/H/psa117.htm, for example:

>1</strong>
<a target="_blank" href="../S/h19.htm#1984" title="hалэлу">הַֽלְלוּ</a>
<a target="_blank" href="../S/h08.htm#853" title="ʼэт">אֶת</a>

For Russian, I can also get the content of http://manuscript-bible.ru/RSV/22_116.htm with myfile = f2.read().decode('cp1251'). Here is one sentence from that .htm:

<b>1</b> Хвалите <01984> Господа <03068>, все народы <01471>, прославляйте <07623> Его, все племена <0523>;
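The two pages above decode with different codecs (UTF-8 for one, cp1251 for the other). A minimal Python 3 sketch of trying candidate encodings in order is given here; `decode_with_fallback` and its default candidate list are assumptions for illustration, not something the site documents. UTF-8 is tried first because it rejects most non-UTF-8 input, whereas cp1251 accepts almost any byte sequence.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1251")):
    """Return (text, encoding) for the first codec that decodes raw cleanly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this codec does not fit; try the next candidate
    raise ValueError("none of the candidate encodings matched: %r" % (encodings,))


# Sample byte strings built locally, standing in for downloaded pages.
hebrew_page = "אֶת".encode("utf-8")
russian_page = "Хвалите".encode("cp1251")
print(decode_with_fallback(hebrew_page)[1])
print(decode_with_fallback(russian_page)[1])
```

A successful decode only shows that the bytes are valid in that codec, so the result is still worth checking by eye.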

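The problem is with this URL, http://manuscript-bible.ru/OT/Ps116.htm, which cannot be decoded. Judging from the source of the .htm, there seem to be two Gen1.htm with the same .htm extension. One of them contains content like this: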

anot=0;bn=22;cn=116;variants="";cr=new Array();parsing="";a=new Array("","","","","","","","<a name=116>","","","","<br>","","","","<br>","CALMOI","calmoi","6163","\u041F\u0421\u0410\u041B\u041C\u042B","\u0413\u043B\u0430\u0432\u0430 116","","","","","","","<br>","","","","<br>","","","","","1","","","","Allhlouia.","allhlouia","6C61","\u0410\u043B\u043B\u0438\u043B\u0443\u0439\u044F.","A\u042Ene\u042Dte","a\u042Ene\u042Dte","6961","\u0425\u0432\u0430\u043B\u0438&#769\u0442\u0435","t\u0442n","t\u0441n","6F74","-","k\u0436rion,","k\u0436rion","756B","\u0413\u043E&#769\u0441\u043F\u043E\u0434\u0430,"

It seems to be UTF-16 encoded, because pasting it into https://www.branah.com/unicode-converter gives:

anot=0;bn=22;cn=116;variants="";cr=new Array();parsing="";a=new Array("","","","","","","","<a name=116>","","","","<br>","","","","<br>","CALMOI","calmoi","6163","ПСАЛМЫ","Глава 116","","","","","","","<br>","","","","<br>","","","","","1","","","","Allhlouia.","allhlouia","6C61","Аллилуйя.","AЮneЭte","aЮneЭte","6961","Хвали&#769те","tтn","tсn","6F74","-","kжrion,","kжrion","756B","Го&#769спода,"
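The `\uXXXX` sequences in the sample look like JavaScript-style escapes written out in plain ASCII rather than raw UTF-16 bytes, so one possible approach is to replace each escape with its character once the page has been read as text. The Python 3 sketch below does that with a regular expression. `unescape_js_unicode` is a hypothetical helper name, and it handles only four-digit `\uXXXX` escapes (no surrogate pairs), leaving HTML entities such as `&#769` untouched.

```python
import re

# Matches a backslash, the letter u, and exactly four hexadecimal digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_js_unicode(text):
    """Replace every \\uXXXX escape in text with the character it names."""
    return _U_ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), text)


# A fragment copied from the page source quoted above.
sample = r'"\u041F\u0421\u0410\u041B\u041C\u042B","A\u042Ene\u042Dte"'
print(unescape_js_unicode(sample))
```

A regular expression is used here instead of the `unicode_escape` codec because that codec would also reinterpret other backslash sequences and any non-ASCII bytes already present in the text.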

The question is: how can I get the decoded content of the URL http://manuscript-bible.ru/OT/Ps116.htm using Python code like the above?

