Unable to download some texts from the website


I am trying to download all the last statements from the Death Row website. The basic outline is this:

1. Information from the site is imported into an sqlite database, prison.sqlite.
2. Based on the names in that table, I generate a unique URL for each name to fetch their last statement.
3. The program checks every generated URL; if the URL is OK, it looks for the last statement. The statement is then downloaded into the database prison.sqlite (still to do).

Here is my code:

import sqlite3
import csv
import re
import urllib2
from urllib2 import Request, urlopen, URLError
from BeautifulSoup import BeautifulSoup
import requests
import string
URLS = ["http://www.tdcj.state.tx.us/death_row/dr_info/hernandezramontorreslast.html",
    "http://www.tdcj.state.tx.us/death_row/dr_info/garciafrankmlast.html",
    "http://www.tdcj.state.tx.us/death_row/dr_info/martinezdavidlast999173.html",
    "http://www.tdcj.state.tx.us/death_row/dr_info/moselydaroycelast.html",
    "http://www.tdcj.state.tx.us/death_row/dr_info/martinezdavidlast999288.html",
    "http://www.tdcj.state.tx.us/death_row/dr_info/hernandezadophlast.html",
    "http://www.tdcj.state.tx.us/death_row/dr_info/carterrobertanthonylast.html",
    "http://www.tdcj.state.tx.us/death_row/dr_info/livingstoncharleslast.html",
    "http://www.tdcj.state.tx.us/death_row/dr_info/gentrykennethlast.html",
    "http://www.tdcj.state.tx.us/death_row/dr_info/gentrykennethlast.html",
    "http://www.tdcj.state.tx.us/death_row/dr_info/wilkersonrichardlast.html",
    "http://www.tdcj.state.tx.us/death_row/dr_info/hererraleonellast.html",]

conn = sqlite3.connect('prison.sqlite')
conn.text_factory = str
cur = conn.cursor()

cur.execute("DROP TABLE IF EXISTS prison")
cur.execute("CREATE TABLE Prison ( Execution text, link1 text, Statements text, LastName text, Firstname text, TDCJNumber text, Age integer, date text, race text, county text)")
conn.commit()


csvfile = open("prisonfile.csv","rb")
creader = csv.reader(csvfile, delimiter = ",")
for t in creader:
    cur.execute('INSERT INTO  Prison VALUES (?,?,?,?,?,?,?,?,?,?)', t, )

for column in cur.execute("SELECT LastName, Firstname FROM prison"):
    lastname = column[0]
    firstname = column[1]
    name = lastname+firstname
    CleanName = name.translate(None, ",.!-@'#$" "")
    CleanName = CleanName.replace(" ", "")
    CleanName = CleanName.replace("III","")
    CleanName = re.sub("Sr","",CleanName)
    CleanName = re.sub("Jr","",CleanName)
    CleanName = CleanName.lower()
    Baseurl = "http://www.tdcj.state.tx.us/death_row/dr_info/"
    Link = Baseurl+CleanName+"last.html"
    URLS.append(Link)


    for Link in URLS:
        try:
            r = requests.get(Link)
            r.raise_for_status()
            print "URL OK", Link
            document = urllib2.urlopen(Link)
            html = document.read()
            soup = BeautifulSoup(html)
            Statement = soup.find(text="Last Statement:").findNext('p').contents[0]
            print Statement
            continue
        except requests.exceptions.HTTPError as err:
            print err
            print "Offender has made no statement.", Link
            #cur.execute("INSERT OR IGNORE INTO prison(Statements) VALUES(?)"), (Statement, )

csvfile.close()
conn.commit()
conn.close()

When I run the program I get:

C:\python>prison.py
URL OK http://www.tdcj.state.tx.us/death_row/dr_info/hernandezramontorreslast.html
Can you hear me? Did I ever tell you, you have dad's eyes? I've noticed that in the last couple of days. I'm sorry for putting you through all this. Tell everyone I love them. It was good seeing the kids. I love them all; tell mom, everybody. I am very sorry for all of the pain. Tell Brenda I love her. To everybody back on the row, I know you're going through a lot over there. Keep fighting, don't give up everybody.
URL OK http://www.tdcj.state.tx.us/death_row/dr_info/garciafrankmlast.html
Thank you, Jesus Christ. Thank you for your blessing. You are above the president. And know it is you, Jesus Christ, that is performing this miracle in my life. Hallelujah, Holy, Holy, Holy. For this reason I was born and raised. Thank you for this, my God is a God of Salvation. Only through you, Jesus Christ, people will see that you're still on the throne. Hallelujah, Holy, Holy, Holy. I invoke Your name. Thank you, Yahweh, thank you Jesus Christ. Hallelujah, Amen. Thank you, Warden.
URL OK http://www.tdcj.state.tx.us/death_row/dr_info/martinezdavidlast999173.html
Traceback (most recent call last):
  File "C:\python\prison.py", line 60, in <module>
    Statement = soup.find(text="Last Statement:").findNext('p').contents[0]
AttributeError: 'NoneType' object has no attribute 'findNext'

The first two statements come through fine, but after that the program crashes. Looking at the page source for the URL where the error occurs, I see (only the relevant part shown):

<div class="return_to_div"></div>
<h1>Offender Information</h1>
<h2>Last Statement</h2>
<p class="text_bold">Date of Execution:</p>
<p> February 4, 2009</p>
<p class="text_bold"> Offender:</p>
<p> Martinez, David</p>
<p class="text_bold"> Last Statement:</p>
<p> Yes, nothing I can say can change the past. I am asking for forgiveness. Saying sorry is not going to change anything. I hope one day you can find peace. I am sorry for all of the pain that I have caused you for all those years. There is nothing else I can say, that can help you. Mija, I love you. Sis, Cynthia, and Sandy, keep on going and it will be O.K. I am sorry to put you through this as well. I can't change the past. I hope you find peace and know that I love you. I am sorry. I am sorry and I can't change it.  </p>

What is causing this problem? Do I have to change something in this line:

Statement = soup.find(text="Last Statement:").findNext('p').contents[0]
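Looking at that page source, the <p class="text_bold"> tag contains " Last Statement:" with a leading space, so my guess is that the exact text="Last Statement:" match misses it and find() returns None. One thing I was planning to try (not sure yet whether it is the right fix) is matching the label with a regular expression and guarding against a missing match, using the URL from the traceback as a test:

import re
import requests
from BeautifulSoup import BeautifulSoup

Link = "http://www.tdcj.state.tx.us/death_row/dr_info/martinezdavidlast999173.html"
html = requests.get(Link).content
soup = BeautifulSoup(html)
# match the label with a regex so leading/trailing whitespace cannot break the lookup
label = soup.find(text=re.compile("Last Statement:"))
if label is None:
    Statement = None   # treat this as "offender has made no statement"
else:
    Statement = label.findNext('p').contents[0]
print Statement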

Feel free to share improvements to my code. For now I just want to get everything working before I make it more robust.

For those wondering why there are hard-coded URLs in the list: this is due to some quirks on the death row website. Sometimes the URL does not follow the [lastname][firstname]last.html pattern, so I am adding those by hand for now.
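In case it helps to see where this is going: for the "still to do" part of step 3 (writing each statement back into prison.sqlite), this is roughly what I have in mind instead of the INSERT OR IGNORE line I have commented out, keyed on the name columns. It is not wired into the loop yet:

import sqlite3

def save_statement(lastname, firstname, statement):
    # write the downloaded last statement back to the matching offender row
    conn = sqlite3.connect('prison.sqlite')
    cur = conn.cursor()
    cur.execute("UPDATE Prison SET Statements = ? WHERE LastName = ? AND Firstname = ?",
                (statement, lastname, firstname))
    conn.commit()
    conn.close()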

