From a website's dynamic result list into postgres with Python

Posted 2024-09-29 21:30:33


Good morning,

I would like to pull the data from this site into my database. I am trying to get the date, creator, relevance, description, subject, audience and access rights of every search result and insert them into my postgres database. The problem is that the description is sometimes missing, so one result sometimes has 6 fields and sometimes 7.

So my question is: how can I produce an empty value when the description is not there? Any suggestions are welcome!

My script so far is below. It fills the database as long as every result always has 7 fields (keep in mind that I only tested with 3 fields):

import urllib.parse
import urllib.request
import re
import sys
import psycopg2 as dbapi

# build the POST request for the public search page
url = 'https://easy.dans.knaw.nl/ui/'
values = {'wicket:bookmarkablePage': ':nl.knaw.dans.easy.web.search.pages.PublicSearchResultPage',
          'q': 'opgraving'}
data = urllib.parse.urlencode(values)
data = data.encode('utf-8')
headers = {}
headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'
req = urllib.request.Request(url, data, headers=headers)
resp = urllib.request.urlopen(req)
respData = resp.read()


# dump every <dd> field found in the response to a text file
saveRecord = open('C:/Users/berend/Desktop/record.txt', 'w')
record = re.findall(r'<dd>(.*?)</dd>', str(respData))
for item in record:
    saveRecord.write("%s\n" % item)
saveRecord.close()

# strip leftover HTML tags and escape sequences from each extracted line
fin = open("C:/Users/berend/Desktop/record.txt", 'r')
fit = open("C:/Users/berend/Desktop/record_schoon.txt", 'w')
delete_list = ['</em>', '[', ']', '<em>', '</span>', '<span>', '\\n']
for line in fin:
    for word in delete_list:
        line = line.replace(word, "")
    fit.write(line)
fin.close()
fit.close()

# read the cleaned records back in as a list of lines
open_record = open('C:/Users/berend/Desktop/record_schoon.txt', 'r')
content = list(open_record)
print(len(content))
open_record.close()

n = 3
for i in range(0, len(content), 3):
    q = content[i:i+n]
    con = dbapi.connect(database='import', user='postgres', password='xxx')
    cur = con.cursor()
    cur.execute("INSERT INTO import VALUES (%s,%s,%s)", q)
    con.commit()
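
(Side note: the loop above opens a new database connection on every iteration. A minimal sketch of the same insert using a single connection and one executemany call, assuming the import table really has three text columns as in this test, would look like this:)

import psycopg2 as dbapi

# sketch only, not tested: reuse one connection and insert all rows at once
con = dbapi.connect(database='import', user='postgres', password='xxx')
cur = con.cursor()
n = 3
rows = [[line.strip() for line in content[i:i+n]]   # 'content' as read above
        for i in range(0, len(content), n)]
cur.executemany("INSERT INTO import VALUES (%s,%s,%s)", rows)
con.commit()
con.close()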

The first 3 results:

2000
Groenewoudt, B.J.; Deeben, J.H.C.; Velde, H.M. van der
100% relevant
Na verkennend onderzoek in 1996 en een grootschalige opgraving met uitgebreid bodemkundig
opgraving
Archaeology
Open (registered users)
2001-09
Peters, F.J.C.; Peeters, J.H.M.
100% relevant
opgraving
Archaeology
Open (registered users)
2008
Jacobs, E.; Burnier, C.Y.
100% relevant
OPGRAVING
Archaeology
Open (registered users)

1 Answer

Answered 2024-09-29 21:30:33

In this case I would use the pandas and sqlalchemy libraries, which make this much more efficient. I'm suggesting a solution with additional packages because you did not say that you can't use them.

Instead of this:

n = 3
for i in range(0, len(content), 3):
    q = content[i:i+n]
    con = dbapi.connect(database='import', user='postgres', password='xxx')
    cur = con.cursor()
    cur.execute("INSERT INTO import VALUES (%s,%s,%s)", q)
    con.commit()

use something like this:

import pandas as pd
from sqlalchemy import create_engine

# create a connection engine using sqlalchemy
engine = create_engine('postgresql+psycopg2://postgres:xxx@localhost/import', echo=False)

# read the results file into a pandas DataFrame
df = pd.read_csv('C:/Users/berend/Desktop/record_schoon.txt', delimiter='\t')  # or whatever your delimiter is
dfFill = df.fillna("")  # replace missing values ('NaN') with an empty string
dfFill.to_sql("tablename", engine, if_exists="append")  # change "tablename" to the name of your table in the import database
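
One caveat: fillna can only blank out cells that pandas actually reads as missing. If the description <dd> is simply absent from the page, the flat list of <dd> matches shifts up by one and the columns no longer line up, so each search result has to be parsed on its own first. A rough sketch of that idea is below; the result-container pattern and the position of the description field are guesses that would have to be checked against the real page source:

import re

# sketch only: split the page into per-result blocks, then pad short records
html = str(respData)                      # the response read in the question's script
# hypothetical container pattern; inspect the page source for the real one
results = re.findall(r'<li class="searchHit">(.*?)</li>', html, re.S)

rows = []
for block in results:
    fields = [f.strip() for f in re.findall(r'<dd>(.*?)</dd>', block, re.S)]
    if len(fields) == 6:          # description missing for this result
        fields.insert(3, '')      # assumed position of the description field
    rows.append(fields)

Writing rows out tab-separated, one result per line, then matches the read_csv(delimiter='\t') call above, and a missing description ends up as an empty column instead of shifting the remaining fields.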

HTH
