Importing a large XML file into Neo4j with py2neo

from xml.dom import minidom
from py2neo import Graph, Node, Relationship, authenticate
from py2neo.packages.httpstream import http
http.socket_timeout = 9999
import codecs

authenticate("localhost:7474", "neo4j", "******")
graph = Graph("http://localhost:7474/db/data/")

xml_file = codecs.open("User_profilesL2T1.xml", "r", encoding="latin-1")
xml_doc = minidom.parseString(codecs.encode(xml_file.read(), "utf-8"))
#xml_doc = minidom.parse(xml_file)
persons = xml_doc.getElementsByTagName('user')
label1 = "USER"

# Adding Nodes
for person in persons:
    if person.getElementsByTagName("id")[0].firstChild:
        Id_User = person.getElementsByTagName("id")[0].firstChild.data
    else:
        # The original assigned Name here, which left Id_User undefined
        Id_User = "NO ID"
    print("******************************USER***************************************")
    print(Id_User)
    print("*************************")

    if person.getElementsByTagName("name")[0].firstChild:
        Name = person.getElementsByTagName("name")[0].firstChild.data
    else:
        Name = "NO NAME"
    # print("Name :",Name)
    print("*************************")

    if person.getElementsByTagName("screen_name")[0].firstChild:
        Screen_name = person.getElementsByTagName("screen_name")[0].firstChild.data
    else:
        Screen_name = "NO SCREEN_NAME"
    # print("Screen Name :",Screen_name)
    print("*************************")

    if person.getElementsByTagName("location")[0].firstChild:
        Location = person.getElementsByTagName("location")[0].firstChild.data
    else:
        Location = "NO Location"
    # print("Location :",Location)
    print("*************************")

    if person.getElementsByTagName("description")[0].firstChild:
        Description = person.getElementsByTagName("description")[0].firstChild.data
    else:
        Description = "NO description"
    # print("Description :",Description)
    print("*************************")

    if person.getElementsByTagName("profile_image_url")[0].firstChild:
        Profile_image_url = person.getElementsByTagName("profile_image_url")[0].firstChild.data
    else:
        Profile_image_url = "NO profile_image_url"
    # print("Profile_image_url :",Profile_image_url)
    print("*************************")

    if person.getElementsByTagName("friends_count")[0].firstChild:
        Friends_count = person.getElementsByTagName("friends_count")[0].firstChild.data
    else:
        Friends_count = "NO friends_count"
    # print("Friends_count :",Friends_count)
    print("*************************")

    if person.getElementsByTagName("url")[0].firstChild:
        URL = person.getElementsByTagName("url")[0].firstChild.data
    else:
        URL = "NO URL"
    # print("URL :",URL)

    node1 = Node(label1, ID_USER=Id_User, NAME=Name, SCREEN_NAME=Screen_name,
                 LOCATION=Location, DESCRIPTION=Description,
                 Profile_Image_Url=Profile_image_url,
                 Friends_Count=Friends_count, URL=URL)
    graph.merge(node1)

2 answers

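User

#1 · edited 2024-05-05 21:54:08

If you are importing the data into a new database, you may want to try the import tool: https://neo4j.com/docs/operations-manual/current/#import-tool

In that case, parse the XML file as you do now, but instead of inserting the data into Neo4j through py2neo, just write a CSV file and then call the import tool.

One possible approach is shown below: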

import csv
from xml.dom import minidom

def getAttribute(node,attribute,default=None):
    attr = node.getElementsByTagName(attribute)[0]
    return attr.firstChild.data if attr.firstChild else default

xml_doc = minidom.parse(open("users.xml"))
persons = xml_doc.getElementsByTagName('user')

users = []
attrs = ['name','screen_name','location','description','profile_image_url','friends_count','url']

mapping = {'user_id': 'user_id:ID(User)',
           'name': 'name:string',
           'screen_name': 'screen_name:string',
           'location': 'location:string',
           'description': 'description:string',
           'profile_image_url': 'profile_image_url:string',
           'friends_count': 'friends_count:int',
           'url': 'url:string'}

# newline='' is what the csv module expects on Python 3
with open('users.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=list(mapping.values()))
    writer.writeheader()
    for person in persons:
        user = {mapping[attr]: getAttribute(person, attr) for attr in attrs}
        user[mapping['user_id']] = getAttribute(person, 'id')

        writer.writerow(user)

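After converting the XML to a CSV file, run the import tool on it. The exact command line depends on your Neo4j version, so check the import tool documentation for the options that apply.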

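I suppose you also need to create relationships between the nodes(?). Read the import tool documentation, then call the import tool with CSV files for both nodes and relationships.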
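For the relationships mentioned above, the import tool reads a separate CSV whose header names the start node, the end node, and the relationship type. A minimal sketch, assuming a hypothetical list of (start_id, end_id) pairs and a hypothetical FOLLOWS relationship type, neither of which appears in the original data:

```python
import csv


def write_relationships(pairs, path, rel_type="FOLLOWS"):
    """Write (start_id, end_id) pairs as a relationship CSV for the import tool.

    `pairs` and the FOLLOWS type are hypothetical. The ID space name `User`
    matches the `user_id:ID(User)` column used in the users CSV.
    """
    with open(path, "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        # Header fields understood by the import tool for relationship files
        writer.writerow([":START_ID(User)", ":END_ID(User)", ":TYPE"])
        for start_id, end_id in pairs:
            writer.writerow([start_id, end_id, rel_type])
```

The IDs in this file have to match the values written to the `user_id:ID(User)` column, since that is how the import tool resolves each end of a relationship.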

User

#2 · edited 2024-05-05 21:54:08

I think you should use a streaming parser; otherwise you may run out of memory even on the Python side.
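One way to stream the file is `xml.etree.ElementTree.iterparse` from the standard library. The sketch below is an illustration rather than the answerer's code: the helper name `iter_users` is made up, and it assumes each field is a direct child of its `<user>` element.

```python
import xml.etree.ElementTree as ET

FIELDS = ["id", "name", "screen_name", "location", "description",
          "profile_image_url", "friends_count", "url"]


def iter_users(source):
    """Yield one dict per <user> element without keeping the whole tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "user":
            # findtext returns None for a missing child, "" for an empty one
            yield {field: elem.findtext(field) for field in FIELDS}
            root.clear()  # drop already processed children to free memory
```

`source` can be a file name or a binary file object, so the generator can feed a CSV writer or a batched insert without holding all users in memory at once.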

I also recommend batching the work into Neo4j transactions, with 10k to 100k updates per transaction.
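The batching itself needs nothing Neo4j-specific. A small sketch, with the hypothetical helper name `batches`, that splits any iterable of rows into lists of a fixed size so that each list can be sent in one transaction:

```python
from itertools import islice


def batches(rows, size=10000):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Because it consumes the input lazily, it also works on a generator that streams users from the XML file.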

Don't store the "NO xxxx" fields; just leave them out, since they only waste space and effort.
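Leaving the placeholders out can be as simple as filtering each row before it is stored. A sketch, assuming rows are plain dictionaries in which a missing value shows up as `None` or an empty string (the helper name `clean_properties` is made up for illustration):

```python
def clean_properties(raw):
    """Return a copy of `raw` without None or empty-string values."""
    return {key: value for key, value in raw.items()
            if value not in (None, "")}
```

Numeric zeros survive the filter, so a `friends_count` of 0 is still stored.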

I don't know how merge(node) works. I suggest creating a unique constraint on :User(userId) and using a Cypher query like this:

UNWIND {data} AS row
MERGE (u:User {userId: row.userId}) ON CREATE SET u += row

where the {data} parameter is a list of dictionaries holding the properties (for example, 10k entries).
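To show the shape of that parameter, here is a hedged sketch that sends one batch through an injected `run` callable. Both `merge_users` and `run` are hypothetical names. With py2neo v3 or later, `graph.run` is assumed to accept a query and a parameter dictionary, and newer Neo4j versions write the parameter as `$data` instead of `{data}`.

```python
# Query in the legacy {data} parameter syntax used in this answer.
# `row` is the UNWIND variable, so it is written without braces.
MERGE_USERS = """
UNWIND {data} AS row
MERGE (u:User {userId: row.userId})
ON CREATE SET u += row
"""


def merge_users(run, rows):
    """Send one batch of property dicts as the `data` parameter.

    `run` is a hypothetical callable taking (query, parameters).
    Returns the number of rows sent.
    """
    rows = list(rows)
    if rows:
        run(MERGE_USERS, {"data": rows})
    return len(rows)
```

Each dictionary needs a `userId` key, because that is the property the MERGE clause matches on.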
