I have two DataFrames, df1 and df2, but when I try to join them, the join does not work. Below are the schema and sample output for each DataFrame.
df1
Out[160]: DataFrame[BibNum: string, CallNumber: string, CheckoutDateTime: string, ItemBarcode: string, ItemCollection: string, ItemType: string]
[Row(BibNum=u'BibNum', CallNumber=u'CallNumber', CheckoutDateTime=u'CheckoutDateTime', ItemBarcode=u'ItemBarcode', ItemCollection=u'ItemCollection', ItemType=u'ItemType'),
Row(BibNum=u'1842225', CallNumber=u'MYSTERY ELKINS1999', CheckoutDateTime=u'05/23/2005 03:20:00 PM', ItemBarcode=u'10035249209', ItemCollection=u'namys', ItemType=u'acbk')]
df2
DataFrame[Author: string, BibNum: string, FloatingItem: string, ISBN: string, ItemCollection: string, ItemCount: string, ItemLocation: string, ItemType: string, PublicationDate: string, Publisher: string, ReportDate: string, Subjects: string, Title: string]
[Row(Author=u'Author', BibNum=u'BibNum', FloatingItem=u'FloatingItem', ISBN=u'ISBN', ItemCollection=u'ItemCollection', ItemCount=u'ItemCount', ItemLocation=u'ItemLocation', ItemType=u'ItemType', PublicationDate=u'PublicationYear', Publisher=u'Publisher', ReportDate=u'ReportDate', Subjects=u'Subjects', Title=u'Title'),
Row(Author=u"O'Ryan| Ellie", BibNum=u'3011076', FloatingItem=u'Floating', ISBN=u'1481425730| 1481425749| 9781481425735| 9781481425742', ItemCollection=u'ncrdr', ItemCount=u'1', ItemLocation=u'qna', ItemType=u'jcbk', PublicationDate=u'2014', Publisher=u'Simon Spotlight|', ReportDate=u'09/01/2017', Subjects=u'Musicians Fiction| Bullfighters Fiction| Best friends Fiction| Friendship Fiction| Adventure and adventurers Fiction', Title=u"A tale of two friends / adapted by Ellie O'Ryan ; illustrated by Tom Caulfield| Frederick Gardner| Megan Petasky| and Allen Tam.")]
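When I try to join the two using the following command:
^{pr2}$
there is no error, but the resulting DataFrame looks like this, with overlapping columns: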
DataFrame[BibNum: string, CallNumber: string, CheckoutDateTime: string, ItemBarcode: string, ItemCollection: string, ItemType: string, Author: string, BibNum: string, FloatingItem: string, ISBN: string, ItemCollection: string, ItemCount: string, ItemLocation: string, ItemType: string, PublicationDate: string, Publisher: string, ReportDate: string, Subjects: string, Title: string]
Finally, once I have df3 (the joined DataFrame), calling df3.take(2) fails with the error: list index out of range.
The result I am after: I want to find out which item location is available by counting the books with the most checkout dates (checkoutDateTime).
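You need to join the DataFrames on their common column; otherwise the result contains two conflicting columns with the same name, one from each DataFrame.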
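You can use an outer join or a left join, depending on what you need. Please do not ask multiple questions about the same problem. You have already asked this at: when trying to join two tables, happening IndexError: list index out of range in pyspark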