使用Python从PowerPivot模型提取原始数据

2024-06-01 12:51:52 发布

您现在位置:Python中文网/ 问答频道 /正文

当我不得不使用Python从PowerPivot模型中读取一些数据时,看似微不足道的任务变成了真正的噩梦。我相信在过去的几天里我已经对此做了很好的研究,但是现在我遇到了困难,希望Python/SSAS/ADO社区能提供一些帮助。

基本上,我只想通过编程方式访问存储在PowerPivot模型中的原始数据-我的想法是通过下面列出的方法之一连接到底层PowerPivot(即MS Analysis Services)引擎,列出模型中包含的表,然后使用一个简单的DAX查询(类似于EVALUATE (table_name))从每个表中提取原始数据。轻松点,对吧?嗯,也许不是。

0个。一些背景资料

如你所见,我试过几种不同的方法。我将尽可能仔细地记录所有内容,以便那些不熟悉PowerPivot功能的人能够很好地了解我想做什么。

首先,介绍对Analysis Services引擎的编程访问的一些背景知识(上面说的是2005sql Server,但所有这些都应该仍然适用):SQL Server Data Mining ProgrammabilityData providers used for Analysis Services connections

我将在下面的示例中使用的示例Excel/PowerPivot文件可以在这里找到:Microsoft PowerPivot for Excel 2010 and PowerPivot in Excel 2013 Samples

另外,请注意,我使用的是Excel 2010,所以我的一些代码是特定于版本的。E、 如果使用Excel 2013,g.wb.Connections["PowerPivot Data"].OLEDBConnection.ADOConnection应该是wb.Model.DataModelConnection.ModelConnection.ADOConnection

在这个问题中,我将使用的连接字符串基于这里的信息:Connect to PowerPivot engine with C#。此外,有些方法显然需要在数据检索之前对PowerPivot模型进行某种初始化。请看这里:Automating PowerPivot Refresh operation from VBA

最后,这里有几个链接表明这是可以实现的(但是,请注意,这些链接主要是指C#,而不是Python):

一。使用ADOMD

import clr
clr.AddReference("Microsoft.AnalysisServices.AdomdClient")
import Microsoft.AnalysisServices.AdomdClient as ADOMD
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
             Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"

Connection = ADOMD.AdomdConnection(ConnString)
Connection.Open()

这里,问题似乎是PowerPivot模型尚未初始化:

AdomdConnectionException: A connection cannot be made. Ensure that the server is running.

2。使用AMO

import clr
clr.AddReference("Microsoft.AnalysisServices")
import Microsoft.AnalysisServices as AMO
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
             Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"

Connection = AMO.Server()
Connection.Connect(ConnString)

同样的故事,“服务器没有运行”:

ConnectionException: A connection cannot be made. Ensure that the server is running.

注意,AMO在技术上不用于查询数据,但我将其作为连接到PowerPivot模型的潜在方法之一。

三。使用ADO.NET

import clr
clr.AddReference("System.Data")
import System.Data.OleDb as ADONET
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
             Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"

Connection = ADONET.OleDbConnection()
Connection.ConnectionString = ConnString
Connection.Open()

这类似于What's the simplest way to access mssql with python or ironpython?。不幸的是,这也不起作用:

OleDbException: OLE DB error: OLE DB or ODBC error: The following system error occurred:
The requested name is valid, but no data of the requested type was found.

四。通过adodbapi模块使用ADO

import adodbapi
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
             Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"

Connection = adodbapi.connect(ConnString)

类似于Opposite Workings of OLEDB/ODBC between Python and MS Access VBA。我得到的错误是:

OperationalError: (com_error(-2147352567, 'Exception occurred.', (0, u'Microsoft OLE DB
Provider for SQL Server 2012 Analysis Services.', u'OLE DB error: OLE DB or ODBC error: The
following system error occurred:  The requested name is valid, but no data of the requested
type was found...

这与上面的ADO.NET基本相同。

5个。通过Excel/win32com模块使用ADO

from win32com.client import Dispatch
Xlfile = "H:\\PowerPivotTutorialSample.xlsx"
XlApp = Dispatch("Excel.Application")
Workbook = XlApp.Workbooks.Open(Xlfile)
Workbook.Connections["PowerPivot Data"].Refresh()
Connection = Workbook.Connections["PowerPivot Data"].OLEDBConnection.ADOConnection
Recordset = Dispatch('ADODB.Recordset')

Query = "EVALUATE(dbo_DimDate)" #sample DAX query
Recordset.Open(Query, Connection)

这种方法的思想来自这个使用VBA:Export a table or DAX query from Power Pivot to CSV using VBA的博客文章。注意,这种方法使用一个显式的刷新命令来初始化模型(即“服务器”)。下面是错误消息:

com_error: (-2147352567, 'Exception occurred.', (0, u'ADODB.Recordset', u'Arguments are of
the wrong type, are out of acceptable range, or are in conflict with one another.',
u'C:\\Windows\\HELP\\ADO270.CHM', 1240641, -2146825287), None)

但是,ADO连接似乎已建立:

  • type(Connection)返回instance
  • print(Connection)返回Provider=MSOLAP.5;Persist Security Info=True;Initial Catalog=Microsoft_SQLServer_AnalysisServices;Data Source=$Embedded$;MDX Compatibility=1;Safety Options=2;ConnectTo=11.0;MDX Missing Member Mode=Error;Subqueries=2;Optimize Response=3;Cell Error Mode=TextValue

问题似乎在于ADODB.Recordset对象的创建。

6。通过Excel/win32com使用ADO,直接使用ADODB.Connection

from win32com.client import Dispatch
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
             Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"

Connection = Dispatch('ADODB.Connection')
Connection.Open(ConnString)

类似于Connection to Access from Python [duplicate]Query access using ADO in Win32 platform (Python recipe)。不幸的是,Python抛出的错误与上面两个示例中的相同:

com_error: (-2147352567, 'Exception occurred.', (0, u'Microsoft OLE DB Provider for SQL
Server 2012 Analysis Services.', u'OLE DB error: OLE DB or ODBC error: The following system
error occurred:  The requested name is valid, but no data of the requested type was found.
..', None, 0, -2147467259), None)

7号。通过Excel/win32com使用ADO,直接使用ADODB.Connection+模型刷新

from win32com.client import Dispatch
Xlfile = "H:\\PowerPivotTutorialSample.xlsx"
XlApp = Dispatch("Excel.Application")
Workbook = XlApp.Workbooks.Open(Xlfile)
Workbook.Connections["PowerPivot Data"].Refresh()
ConnStringInternal = "Provider=MSOLAP.5;Persist Security Info=True;Initial Catalog=
                     Microsoft_SQLServer_AnalysisServices;Data Source=$Embedded$;MDX
                     Compatibility=1;Safety Options=2;ConnectTo=11.0;MDX Missing Member
                     Mode=Error;Optimize Response=3;Cell Error Mode=TextValue"

Connection = Dispatch('ADODB.Connection')
Connection.Open(ConnStringInternal)

我希望可以初始化一个Excel实例,然后初始化PowerPivot模型,然后使用Excel用于嵌入PowerPivot数据的内部连接字符串创建连接(类似于How do you copy the powerpivot data into the excel workbook as a table?-请注意,连接字符串与我在其他地方使用的字符串不同)。不幸的是,这不起作用,我猜是Python启动了ADODB。在一个单独的实例中的连接过程(当我在没有初始化Excel的情况下执行最后三行时得到相同的错误消息,等等):

com_error: (-2147352567, 'Exception occurred.', (0, u'Microsoft OLE DB Provider for SQL
Server 2012 Analysis Services.', u'Either the user, ****** (masked), does not have access
to the Microsoft_SQLServer_AnalysisServices database, or the database does not exist.',
None, 0, -2147467259), None)

Tags: the模型importdbdataerrorconnectionprovider
3条回答

我和汤姆·格雷森(又名戈班·萨奥)取得了联系,他很友好,让我把他的电子邮件发到这里。有一些有趣的掘金在他们,所以希望其他人也会发现他们有用。

电子邮件#1

When you say Python, you mean running Python.NET as a standalone exe? If that’s the case, you’re out of luck with Excel PP models (different story for Power BI desktop though). I’ve accessed PP models (2010+) successfully from both VBA, and from Python.NET (via AMO) using similar code to that in your SO question. The difference being (in both VBA & .NET version) is that my code is running in-process within Excel using Excel’s various add-in technologies. (Likely Tableau is also running as an add-in or has embedded Excel within itself enabling similar behaviour). DAX Studio (a useful C# code base to learn the how-tos of PP access) runs both as an Excel add-in and as a standalone EXE, but only as an add-in can it access Excel based PP models.

电子邮件#2

You might find the process of using Python.NET for this somewhat challenging. You would need to embed a Python engine using C#/VB.NET Excel add-in code. I’ve used Excel-DNA (a fantastic open source project) rather than MS’s highly cumbersome "official" method for developing such .NET addins in the past, but I mainly stick to VBA where at all possible.

Using VBA you’ll not be able to access the .NET-only AMO (so no ability to create calculated columns on the fly), but by loading the resulting dataset into an ADO recordset you should be able to output to a worksheet OR to a corporate-database/MS Access OR to a flat-file/CSV etc.

Unlike the 1M worksheet limit, for a flat-file or database output memory (RAM) will be the limiting factor, but, assuming you’re using 64bit Excel and have enough memory to hold the compacted model and the workspace for the largest of the model’s tables in un-compacted form (i.e. a row based rather than column based format that’ll result from a DAX Query), multiplied by 2ish (one instance within PP workspace the other within VBA’s ADO workspace) you should be okay.

Having said that, I’ve never attempted extracting a very large dataset, and using models as a dataset exchange medium is not one of PP’s "use-cases"; so, very large tables might hit some other bug/constraint!

瞧,我终于解决了这个问题——结果发现,使用Python访问Power Pivot数据确实是可能的!下面是我所做的简短回顾-你可以在这里找到更详细的描述:Analysis Services (SSAS) on a shoestring。注意:代码已经过优化,既没有效率也没有优雅。

  • 安装Microsoft Power BI Desktop(附带免费的AnalysisServices服务器,因此不需要昂贵的SQL server许可证—但是,如果您有适当的许可证,同样的方法显然也可以工作)。
  • 首先创建msmdsrv.ini设置文件,然后从ABF文件还原数据库(使用AMO.NET),然后使用ADOMD.NET提取数据,从而启动AS引擎。

下面是说明AS engine+AMO.NET部分的Python代码:

import psutil, subprocess, random, os, zipfile, shutil, clr, sys, pandas

def initialSetup(pathPowerBI):
    sys.path.append(pathPowerBI)

    #required Analysis Services assemblies
    clr.AddReference("Microsoft.PowerBI.Amo.Core")
    clr.AddReference("Microsoft.PowerBI.Amo")     
    clr.AddReference("Microsoft.PowerBI.AdomdClient")

    global AMO, ADOMD
    import Microsoft.AnalysisServices as AMO
    import Microsoft.AnalysisServices.AdomdClient as ADOMD

def restorePowerPivot(excelName, pathTarget, port, pathPowerBI):   
    #create random folder
    os.chdir(pathTarget)
    folder = os.getcwd()+str(random.randrange(10**6, 10**7))
    os.mkdir(folder)

    #extract PowerPivot model (abf backup)
    archive = zipfile.ZipFile(excelName)
    for member in archive.namelist():
        if ".data" in member:
            filename = os.path.basename(member)
            abfname = os.path.join(folder, filename) + ".abf"
            source = archive.open(member)
            target = file(os.path.join(folder, abfname), 'wb')
            shutil.copyfileobj(source, target)
            del target
    archive.close()

    #start the cmd.exe process to get its PID
    listPIDpre = [proc for proc in psutil.process_iter()]
    process = subprocess.Popen('cmd.exe /k', stdin=subprocess.PIPE)
    listPIDpost = [proc for proc in psutil.process_iter()]
    pid = [proc for proc in listPIDpost if proc not in listPIDpre if "cmd.exe" in str(proc)][0]
    pid = str(pid).split("=")[1].split(",")[0]

    #msmdsrv.ini
    msmdsrvText = '''<ConfigurationSettings>
       <DataDir>{0}</DataDir>
       <TempDir>{0}</TempDir>
       <LogDir>{0}</LogDir>
       <BackupDir>{0}</BackupDir>
       <DeploymentMode>2</DeploymentMode>
       <RecoveryModel>1</RecoveryModel>
       <DisklessModeRequested>0</DisklessModeRequested>
       <CleanDataFolderOnStartup>1</CleanDataFolderOnStartup>
       <AutoSetDefaultInitialCatalog>1</AutoSetDefaultInitialCatalog>
       <Network>
          <Requests>
             <EnableBinaryXML>1</EnableBinaryXML>
             <EnableCompression>1</EnableCompression>
          </Requests>
          <Responses>
             <EnableBinaryXML>1</EnableBinaryXML>
             <EnableCompression>1</EnableCompression>
             <CompressionLevel>9</CompressionLevel>
          </Responses>
          <ListenOnlyOnLocalConnections>1</ListenOnlyOnLocalConnections>
       </Network>
       <Port>{1}</Port>
       <PrivateProcess>{2}</PrivateProcess>
       <InstanceVisible>0</InstanceVisible>
       <Language>1033</Language>
       <Debug>
          <CallStackInError>0</CallStackInError>
       </Debug>
       <Log>
          <Exception>
             <CrashReportsFolder>{0}</CrashReportsFolder>
          </Exception>
          <FlightRecorder>
             <Enabled>0</Enabled>
          </FlightRecorder>
       </Log>
       <AllowedBrowsingFolders>{0}</AllowedBrowsingFolders>
       <ResourceGovernance>
          <GovernIMBIScheduler>0</GovernIMBIScheduler>
       </ResourceGovernance>
       <Feature>
          <ManagedCodeEnabled>1</ManagedCodeEnabled>
       </Feature>
       <VertiPaq>
          <EnableDisklessTMImageSave>0</EnableDisklessTMImageSave>
          <EnableProcessingSimplifiedLocks>1</EnableProcessingSimplifiedLocks>
       </VertiPaq>
    </ConfigurationSettings>'''

    #save ini file to disk, fill it with required parameters
    msmdsrvini = open(folder+"\\msmdsrv.ini", "w")
    msmdsrvText = msmdsrvText.format(folder, port, pid) #{0},{1},{2}
    msmdsrvini.write(msmdsrvText)
    msmdsrvini.close()

    #run AS engine inside the cmd.exe process
    initString = "\"{0}\\msmdsrv.exe\" -c -s \"{1}\""
    initString = initString.format(pathPowerBI.replace("/","\\"),folder)
    process.stdin.write(initString + " \n")

    #connect to the AS instance from Python
    AMOServer = AMO.Server()
    AMOServer.Connect("localhost:{0}".format(port))

    #restore database from PowerPivot abf backup, disconnect
    AMORestoreInfo = AMO.RestoreInfo(os.path.join(folder, abfname))
    AMOServer.Restore(AMORestoreInfo)
    AMOServer.Disconnect()

    return process

数据提取部分:

def runQuery(query, port, flag):
    #ADOMD assembly
    ADOMDConn = ADOMD.AdomdConnection("Data Source=localhost:{0}".format(port))
    ADOMDConn.Open()
    ADOMDCommand = ADOMDConn.CreateCommand() 
    ADOMDCommand.CommandText = query

    #read data in via AdomdDataReader object
    DataReader = ADOMDCommand.ExecuteReader()

    #get metadata, number of columns
    SchemaTable = DataReader.GetSchemaTable()
    numCol = SchemaTable.Rows.Count #same as DataReader.FieldCount

    #get column names
    columnNames = []
    for i in range(numCol):
        columnNames.append(str(SchemaTable.Rows[i][0]))

    #fill with data
    data = []
    while DataReader.Read()==True:
        row = []
        for j in range(numCol):
            try:
                row.append(DataReader[j].ToString())
            except:
                row.append(DataReader[j])
        data.append(row)
    df = pandas.DataFrame(data)
    df.columns = columnNames 

    if flag==0:
        DataReader.Close()
        ADOMDConn.Close()

        return df     
    else:   
        #metadata table
        metadataColumnNames = []
        for j in range(SchemaTable.Columns.Count):
            metadataColumnNames.append(SchemaTable.Columns[j].ToString())
        metadata = []
        for i in range(numCol):
            row = []
            for j in range(SchemaTable.Columns.Count):
                try:
                    row.append(SchemaTable.Rows[i][j].ToString())
                except:
                    row.append(SchemaTable.Rows[i][j])
            metadata.append(row)
        metadf = pandas.DataFrame(metadata)
        metadf.columns = metadataColumnNames

        DataReader.Close()
        ADOMDConn.Close()

        return df, metadf

然后通过以下方式提取原始数据:

pathPowerBI = "C:/Program Files/Microsoft Power BI Desktop/bin"
initialSetup(pathPowerBI)
session = restorePowerPivot("D:/Downloads/PowerPivotTutorialSample.xlsx", "D:/", 60000, pathPowerBI)
df, metadf = runQuery("EVALUATE dbo_DimProduct", 60000, 1)
endSession(session)

从PowerPivot中获取数据的问题是PowerPivot中的表格引擎在Excel中运行,而连接到该引擎的唯一方法是让代码也在Excel中运行。(我怀疑它可能使用共享内存或其他传输,但它绝对不会监听TCP端口或命名管道或其他允许外部进程连接的东西)

我们在Dax Studio中通过在Excel中运行C#VSTO Excel加载项来实现这一点。不过,这只是为了测试分析查询而设计的,而不是为了进行批量数据提取。我们使用字符串变量将数据从外接程序封送到UI,因此整个数据集必须小于2Gb,否则响应将被截断,您将看到一个“无法识别的响应”错误(数据被序列化为一个相当冗长的XML行集,因此在仅提取几百Mb数据时可能会看到它中断)

如果您想构建一个脚本来自动从模型中提取所有原始数据,我认为您不能用Python来完成,因为我不相信您可以让Python解释器在Excel中运行。我想看看使用这样的vba宏http://www.powerpivotblog.nl/export-a-table-or-dax-query-from-power-pivot-to-csv-using-vba/

您应该会发现,您可以在模型中查询具有类似“SELECT*FROM$SYSTEM.DBSCHEMA_tables”的表列表,然后可以在每个表上循环并使用上述链接中的代码变体进行提取。

相关问题 更多 >