Fast access and querying of a large delimited text file in Python

Posted 2024-09-29 01:37:44


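After searching for a while, I found many questions and answers related to this problem, but none that really addresses what I'm looking for. Basically, I'm implementing Python code to query information from a star catalogue (specifically, the Tycho-2 catalogue).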

The data is stored in a fairly large text file (about 0.5 GB), where each line corresponds to one star entry.

Here are a few sample lines:

0001 00008 1| |  2.31750494|  2.23184345|  -16.3|   -9.0| 68| 73| 1.7| 1.8|1958.89|1951.94| 4|1.0|1.0|0.9|1.0|12.146|0.158|12.146|0.223|999| |         |  2.31754222|  2.23186444|1.67|1.54| 88.0|100.8| |-0.2
0001 00013 1| |  1.12558209|  2.26739400|   27.7|   -0.5|  9| 12| 1.2| 1.2|1990.76|1989.25| 8|1.0|0.8|1.0|0.7|10.488|0.038| 8.670|0.015|999|T|         |  1.12551889|  2.26739556|1.81|1.52|  9.3| 12.7| |-0.2
0001 00016 1| |  1.05686490|  1.89782870|  -25.9|  -44.4| 85| 99| 2.1| 2.4|1959.29|1945.16| 3|0.4|0.5|0.4|0.5|12.921|0.335|12.100|0.243|999| |         |  1.05692417|  1.89793306|1.81|1.54|108.5|150.2| |-0.1
0001 00017 1|P|  0.05059802|  1.77144349|   32.1|  -20.7| 21| 31| 1.6| 1.6|1989.29|1985.38| 5|1.4|0.6|1.4|0.6|11.318|0.070|10.521|0.051| 18|T|         |  0.05086583|  1.77151389|1.78|1.55| 30.0| 45.6|D|-0.2

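The information is both delimited and fixed width. Each column holds a different piece of information about the star. For my Python utility, I want to be able to search this information quickly and retrieve the star entries that match a set of user-specified criteria.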

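For example, I want to be able to find all stars with a magnitude greater than 5.5 (column 18 or 19) whose right ascension lies between 0 and 30 degrees (column 3) and whose declination lies between -45 and -35 degrees (column 4). If I could hold all of this information in memory, it would be easy to read the file into a NumPy structured array or a pandas DataFrame and use logical indexing to retrieve the stars I want. Unfortunately, the machine I'm working on doesn't have enough memory for that (at any given time I have only about 0.5 GB of free memory, and the rest of the program I'm using takes up a lot of memory).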
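As a rough illustration of the logical-indexing query described above, the same boolean masks can be applied to one chunk of the file at a time so that only a bounded number of rows is ever in memory. This is only a sketch: `query_in_chunks`, `COLUMNS`, and the chunk size are hypothetical names and values, and the column positions are the 0-based indices used in the question's own code (2 = right ascension, 3 = declination, 17 = BT magnitude, 19 = VT magnitude).

```python
import pandas as pd

# 0-based field positions, taken from the indices used in the question's code.
# The names on the right are illustrative only.
COLUMNS = {2: "ra", 3: "dec", 17: "bt_mag", 19: "vt_mag"}


def query_in_chunks(source, min_ra, max_ra, min_dc, max_dc, min_vmag,
                    chunksize=100_000):
    """Filter a '|'-delimited catalogue chunk by chunk with boolean masks."""
    matches = []
    reader = pd.read_csv(source, sep="|", header=None, usecols=list(COLUMNS),
                         dtype=str, keep_default_na=False, chunksize=chunksize)
    for chunk in reader:
        chunk = chunk.rename(columns=COLUMNS)
        # Strip the fixed-width padding; blank fields become NaN and therefore
        # never satisfy any of the comparisons below.
        chunk = chunk.apply(
            lambda col: pd.to_numeric(col.str.strip(), errors="coerce"))
        mask = (chunk["ra"].between(min_ra, max_ra)
                & chunk["dec"].between(min_dc, max_dc)
                & (chunk["vt_mag"] > min_vmag))
        if mask.any():
            matches.append(chunk[mask])
    if not matches:
        return pd.DataFrame(columns=list(COLUMNS.values()))
    return pd.concat(matches, ignore_index=True)
```

A call such as `query_in_chunks('/path/to/catalogue', 0, 30, -45, -35, 5.5)` would return only the four numeric columns; more positions can be added to `COLUMNS` if the full record is needed. This still scans the whole file on every query, so it bounds memory use rather than removing the per-query cost.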

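My current solution loops over every line of the text file, interprets the data, and keeps an entry in memory only if it matches the specified criteria. My method is: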

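# imports this method relies on (they belong at the top of the module)
import os
import warnings
from io import BytesIO

import numpy as np


def getallwithcriteria(self, min_vmag=1., max_vmag=17., min_bmag=1., max_bmag=17., min_ra=0., max_ra=360.,
                       min_dc=-90., max_dc=90., min_prox=3, search_center=None, search_radius=None):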
    """
    This method returns entire star records for each star that meets the specified criteria.  The defaults for each
    criterion span the entire range of the catalogue.  Do not call this without changing the defaults as this will
    likely overflow memory and cause your system to drastically slow down or crash!

    Note that not all of the keyword arguments need to be specified.  For instance, we could run

        import tychopy as tp

        tyc = tp.Tycho('/path/to/catalogue')

        star_records = tyc.getallwithcriteria(min_vmag=3,max_vmag=4)

    to return all stars that have a visual magnitude between 3 and 4.

    This method returns a numpy structured array where each element contains the complete record for a star that
    matches the criteria specified by the user.  The output array has the following dtype:

            [('starid', 'U12'),
             ('pflag', 'U1'),
             ('starBearing', [('rightAscension', float), ('declination', float)]),
             ('properMotion', [('rightAscension', float), ('declination', float)]),
             ('uncertainty', [('rightAscension', int), ('declination', int), ('pmRA', float), ('pmDc', float)]),
             ('meanEpoch', [('rightAscension', float), ('declination', float)]),
             ('numPos', int),
             ('fitGoodness', [('rightAscension', float), ('declination', float), ('pmRA', float), ('pmDc', float)]),
             ('magnitude', [('BT', [('mag', float), ('err', float)]), ('VT', [('mag', float), ('err', float)])]),
             ('starProximity', int),
             ('tycho1flag', 'U1'),
             ('hipparcosNumber', 'U9'),
             ('observedPos', [('rightAscension', float), ('declination', float)]),
             ('observedEpoch', [('rightAscension', float), ('declination', float)]),
             ('observedError', [('rightAscension', float), ('declination', float)]),
             ('solutionType', 'U1'),
             ('correlation', float)]

    see the readme of the Tycho 2 catalogue for a more formal description of each field.

    If no stars are found that match the specified input then an empty numpy array with the above dtype is returned.

    Note that both a rectangular and a circular area can be specified.  The rectangular search area is specified
    using the min_ra/dc max_ra/dc keyword arguments while the circular search area is specified using the
    search_center and search_radius keyword arguments where the search_center is a tuple, list, numpy array, or
    other array like object which contains the center right ascension in element 0 and the center declination in
    element 1.  It is not recommended to specify both the circular and rectangular search areas.  If the search
    areas do not overlap then no stars will be returned.

    :param min_vmag:  the minimum (brightest) visual magnitude to return
    :param max_vmag:  the maximum (dimmest) visual magnitude to return
    :param min_bmag:  the minimum (brightest) blue magnitude to return
    :param max_bmag:  the maximum (dimmest) blue magnitude to return
    :param min_ra:  the minimum right ascension to return
    :param max_ra:  the maximum right ascension to return
    :param min_dc:  the minimum declination to return
    :param max_dc:  the maximum declination to return
    :param min_prox:  the closest proximity to a star to return
    :param search_center: An array like object containing the center point from which to search radially for stars.
    :param search_radius: A float specifying the radial search distance to use
    :return: A numpy structured array containing the star records for stars that meet the specified criteria
    """

    # form the dtype list that genfromtxt will use to interpret the star records
    dform = [('starid', 'U12'),
             ('pflag', 'U1'),
             ('starBearing', [('rightAscension', float), ('declination', float)]),
             ('properMotion', [('rightAscension', float), ('declination', float)]),
             ('uncertainty', [('rightAscension', int), ('declination', int), ('pmRA', float), ('pmDc', float)]),
             ('meanEpoch', [('rightAscension', float), ('declination', float)]),
             ('numPos', int),
             ('fitGoodness', [('rightAscension', float), ('declination', float), ('pmRA', float), ('pmDc', float)]),
             ('magnitude', [('BT', [('mag', float), ('err', float)]), ('VT', [('mag', float), ('err', float)])]),
             ('starProximity', int),
             ('tycho1flag', 'U1'),
             ('hipparcosNumber', 'U9'),
             ('observedPos', [('rightAscension', float), ('declination', float)]),
             ('observedEpoch', [('rightAscension', float), ('declination', float)]),
             ('observedError', [('rightAscension', float), ('declination', float)]),
             ('solutionType', 'U1'),
             ('correlation', float)]

    # initialize a list which will contain the star record strings for stars that match the input criteria
    records = []

    # loop through each record in the Tycho2 catalogue
    for record in self._catalogueFile:

        # interpret the record as simply as we can
        split_record = record.split(sep="|")

        # check that we are examining a good star, that it falls within the bearing bounds, and that it is far
        # enough away from other stars
        if ("X" not in split_record[1]) and min_ra <= float(split_record[2]) <= max_ra \
                and min_dc <= float(split_record[3]) <= max_dc and int(split_record[21]) >= min_prox:

            # perform the radial search if the user has specified a center and radius
            if search_center is None or pow(pow(float(split_record[2])-search_center[0], 2) +
                                            pow(float(split_record[3])-search_center[1], 2), 1/2.) < search_radius:

                # Check to see if we have values for both blue and visual magnitudes, and check to see if these
                # magnitudes fall within the specified magnitude bounds
                # We need to split this up like this in order to make sure that either the bmag or the vmag exist
                if bool(split_record[17].strip()) and bool(split_record[19].strip()) \
                        and min_bmag <= float(split_record[17]) <= max_bmag \
                        and min_vmag <= float(split_record[19]) <= max_vmag:

                    records.append(record+'\n')

                # if only the visual magnitude exists then check its bounds - also check if the user has specified
                # its bounds
                elif not bool(split_record[17].strip()) and bool(split_record[19].strip()) \
                        and min_vmag <= float(split_record[19]) <= max_vmag and (max_vmag != 17. or min_vmag != 1.):

                    records.append(record+'\n')

                # if only the blue magnitude exists then check its bounds - also check if the user has specified its
                # bounds
                elif not bool(split_record[19].strip()) and bool(split_record[17].strip()) \
                        and min_bmag <= float(split_record[17]) <= max_bmag and (max_bmag != 17. or min_bmag != 1.):

                    records.append(record+'\n')

                # otherwise check to see if the user has changed the defaults.  If they haven't then store the star
                elif max_bmag == 17. and max_vmag == 17. and min_bmag == 1. and min_vmag == 1.:

                    records.append(record+'\n')

    # check to see if any stars met the criteria.  If they didn't then return an empty array.  If they did then use
    # genfromtxt to interpret the string of star records
    if not bool(records):
        # shape (0,) gives a truly empty array; (1,) would return one uninitialized record
        nprecords = np.empty((0,), dtype=dform)

        warnings.warn('No stars were found meeting your criteria.  Please try again.')
    else:
        nprecords = np.genfromtxt(BytesIO("".join(records).encode()), dtype=dform, delimiter='|', converters={
            0: lambda s: s.strip(),
            1: lambda s: s.strip(),
            22: lambda s: s.strip(),
            23: lambda s: s.strip(),
            30: lambda s: s.strip()})

        if self._includeProperMotion:
            applypropermotion(nprecords, self.newEpoch, copy=False)

    # reset the catalogue back to the beginning for future searches
    self._catalogueFile.seek(0, os.SEEK_SET)

    return nprecords

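This is still very slow (though faster than exhausting all memory and pushing everything else into swap). For comparison, each star retrieval takes about 2-3 minutes, and I need to retrieve stars about 40 times in this program (with different criteria each time). The rest of the program takes about 5 seconds in total.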

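My question now is: what is the best way to speed this up (other than getting a better computer with more memory)? I'm open to any suggestion, as long as it is explained well and won't take me months to implement. I'm even willing to write a function that converts the original catalogue file into a better format (a fixed-width binary file sorted on a specific column) to speed things up.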
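One way the "fixed-width binary file sorted on a specific column" idea could look is sketched below. It is not a tested solution for the real catalogue: `build_index`, `query_index`, and `INDEX_DTYPE` are hypothetical names, sorting on declination is an arbitrary choice, and the sketch assumes that every row without an "X" position flag has non-blank right ascension and declination fields. The binary file holds only the search columns plus the byte offset of each line, so full records are still read from the original text file.

```python
import numpy as np

# One fixed-width binary record per star: the search columns plus the byte
# offset of the star's line in the original text catalogue.
INDEX_DTYPE = np.dtype([("dec", "f8"), ("ra", "f8"), ("vt", "f8"), ("offset", "i8")])


def build_index(catalogue_path, index_path):
    """One-time conversion: write a binary index sorted by declination."""
    with open(catalogue_path, "rb") as fh:
        n_lines = sum(1 for _ in fh)          # first pass: count the lines
        index = np.empty(n_lines, dtype=INDEX_DTYPE)
        fh.seek(0)
        n = 0
        offset = 0
        for line in fh:                        # second pass: fill the array
            fields = line.split(b"|")
            if b"X" not in fields[1]:          # skip entries flagged "X"
                vt = fields[19].strip()
                index[n] = (float(fields[3]), float(fields[2]),
                            float(vt) if vt else np.nan, offset)
                n += 1
            offset += len(line)
    index = index[:n]
    index.sort(order="dec")
    index.tofile(index_path)


def query_index(catalogue_path, index_path, min_dc, max_dc, min_ra, max_ra, min_vmag):
    """Binary-search the declination band, then filter it on the other columns."""
    index = np.memmap(index_path, dtype=INDEX_DTYPE, mode="r")
    lo = np.searchsorted(index["dec"], min_dc, side="left")
    hi = np.searchsorted(index["dec"], max_dc, side="right")
    band = index[lo:hi]
    keep = (band["ra"] >= min_ra) & (band["ra"] <= max_ra) & (band["vt"] > min_vmag)
    records = []
    with open(catalogue_path, "rb") as fh:
        # Visit the matching lines in file order to keep the seeks sequential.
        for off in np.sort(band["offset"][keep]):
            fh.seek(int(off))
            records.append(fh.readline().decode())
    return records
```

The strings returned by `query_index` are raw catalogue lines, so they could be handed to the existing `np.genfromtxt` call unchanged. Only the declination band touched by a query has to be paged in from the index, which is where any saving over a full scan of the text file would come from.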

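So far I've considered memmap'ing the file but decided against it, because I don't really think it would help with what I need to do. I've also considered building a database from the data and querying it with SQLAlchemy or something similar; however, I'm not very familiar with databases and don't know whether this would give any real speed improvement.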


1 answer
User
#1 · Posted 2024-09-29 01:37:44

As @wflynny has already mentioned, PyTables (an HDF5 store) is far more efficient than text/CSV/etc. files. Beyond that, you can read from a PyTables store conditionally using pandas' read_hdf(where='<where condition>').
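A minimal sketch of the conditional read might look like the following, assuming pandas with the PyTables package (`tables`) installed. The helper names `append_chunk` and `query_store`, the key `"stars"`, and the column names are hypothetical. The store has to be written in `format="table"` with the searchable columns listed in `data_columns`, otherwise they cannot be used in a `where` expression.

```python
import pandas as pd

# Columns that should be usable in `where` expressions (illustrative names).
QUERY_COLUMNS = ["ra", "dec", "bt_mag", "vt_mag"]


def append_chunk(store_path, chunk):
    """Append one DataFrame chunk to an on-disk HDF5 table."""
    chunk.to_hdf(store_path, key="stars", format="table", append=True,
                 data_columns=QUERY_COLUMNS)


def query_store(store_path, where):
    """Read back only the rows that satisfy the `where` expression."""
    return pd.read_hdf(store_path, key="stars", where=where)
```

After a one-time conversion that appends the catalogue chunk by chunk, a query such as `query_store(path, "(ra >= 0) & (ra <= 30) & (dec >= -45) & (dec <= -35) & (vt_mag > 5.5)")` would load only the matching rows into memory.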

You may want to check this comparison. If your machine runs UNIX or Linux, you may also want to look at the Feather format, which should be very fast.

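Beyond that, I would also check whether an RDBMS (MySQL/PostgreSQL/SQLite) with appropriate indexes speeds things up. However, with only 0.5 GB of RAM available, you may run into problems if you want to use pandas and an RDBMS at the same time.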
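For the RDBMS route, SQLite needs no server and ships with Python as the `sqlite3` module, so it is probably the cheapest one to try first. The sketch below is an assumption-laden illustration, not a benchmarked solution: the table layout, the choice of a `(dec, ra)` index, and the names `build_database` and `query_database` are all hypothetical, and the field positions are the 0-based indices from the question's code.

```python
import sqlite3


def _to_float(field):
    """Convert a fixed-width field to float, mapping blank fields to NULL."""
    field = field.strip()
    return float(field) if field else None


def build_database(catalogue_lines, db_path):
    """One-time load of the search columns plus the raw record into SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS stars "
                 "(ra REAL, dec REAL, bt_mag REAL, vt_mag REAL, record TEXT)")

    def rows():
        for line in catalogue_lines:
            f = line.split("|")
            if "X" in f[1]:                    # skip entries flagged "X"
                continue
            yield (float(f[2]), float(f[3]), _to_float(f[17]),
                   _to_float(f[19]), line.rstrip("\n"))

    with conn:                                 # commit once at the end
        conn.executemany("INSERT INTO stars VALUES (?, ?, ?, ?, ?)", rows())
        conn.execute("CREATE INDEX IF NOT EXISTS idx_stars_dec_ra "
                     "ON stars (dec, ra)")
    return conn


def query_database(conn, min_ra, max_ra, min_dc, max_dc, min_vmag):
    """Return the raw catalogue lines of all stars inside the search box."""
    cur = conn.execute(
        "SELECT record FROM stars "
        "WHERE dec BETWEEN ? AND ? AND ra BETWEEN ? AND ? AND vt_mag > ?",
        (min_dc, max_dc, min_ra, max_ra, min_vmag))
    return [row[0] for row in cur]
```

Passing the open catalogue file object as `catalogue_lines` streams it line by line, so the load itself should not need much memory. The returned lines are the original records and can be parsed the same way as before.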
