How do I do TDD with Pandas and pytest?

Posted 2024-06-24 12:59:32


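I have a Python script that consolidates reports, using Pandas throughout in a series of DataFrame operations (drop, groupby, sum, etc.). Say I start with a simple function that removes all columns that have no values; it takes a DataFrame as input and returns a DataFrame as output: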

# cei.py
import pandas as pd

def clean_table_cols(source_df: pd.DataFrame) -> pd.DataFrame:
    # Example implementation: drop every column whose values are all missing.
    return source_df.dropna(axis="columns", how="all")

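In a test, I want to verify that this function really does remove every column whose values are all missing. So I arrange a test input and an expected output, and compare them with the assert_frame_equal function from pandas.testing: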

# test_cei.py
import pandas as pd

import cei  # module under test

def test_clean_table_cols() -> None:
    df = pd.DataFrame(
        {
            "full_valued": [1, 2, 3],
            "all_missing1": [None, None, None],
            "some_missing": [None, 2, 3],
            "all_missing2": [None, None, None],
        }
    )
    expected = pd.DataFrame({"full_valued": [1, 2, 3], "some_missing": [None, 2, 3]})
    result = cei.clean_table_cols(df)
    pd.testing.assert_frame_equal(result, expected)

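My question is: conceptually, is this a unit test or an e2e/integration test, given that I am not mocking the implementation? But if I mock the DataFrame, I am no longer testing what the code actually does. What is the recommended way to test this while following TDD best practices?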

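Note: using Pandas in this project is a design decision, so there is no intention to abstract the Pandas interface in order to replace it with another library in the future.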


2 answers

Yes, this code is effectively an integration test, and that may not be a bad thing.

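Even if using pandas is a fixed design decision, there are still many good reasons to abstract away from external libraries, and testing is one of them. Abstracting from an external library lets you test your business logic independently of that library. In this case, abstracting from pandas would turn the test above into a unit test: it would test your code's interaction with the library rather than the library itself.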

To apply this pattern, I suggest looking at the ports and adapters architecture pattern.
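As a rough illustration of that idea, here is a minimal sketch of ports and adapters applied to the function from the question. All of the names below (ReportTable, PandasReportTable, FakeReportTable, consolidate) are hypothetical and invented for this example; they are not part of pandas or of the asker's code. The port describes the only table operations the business logic may use, the pandas adapter implements them with a real DataFrame, and the fake lets the business logic be unit tested without pandas.

```python
from __future__ import annotations

from typing import Protocol

import pandas as pd


class ReportTable(Protocol):
    """Port (hypothetical): the table operations the business logic relies on."""

    def drop_empty_columns(self) -> ReportTable: ...

    def column_names(self) -> list[str]: ...


class PandasReportTable:
    """Adapter: implements the port on top of a real pandas DataFrame."""

    def __init__(self, df: pd.DataFrame) -> None:
        self._df = df

    def drop_empty_columns(self) -> PandasReportTable:
        # Same call as the example implementation in the question.
        return PandasReportTable(self._df.dropna(axis="columns", how="all"))

    def column_names(self) -> list[str]:
        return list(self._df.columns)


class FakeReportTable:
    """Test double: a plain dict of column name -> values, no pandas involved."""

    def __init__(self, columns: dict[str, list]) -> None:
        self._columns = columns

    def drop_empty_columns(self) -> FakeReportTable:
        # Keep a column only if at least one of its values is not None.
        kept = {
            name: values
            for name, values in self._columns.items()
            if any(value is not None for value in values)
        }
        return FakeReportTable(kept)

    def column_names(self) -> list[str]:
        return list(self._columns)


def consolidate(table: ReportTable) -> list[str]:
    """Business logic: depends only on the port, never on pandas directly."""
    return table.drop_empty_columns().column_names()
```

A unit test would pass a FakeReportTable to consolidate, while a separate integration test would pass a PandasReportTable wrapping a real DataFrame. The trade-off is extra indirection, which may or may not be worth it for a script that is committed to pandas.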

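However, this does mean you are no longer testing the functionality that pandas provides. If testing that is still your specific intent, then an integration test is not a bad solution.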
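If you stay with tests that run against real DataFrames, a few small, focused cases can keep them fast and readable. The sketch below assumes the example implementation from the question (dropna with how="all"); the test names are illustrative only, and pytest would collect them by their test_ prefix.

```python
import pandas as pd


def clean_table_cols(source_df: pd.DataFrame) -> pd.DataFrame:
    # Assumed implementation, taken from the comment in the question.
    return source_df.dropna(axis="columns", how="all")


def test_keeps_partially_filled_columns() -> None:
    df = pd.DataFrame(
        {"some_missing": [None, 2, 3], "all_missing": [None, None, None]}
    )
    expected = pd.DataFrame({"some_missing": [None, 2, 3]})
    pd.testing.assert_frame_equal(clean_table_cols(df), expected)


def test_does_not_mutate_input() -> None:
    df = pd.DataFrame({"full": [1, 2], "empty": [None, None]})
    clean_table_cols(df)
    # The caller's frame should still have both columns.
    assert list(df.columns) == ["full", "empty"]


def test_frame_without_empty_columns_is_unchanged() -> None:
    df = pd.DataFrame({"a": [1, 2], "b": [3.0, None]})
    pd.testing.assert_frame_equal(clean_table_cols(df), df)
```

Each test exercises pandas for real, so these are integration tests in the sense used above, but each one still checks a single behaviour of the function.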

You may find tdda (test-driven data analysis) useful. Quoting the documentation:

The tdda package provides Python support for test-driven data analysis (see 1-page summary with references, or the blog). The tdda.referencetest library is used to support the creation of reference tests, based on either unittest or pytest. The tdda.constraints library is used to discover constraints from a (Pandas) DataFrame, write them out as JSON, and to verify that datasets meet the constraints in the constraints file. It also supports tables in a variety of relation databases. There is also a command-line utility for discovering and verifying constraints, and detecting failing records. The tdda.rexpy library is a tool for automatically inferring regular expressions from a column in a Pandas DataFrame or from a (Python) list of examples. There is also a command-line utility for Rexpy. Although the library is provided as a Python package, and can be called through its Python API, it also provides command-line tools.

See also Nick Radcliffe's PyData talk on Test-Driven Data Analysis.
