Python unit test: asserting that two DataFrames are equal

Stack Overflow user
Asked on 2019-09-24 16:32:27
Answers: 1 | Views: 355 | Followers: 0 | Votes: 2

I am writing a unit test for a PySpark function. Here is the actual function.

def get_some_timestamp(self, final_set):
    final_set.createOrReplaceTempView("session_data")
    session_df = self.spark.sql("""SELECT \
                        id,\
                        date(sent_at) as date_without_timestamp, \
                        sent_at as date_time,\
                        CAST(lag(sent_at) OVER (PARTITION BY id, date(sent_at) ORDER BY sent_at) as timestamp) as prev_timestamp \
                        FROM session_data""")
    return session_df

The unit test for this function looks like this:

def test_get_some_timestamp(self):
    test_data_df = self.spark.createDataFrame(
        [
            ('1234', '2019-01-01T23:01:01.123Z', 'pageview'),
            ('4567', '2019-01-02T23:01:02.123Z', 'pageview'),
            ('1234', '2019-01-01T23:03:01.123Z', 'click'),
            ('1234', '2019-01-01T20:01:01.123Z', 'pageview'),
            ('4567', '2019-01-02T18:01:10.678Z', 'pageview'),
            ('7890', '2019-01-01T23:01:01.123Z', 'pageview')
        ],
        ['id', 'sent_at', 'event_name']
    )
    expected_output_pandas_df = pd.DataFrame({
        'id': ['1234', '4567', '1234', '1234', '4567', '7890'],
        'date_without_timestamp': ['2019-01-01', '2019-01-02', '2019-01-01', '2019-01-01', '2019-01-02', '2019-01-01'],
        'date_time': ['2019-01-01T23:01:01.123', '2019-01-02T23:01:02.123', '2019-01-01T23:03:01.123', '2019-01-01T20:01:01.123', '2019-01-02T18:01:10.678', '2019-01-01T23:01:01.123'],
        'prev_timestamp': [pd.to_datetime('2019-01-01T20:01:01.123'), '2019-01-02 18:01:10.678', '2019-01-01T23:01:01.123', 'NaT', 'NaT', 'NaT'],
        'event_name': ['pageview', 'pageview', 'click', 'pageview', 'pageview', 'pageview'],
    })

    actual_output_pandas_df = get_some_timestamp(self, test_data_df).toPandas()
    self.assert_equal_with_sort(expected_output_pandas_df, actual_output_pandas_df, ['id', 'date_time'])

My assert function is as follows:

def assert_equal_with_sort(self, results, expected, keycolumns):
    results_sorted = results.sort_values(by=keycolumns).reset_index(drop=True)
    expected_sorted = expected.sort_values(by=keycolumns).reset_index(drop=True)
    assert_frame_equal(results_sorted, expected_sorted)
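
For readers who want to try the helper outside a test class, here is a minimal standalone sketch. It assumes `assert_frame_equal` is imported from `pandas.testing` (the question does not show its imports), and the two small frames are made up purely for illustration.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def assert_equal_with_sort(results, expected, keycolumns):
    """Compare two DataFrames while ignoring the original row order."""
    results_sorted = results.sort_values(by=keycolumns).reset_index(drop=True)
    expected_sorted = expected.sort_values(by=keycolumns).reset_index(drop=True)
    assert_frame_equal(results_sorted, expected_sorted)


# Hypothetical data: same rows, different order.
left = pd.DataFrame({"id": ["b", "a"], "n": [2, 1]})
right = pd.DataFrame({"id": ["a", "b"], "n": [1, 2]})

# Passes, because both frames are sorted by "id" before the comparison.
assert_equal_with_sort(left, right, ["id"])
```

Sorting on the key columns and resetting the index matters because `assert_frame_equal` compares rows positionally and also compares the index.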

Now, when I run this unit test, it fails with the following error:

Traceback (most recent call last):
  File "/Users/neilshah/Documents/GitCode/ms_data_etl/tests/test_utm_session_tagging.py", line 161, in test_get_previous_activity_timestamp
    self.assert_equal_with_sort(expected_output_pandas_df,actual_output_pandas_df,['anonymous_id','date_time'])
  File "/Users/neilshah/Documents/GitCode/ms_data_etl/tests/test_utm_session_tagging.py", line 77, in assert_equal_with_sort
    assert_frame_equal(results_sorted, expected_sorted,check_frame_type=False,check_dtype=False,check_index_type=False,check_column_type=False,check_datetimelike_compat=True)
  File "/Users/neilshah/anaconda3/lib/python3.6/site-packages/pandas/util/testing.py", line 1348, in assert_frame_equal
    obj='DataFrame.iloc[:, {idx}]'.format(idx=i))
  File "/Users/neilshah/anaconda3/lib/python3.6/site-packages/pandas/util/testing.py", line 1216, in assert_series_equal
    check_dtype=check_dtype)
  File "/Users/neilshah/anaconda3/lib/python3.6/site-packages/pandas/util/testing.py", line 1087, in assert_numpy_array_equal
    _raise(left, right, err_msg)
  File "/Users/neilshah/anaconda3/lib/python3.6/site-packages/pandas/util/testing.py", line 1081, in _raise
    raise_assert_detail(obj, msg, left, right)
  File "/Users/neilshah/anaconda3/lib/python3.6/site-packages/pandas/util/testing.py", line 1018, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: numpy array are different

numpy array values are different (100.0 %)
[left]:  [2019-01-01, 2019-01-01, 2019-01-01, 2019-01-02, 2019-01-02, 2019-01-01]
[right]: [2019-01-01, 2019-01-01, 2019-01-01, 2019-01-02, 2019-01-02, 2019-01-01]

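I tried adding the different parameters listed at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.testing.assert_frame_equal.html, but none of them seems to help.

I also printed the dtypes of both DataFrames. In both of them, every column is of type object except prev_timestamp, which is datetime64[ns].

Can anyone help me?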


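1 Answer

Stack Overflow user

Answered on 2019-09-26 14:07:39

It looks like having the same dtype on both sides does not help. When comparing anything other than strings, the data types of the values themselves must match exactly. In my case the type was date, so the expected column had to hold date objects rather than date strings.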
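
The effect can be reproduced without Spark. The sketch below is an illustration under assumptions, not the asker's actual data: it builds two made-up single-column frames, one holding date strings and one holding `datetime.date` objects, and forces both to dtype `object` so they look identical when printed and when their dtypes are compared.

```python
import datetime

import pandas as pd
from pandas.testing import assert_frame_equal

# Both columns are forced to dtype "object"; only the element types differ.
as_strings = pd.DataFrame({"d": ["2019-01-01", "2019-01-02"]}, dtype=object)
as_dates = pd.DataFrame(
    {"d": [datetime.date(2019, 1, 1), datetime.date(2019, 1, 2)]}, dtype=object
)

# The column dtypes agree, so printing dtypes does not reveal the problem.
same_dtype = as_strings["d"].dtype == as_dates["d"].dtype

# Inspecting the Python type of each element does reveal it.
string_types = set(as_strings["d"].map(type))
date_types = set(as_dates["d"].map(type))

# The frames print identically but do not compare equal.
try:
    assert_frame_equal(as_strings, as_dates)
    frames_equal = True
except AssertionError:
    frames_equal = False
```

This matches the traceback in the question, where the left and right arrays print the same values yet are reported as 100 % different.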

My workaround was as follows:

expected_output_pandas_df = pd.DataFrame(
    {
        'id': ['1234', '4567', '1234', '1234', '4567', '7890'],
        # .date() produces datetime.date objects rather than strings
        'date_without_timestamp': [
            pd.to_datetime('2019-01-01').date(), pd.to_datetime('2019-01-02').date(),
            pd.to_datetime('2019-01-01').date(), pd.to_datetime('2019-01-01').date(),
            pd.to_datetime('2019-01-02').date(), pd.to_datetime('2019-01-01').date(),
        ],
        'date_time': [pd.to_datetime('2019-01-01T23:01:01.123'), '2019-01-02T23:01:02.123', '2019-01-01T23:03:01.123', '2019-01-01T20:01:01.123', '2019-01-02T18:01:10.678', '2019-01-01T23:01:01.123'],
        'prev_timestamp': [pd.to_datetime('2019-01-01T20:01:01.123'), '2019-01-02 18:01:10.678', '2019-01-01T23:01:01.123', 'NaT', 'NaT', 'NaT'],
        'event_name': ['pageview', 'pageview', 'click', 'pageview', 'pageview', 'pageview'],
    }
)
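
A more compact way to build such columns, offered only as a sketch and not as what the answer's author used, is to convert a whole list at once with `pd.to_datetime` and then take `.dt.date`. The variable names below are hypothetical, and `None` is used for missing timestamps as an assumption in place of the `'NaT'` strings above.

```python
import datetime

import pandas as pd

date_strings = ["2019-01-01", "2019-01-02", "2019-01-01"]

# Parse every string in one call, then keep only the calendar date.
# The result is a column of datetime.date objects.
expected_dates = pd.to_datetime(pd.Series(date_strings)).dt.date

# Timestamps can be parsed the same way; None becomes NaT.
prev_timestamps = pd.to_datetime(pd.Series(["2019-01-01T20:01:01.123", None]))
```

Converting the column as a whole keeps every element the same type, which avoids mixing parsed values and raw strings in one list.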

I ran into a similar problem with integer types. What fixed it was:

some_pandas_df = pd.DataFrame({'some_int_value':[pd.to_numeric('123456'),pd.to_numeric('543214')]})
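
The answer does not say what the integer mismatch was. Assuming it was digit strings on one side and real integers on the other, a minimal sketch of the same idea applied to a whole column looks like this (the frame contents are made up for illustration):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Assumed situation: one frame holds digit strings, the other holds integers.
raw = pd.DataFrame({"some_int_value": ["123456", "543214"]}, dtype=object)
expected = pd.DataFrame({"some_int_value": [123456, 543214]})

# pd.to_numeric converts the whole column of strings to an integer dtype.
converted = raw.assign(some_int_value=pd.to_numeric(raw["some_int_value"]))

# After conversion both values and dtype match, so the comparison passes.
assert_frame_equal(converted, expected)
```
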
Votes: 1
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/58076154
