Reddit - r/MachineLearning

TSAuditor: A time-series auditing framework [P]

Background

A few months ago I was working on an analysis project built on time-series data. The dataset was large, covering 10 years. I ran a standard profiling tool over the pipeline, and everything looked fine: the tool reported a 3% missing-data rate for the volume columns.

This was my first time working with time-series data, so I assumed the missing values were noise and didn't think much of it. The downstream models weren't behaving as expected, though, so I finally looked at the data itself. The 3% missing data was not scattered noise. It was one continuous 6-day stretch of missing data.
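To show why the headline missing rate hides this, here is a minimal pandas sketch. It is not tsauditor's code or API. The helper name `longest_missing_run` and the toy data (200 daily readings with a 6-day gap) are made up for illustration.

```python
import numpy as np
import pandas as pd


def longest_missing_run(series: pd.Series) -> int:
    """Return the length of the longest run of consecutive missing values."""
    is_missing = series.isna()
    # Every non-missing value bumps the counter, so consecutive NaNs
    # all share the group id of the last valid value before them.
    group_ids = (~is_missing).cumsum()
    run_lengths = is_missing.groupby(group_ids).sum()
    return int(run_lengths.max()) if len(run_lengths) else 0


# Hypothetical data: 200 daily volume readings with one 6-day hole.
index = pd.date_range("2024-01-01", periods=200, freq="D")
volume = pd.Series(np.arange(200, dtype=float), index=index)
volume.iloc[50:56] = np.nan

missing_rate = volume.isna().mean()        # overall share of missing values
longest_gap = longest_missing_run(volume)  # longest consecutive hole

print(f"missing rate: {missing_rate:.1%}, longest gap: {longest_gap} rows")
```

The two numbers answer different questions: the rate says how much is missing, the longest run says whether it is missing in one block. A profiler that reports only the first cannot tell scattered dropouts from an outage.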

The Problem

It didn't stop there. The data also had leakage, which let the model hit 99% accuracy. The rolling windows and lag features were wrong too, because the chronological order of the rows was broken.
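A small sketch of how a broken row order corrupts a lag feature, using plain pandas. This is an illustration under assumed data, not how tsauditor does it. The helper `find_order_breaks` and the five-row frame are hypothetical.

```python
import pandas as pd


def find_order_breaks(timestamps: pd.Series) -> pd.Series:
    """Return timestamps that are earlier than the row directly before them."""
    ts = pd.to_datetime(timestamps)
    # A negative difference means this row steps backwards in time.
    went_backwards = ts.diff() < pd.Timedelta(0)
    return ts[went_backwards]


# Hypothetical frame where 2024-01-03 and 2024-01-04 are swapped.
frame = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-04", "2024-01-03", "2024-01-05"]
    ),
    "volume": [10.0, 11.0, 13.0, 12.0, 14.0],
})

is_ordered = frame["timestamp"].is_monotonic_increasing
breaks = find_order_breaks(frame["timestamp"])

# Lag feature on the rows as stored: 2024-01-03 gets the value from 2024-01-04.
lag_unsorted = frame["volume"].shift(1)

# Same lag after restoring chronological order.
sorted_frame = frame.sort_values("timestamp").reset_index(drop=True)
lag_sorted = sorted_frame["volume"].shift(1)
```

In the unsorted frame, the "previous" value for 2024-01-03 actually comes from the following day, so the lag feature quietly contains future information. Sorting by timestamp before building lag or rolling features removes that particular problem.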

Looking back, if I had done proper EDA, this would not have happened.

The Solution: TSAuditor

I decided to build a small validation tool called tsauditor that catches:

  • Chronological breaks
  • Leakage
  • Sudden sequential spikes that stay within the global bounds of the data

For each flagged data point, it also gives a description with evidence of why the point is faulty, and it suggests fixes.
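One way to read the spike check in the list above: a value can sit inside the overall min/max of a column and still be a sudden jump relative to its neighbours. The sketch below shows that idea with pandas. It is not tsauditor's implementation. The function `local_spikes`, the window of 5, and the threshold of 5 are assumptions picked for the example.

```python
import numpy as np
import pandas as pd


def local_spikes(series: pd.Series, window: int = 5, threshold: float = 5.0) -> pd.Series:
    """Flag points that jump far from the median of the preceding `window` points."""
    # shift(1) keeps the current point out of its own baseline.
    baseline = series.shift(1).rolling(window).median()
    deviation = (series - baseline).abs()
    # Median absolute step between neighbours, used as a robust scale.
    typical_step = series.diff().abs().median()
    return deviation > threshold * typical_step


# Hypothetical series: a steady ramp from 0 to 100 with one bad reading.
values = pd.Series(np.arange(101, dtype=float))
values.iloc[20] = 80.0  # inside the global range, far from its neighbours

# Global check: z-score against the whole column.
z_scores = (values - values.mean()) / values.std()
global_flags = z_scores.abs() > 3

# Local check: compare each point with its recent past.
local_flags = local_spikes(values)
spike_positions = np.flatnonzero(local_flags.to_numpy()).tolist()
```

Here the global z-score flags nothing because 80 is an ordinary value for the column as a whole, while the local check flags only position 20. On real data the window and threshold would need tuning, and a level shift would also trigger this kind of check.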

Features

  • Open source
  • Lightweight
  • Available on PyPI
  • Includes an example notebook with a side-by-side comparison of tsauditor against a standard profiling tool
  • Can be used without defining a domain

You can also check out the comparison notebook. Link in comments.
