TSAuditor: A time-series auditing framework [P]
Background
This happened a few months ago, while I was working on an analysis project with time-series data. The dataset was large (10 years of data), and I was using a standard profiling tool to check the pipeline. Everything looked fine: the tool reported a 3% missing-data rate for the volume columns.
Since this was my first time working with time-series data, I assumed that was noise and didn't think much of it. However, the downstream models weren't behaving as expected. That's when I suspected something was off and actually looked at the data. The 3% wasn't scattered noise: it was a 6-day stretch of consecutive missing data.
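To illustrate why an aggregate missing rate can hide this, here is a minimal sketch in plain Python (no dependencies). The toy series, the column name `volume`, and the helper `longest_missing_run` are made up for illustration and are not part of tsauditor. It reports the same low missing rate a profiler would show, next to the longest run of consecutive missing values.

```python
from datetime import date, timedelta
from itertools import groupby


def longest_missing_run(values):
    """Return (start_index, length) of the longest run of None values."""
    best_start, best_len = None, 0
    pos = 0
    # groupby splits the series into consecutive runs of missing / present values
    for is_missing, group in groupby(values, key=lambda v: v is None):
        run_len = sum(1 for _ in group)
        if is_missing and run_len > best_len:
            best_start, best_len = pos, run_len
        pos += run_len
    return best_start, best_len


# Toy daily series (illustrative only): 200 days with one 6-day hole.
days = [date(2020, 1, 1) + timedelta(days=i) for i in range(200)]
volume = [100.0 + i for i in range(200)]
for i in range(50, 56):
    volume[i] = None

# What a profiler summarises: a single percentage.
missing_rate = sum(v is None for v in volume) / len(volume)

# What actually matters for a time series: how the gaps are laid out.
start, length = longest_missing_run(volume)
print(f"missing rate: {missing_rate:.1%}")
print(f"longest gap: {length} consecutive days starting {days[start]}")
```

The percentage alone cannot distinguish six isolated missing days from one six-day outage, while the run length can.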
The Problem
It didn't stop there. The data also had leakage, which is how the model hit 99% accuracy. The rolling windows and lag features were wrong too, because the chronological order of the rows was broken.
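A small sketch of how a broken chronological order corrupts a lag feature, again in plain Python. The helpers `find_chronological_breaks` and `lag_by_position` and the toy data are hypothetical, written only for this example, and say nothing about how tsauditor implements its checks.

```python
from datetime import date, timedelta


def find_chronological_breaks(timestamps):
    """Return indices i where timestamps[i] is not strictly later than timestamps[i - 1]."""
    return [i for i in range(1, len(timestamps)) if timestamps[i] <= timestamps[i - 1]]


def lag_by_position(values, k=1):
    """Naive lag feature: the value k rows earlier, ignoring timestamps."""
    return [None] * k + values[:-k]


# Toy data (illustrative only): the row for day 4 sits before days 2 and 3.
first_day = date(2021, 3, 1)
stamps = [first_day + timedelta(days=d) for d in (0, 1, 4, 2, 3, 5)]
prices = [10, 11, 14, 12, 13, 15]

breaks = find_chronological_breaks(stamps)
lagged = lag_by_position(prices)

# At the break, the "previous" value for day 2 is really the value from day 4.
for i in breaks:
    print(f"row {i} ({stamps[i]}): lag-1 = {lagged[i]}, taken from {stamps[i - 1]}")

# Sorting by timestamp before building lag features restores the intended meaning.
order = sorted(range(len(stamps)), key=stamps.__getitem__)
sorted_prices = [prices[i] for i in order]
```

In this toy case the lag feature for day 2 contains a value from the future, which is one way an ordering problem can turn into leakage.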
Looking back, if I had done proper EDA, this would not have happened.
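As one example of an EDA check that can flag this kind of leakage, the sketch below marks features whose correlation with the target is implausibly high. This is a crude heuristic under stated assumptions: the 0.99 threshold, the feature names, and the helpers `pearson` and `suspicious_features` are all made up for illustration, and the post does not say tsauditor works this way.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        return 0.0
    return cov / (std_x * std_y)


def suspicious_features(features, target, threshold=0.99):
    """Names of features whose absolute correlation with the target reaches the threshold."""
    return [
        name
        for name, column in features.items()
        if abs(pearson(column, target)) >= threshold
    ]


# Toy data (illustrative only).
target = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
features = {
    # A legitimate lag: yesterday's target, known at prediction time.
    "prev_day": [0.0] + target[:-1],
    # A leaky column: a rescaled copy of today's target.
    "leaky_copy": [2 * t + 1 for t in target],
}

flagged = suspicious_features(features, target)
print("features to inspect for leakage:", flagged)
```

A near-perfect correlation is not proof of leakage, but it is a cheap signal that a column deserves a closer look before trusting a 99% score.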
The Solution: TSAuditor
I decided to build a small validation tool called tsauditor that catches:
- Chronological breaks
- Leakage
- Sudden point-to-point spikes that still fall within the global value range
For each flagged data point, it also gives a description with evidence of why the point is faulty, and suggests fixes.
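One reading of the spike check listed above is a value that jumps sharply relative to its neighbours while staying inside the series' overall min/max range, so a global bounds check never sees it. The sketch below shows that idea in plain Python. The helper `local_spikes`, the `max_step` parameter, and the toy data are assumptions for illustration, not tsauditor's actual API or algorithm.

```python
def local_spikes(values, max_step):
    """Indices where the series jumps by more than max_step and immediately jumps back."""
    spikes = []
    for i in range(1, len(values) - 1):
        step_in = values[i] - values[i - 1]
        step_out = values[i + 1] - values[i]
        # A spike is a large move followed by a large move in the opposite direction.
        if abs(step_in) > max_step and abs(step_out) > max_step and step_in * step_out < 0:
            spikes.append(i)
    return spikes


# Toy data (illustrative only): a slow ramp from 10 to 40 with one corrupted point.
values = [float(v) for v in range(10, 41)]
values[5] = 30.0  # would have been 15.0; 30.0 is still a value the series reaches later

# Global bounds check: nothing is outside the expected range.
lower, upper = 10.0, 40.0
out_of_bounds = [i for i, v in enumerate(values) if not lower <= v <= upper]

# Sequential check: the same point stands out against its neighbours.
spikes = local_spikes(values, max_step=5.0)
print("outside global bounds:", out_of_bounds)
print("sequential spikes:", spikes)
```

The global check passes because 30.0 is a perfectly ordinary value for this series; only the comparison with adjacent points exposes it.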
Features
- Open source
- Lightweight
- Available on PyPI
- Includes an example notebook with a side-by-side comparison of tsauditor against a standard profiling tool
- Can be used without defining a domain
You can also check out the comparison notebook. Link in comments.