When traditional SQL databases and Excel files fall short, we build flexible, scalable infrastructure that speeds up analysis and deepens insight. This project shows what that looks like in practice: breaking through the limitations of raw data so complex workflows can be integrated and processed quickly.

We can’t use this data

What began as a small project to support analysis quickly turned into a mission-critical data engineering initiative.

Analysts had the questions, stakeholders needed the insights — but the data wasn’t yet accessible. MTGJSON offered a vast and incredibly rich dataset for Magic: The Gathering — everything from pricing and printings to mechanics and formats. But the raw files were built for developers, not analysts. Without pipelines to convert and shape the data, it was effectively out of reach for tools like Tableau, SQL, and Excel — the platforms our analysts rely on daily.

That’s where engineering stepped in.

This wasn’t about cleaning data for a one-time report. It was about unlocking access — giving analysts and stakeholders the power to explore, test hypotheses, and surface trends without waiting on manual data prep.

So we designed a modular, crawl-walk-run framework: one that prioritized quick wins, encouraged iteration, and kept engineering, analytics, and product aligned.

Step One: Getting to the Data

The first major hurdle was just getting the data into a usable shape. With formats ranging from .json.bz2 to .tar.bz2 and .csv, and with schemas varying across files, ingestion wasn’t trivial. We needed a pipeline that could support flexibility, scale, and rapid iteration.
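To make the format handling concrete, here is a rough sketch of the kind of extraction helpers involved. The staging layout and helper names are ours for illustration, not the production code; Spark can read bz2-compressed JSON on its own, but tar archives need unpacking first.

    import bz2
    import tarfile
    from pathlib import Path

    STAGED_DIR = Path("data/staged")  # illustrative layout
    STAGED_DIR.mkdir(parents=True, exist_ok=True)

    def extract_json_bz2(path: str) -> str:
        """Decompress a .json.bz2 file so downstream readers see plain JSON."""
        src = Path(path)
        target = STAGED_DIR / src.name.removesuffix(".bz2")
        with bz2.open(src, "rb") as fin, open(target, "wb") as fout:
            fout.write(fin.read())
        return str(target)

    def extract_tar_bz2(path: str) -> str:
        """Unpack a .tar.bz2 archive into a staging folder named after it."""
        src = Path(path)
        target = STAGED_DIR / src.name.removesuffix(".tar.bz2")
        with tarfile.open(src, "r:bz2") as archive:
            archive.extractall(target)
        return str(target)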

We built modular Airflow DAGs that orchestrated Spark-based extraction functions for each file type. By supporting dynamic task groups and fallback CSV outputs, we made sure analysts could validate data early—no waiting on engineering for schema alignment.
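In orchestration terms, the DAG looked roughly like the sketch below: one task group per source file, each pairing a Spark-backed extraction task with a CSV fallback. The DAG id, schedule, file names, module path, and the write_csv_fallback placeholder are all illustrative rather than the real pipeline.

    import pendulum
    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.utils.task_group import TaskGroup

    # The extraction helpers sketched earlier, packaged as a module (hypothetical path).
    from ingest.extract import extract_json_bz2, extract_tar_bz2

    # Map each source to the extraction callable for its file type.
    # File names stand in for the MTGJSON artifacts actually pulled.
    SOURCES = {
        "prices": ("AllPrices.json.bz2", extract_json_bz2),
        "printings": ("AllPrintings.json.bz2", extract_json_bz2),
        "decks": ("AllDeckFiles.tar.bz2", extract_tar_bz2),
    }

    def write_csv_fallback(source_name: str) -> None:
        """Placeholder: write a plain CSV copy of the staged data so analysts
        can validate it before the schema work is finished."""
        ...

    with DAG(
        dag_id="mtgjson_ingest",
        schedule="@daily",
        start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
        catchup=False,
    ) as dag:
        for name, (filename, extract_fn) in SOURCES.items():
            with TaskGroup(group_id=f"ingest_{name}"):
                extract = PythonOperator(
                    task_id="extract",
                    python_callable=extract_fn,
                    op_args=[filename],
                )
                fallback = PythonOperator(
                    task_id="fallback_csv",
                    python_callable=write_csv_fallback,
                    op_args=[name],
                )
                extract >> fallback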

This modular approach gave us immediate wins: raw data was quickly decompressed, normalized, and stored. Price data, where fast reads and complex joins matter most, was written to Parquet. For everything else, we used CSV for transparency and easier debugging.
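In Spark terms, the split came down to where (and how) each table gets written. A simplified sketch, with made-up paths, MTGJSON's nested structures left unflattened, and column names shown only for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mtgjson_staging").getOrCreate()

    # Read the staged JSON; paths are illustrative.
    prices = spark.read.json("data/staged/AllPrices.json")
    sets = spark.read.json("data/staged/SetList.json")

    # Price data lands in Parquet for fast reads and join-heavy queries...
    prices.write.mode("overwrite").parquet("data/curated/prices")

    # ...while reference tables stay in CSV so analysts can open and inspect them directly.
    # (Keep scalar columns only; CSV cannot hold nested structs.)
    sets.select("code", "name", "releaseDate") \
        .write.mode("overwrite").option("header", True).csv("data/curated/sets")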

Step Two: Making the Data Trustworthy

Once we had the data, the next problem was trust. Joins were failing silently. IDs that looked identical in Excel weren’t matching in Spark. After a few deep dives, we realized issues like hidden whitespace, encoding artifacts, and inconsistent casing were breaking everything.

We developed utility functions to clean and normalize UUIDs, applied them at every stage, and introduced safeguards for inspecting schemas across sources. This might seem minor, but it unblocked entire workflows and let analysts move with confidence.
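In code terms, the fix came down to a couple of small helpers, sketched here rather than reproduced exactly:

    from pyspark.sql import Column, DataFrame
    from pyspark.sql import functions as F

    def normalize_uuid(col: Column) -> Column:
        """Strip whitespace and invisible characters, then lowercase, so UUID
        join keys from different sources actually match."""
        cleaned = F.regexp_replace(col, r"[\s\u00A0\uFEFF]", "")  # spaces, NBSPs, BOMs
        return F.lower(cleaned)

    def schema_diff(left: DataFrame, right: DataFrame) -> set:
        """Safeguard: surface column or type drift between two sources."""
        a = {(f.name, f.dataType.simpleString()) for f in left.schema.fields}
        b = {(f.name, f.dataType.simpleString()) for f in right.schema.fields}
        return a ^ b

Applied on both sides of every join (cards.withColumn("uuid", normalize_uuid(F.col("uuid"))), and likewise for prices), the silent mismatches stopped.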

Step Three: Building a Feature Platform

Our final push was toward dynamic feature engineering. Rather than hardcoding new features every time a request came in, we built a registry-driven system that lets analysts define features using a JSON configuration. These features are computed on-demand using Spark, and the results are instantly usable in downstream workflows.

This gave analysts autonomy without sacrificing consistency. They could define filters, rankings, and aggregations—all without touching the pipeline code. We knew this wasn’t the end state, but it let us build a platform where experimentation could thrive.
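To make the registry idea concrete, here is a stripped-down sketch. The config keys, column names, and compute_feature helper are ours for illustration; the real registry handles more operators, rankings, and validation.

    import json
    from pyspark.sql import DataFrame, functions as F

    # An analyst-authored feature definition (key names are illustrative).
    FEATURE_CONFIG = json.loads("""
    {
      "name": "avg_rare_price_by_set",
      "filters": [{"column": "rarity", "op": "==", "value": "rare"}],
      "group_by": ["setCode"],
      "aggregations": [{"column": "price", "func": "avg", "alias": "avg_price"}]
    }
    """)

    def compute_feature(df: DataFrame, config: dict) -> DataFrame:
        """Turn a JSON feature definition into a Spark query, computed on demand."""
        for flt in config.get("filters", []):
            if flt["op"] == "==":
                df = df.filter(F.col(flt["column"]) == flt["value"])
        aggs = [
            getattr(F, agg["func"])(agg["column"]).alias(agg["alias"])
            for agg in config.get("aggregations", [])
        ]
        if config.get("group_by"):
            df = df.groupBy(*config["group_by"]).agg(*aggs)
        return df

Running compute_feature on a joined cards-and-prices DataFrame with this config returns average rare-card prices per set; swapping the JSON swaps the feature, with no change to pipeline code.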

The Bigger Picture

This wasn’t just about getting card prices or set data into a table. It was about removing friction between the people who ask the questions and the infrastructure that can answer them. By investing in modularity, rapid ingestion, and analyst-friendly tooling, we turned a messy dataset into a living system that supports insight, exploration, and product thinking.

The work isn’t done. We’ll continue to refine performance, reduce recompute, and build smarter materialization strategies. But what started as a workaround is now a foundation—and one that’s already powering analysis across multiple teams.