The "50,000 SKU Problem" in CPG Demand Forecasting
When our team first sat down with a major CPG client facing the challenge of forecasting demand for their massive portfolio of 50,000+ SKUs, I have to admit - I was a bit intimidated! The traditional approach of building individual time series models for each product would have taken weeks of computation time and likely produced inconsistent results.
But then I remembered something fascinating I'd seen in the M5 forecasting competition (where Kaggle participants competed to forecast Walmart sales) - the top performers weren't using individual models per product. Instead, they were using global, hierarchical approaches that allowed information to flow across related products. This insight changed everything for us!
Why Individual SKU Models Fall Short
Let me walk you through why the traditional approach breaks down at scale:
Imagine you're trying to forecast demand for 50,000 different products. The conventional approach would look something like this in pseudocode:
# Traditional approach - a separate model for each SKU
for sku in all_50000_skus:
    sku_data = extract_history_for_sku(sku)
    sku_model = train_forecasting_model(sku_data)
    save_model(sku, sku_model)

# When forecasting time comes:
for sku in all_50000_skus:
    sku_model = load_model_for_sku(sku)
    forecast = sku_model.generate_forecast()
This seems logical at first glance - each SKU gets its own dedicated model. But there are serious problems with this approach:
- Computation nightmare: Training and maintaining 50,000 separate models is incredibly resource-intensive
- Cold-start problem: New or low-volume SKUs have insufficient data for reliable forecasting
- No cross-learning: A cereal product can't learn from patterns observed in similar cereals
- Inconsistent hierarchies: Category-level forecasts don't naturally align with the sum of individual SKU forecasts
- Reinventing the wheel: Each model has to independently discover the same seasonal patterns
Our "Aha!" Moment: The Global Approach
The breakthrough came when we realized we could flip the problem on its head. Instead of building a model per SKU, what if we built a few powerful models that could predict any SKU's demand based on its characteristics and historical patterns?
Here's the conceptual approach we developed:
# Step 1: Transform the data from time-series format to a feature-based format
feature_data = []
for sku, date in historical_data:
    features = extract_features(sku, date)  # day of week, month, lag features, etc.
    sales = get_sales(sku, date)
    feature_data.append([sku, date, features, sales])

# Step 2: Build hierarchical models at different levels
company_model = train_model(feature_data)

category_models = {}
for category in product_categories:
    category_data = filter_data_by_category(feature_data, category)
    category_models[category] = train_model(category_data)

subcategory_models = {}
for subcategory in product_subcategories:
    subcategory_data = filter_data_by_subcategory(feature_data, subcategory)
    subcategory_models[subcategory] = train_model(subcategory_data)

# Step 3: Forecast using an ensemble of predictions with learned weights
# First, learn optimal weights on validation data
validation_predictions = []
for sku, date in validation_period:
    features = extract_features(sku, date)
    actual_sales = get_sales(sku, date)

    # Get predictions from each level of the hierarchy
    company_prediction = company_model.predict(features)
    category_prediction = category_models[sku.category].predict(features)
    subcategory_prediction = subcategory_models[sku.subcategory].predict(features)

    # Store all predictions alongside the actual values
    validation_predictions.append([
        sku, date, actual_sales,
        company_prediction, category_prediction, subcategory_prediction
    ])

# Learn optimal weights using gradient descent or a meta-model
ensemble_weights = optimize_weights_to_minimize_error(validation_predictions)

# Now use these learned weights for the final forecast
for sku, future_date in forecast_horizon:
    features = extract_features(sku, future_date)

    # Get predictions from each level of the hierarchy
    company_prediction = company_model.predict(features)
    category_prediction = category_models[sku.category].predict(features)
    subcategory_prediction = subcategory_models[sku.subcategory].predict(features)

    # Apply the learned weights (which might vary by category or even by SKU!)
    weights = ensemble_weights[sku.category]
    final_prediction = (weights.company * company_prediction +
                        weights.category * category_prediction +
                        weights.subcategory * subcategory_prediction)
This approach was transformative! Instead of 50,000 models, we needed only about 26: one company-level model, five category-level models, and roughly 20 subcategory-level models. The magic happens because each model sees data across many products, learning common patterns that individual models would miss.
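One piece the pseudocode leaves abstract is optimize_weights_to_minimize_error. As a minimal sketch of one way it could work - not our exact production code - here it is using non-negative least squares from SciPy, assuming the validation predictions have been collected into a pandas DataFrame with the column names shown (those names are my placeholders):

from collections import namedtuple

import numpy as np
import pandas as pd
from scipy.optimize import nnls

Weights = namedtuple("Weights", ["company", "category", "subcategory"])

def optimize_weights_to_minimize_error(validation_df: pd.DataFrame) -> dict:
    """Fit one set of non-negative ensemble weights per category.

    Expects one row per (sku, date) with columns: category, actual_sales,
    company_prediction, category_prediction, subcategory_prediction.
    """
    pred_cols = ["company_prediction", "category_prediction", "subcategory_prediction"]
    weights = {}
    for category, group in validation_df.groupby("category"):
        A = group[pred_cols].to_numpy()       # one column per hierarchy level
        b = group["actual_sales"].to_numpy()  # what actually sold
        w, _ = nnls(A, b)                     # non-negative least-squares fit
        if w.sum() == 0:
            w = np.ones(len(pred_cols))       # degenerate case: fall back to equal weights
        weights[category] = Weights(*(w / w.sum()))  # normalise so the weights sum to 1
    return weights

Normalising the weights keeps the blend interpretable; if you want weights that vary by SKU rather than by category, a small meta-model over the three predictions is the natural next step.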
The Secret Sauce: Feature Engineering
The real power in our approach comes from how we transform time series data into features that help our models understand each SKU's behavior. Here are some of the key features we engineered:
Time-Based Features
We encoded cyclical time patterns using sine and cosine transformations:
# Instead of using raw day of week (0-6), we encode it circularly
day_of_week_sin = sin(2π * day_of_week / 7)
day_of_week_cos = cos(2π * day_of_week / 7)
# Same for month of year (1-12)
month_sin = sin(2π * month / 12)
month_cos = cos(2π * month / 12)
This circular encoding ensures our models understand that Sunday (day 6) is adjacent to Monday (day 0) in the weekly cycle.
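In pandas, this takes only a few lines. Here's a minimal sketch, assuming a DataFrame with a datetime column named date (a placeholder name, not anything fixed by our pipeline):

import numpy as np
import pandas as pd

def add_cyclical_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add sine/cosine encodings of day-of-week and month to a copy of df."""
    out = df.copy()
    day_of_week = out["date"].dt.dayofweek   # Monday = 0 ... Sunday = 6
    month = out["date"].dt.month             # 1 ... 12
    out["day_of_week_sin"] = np.sin(2 * np.pi * day_of_week / 7)
    out["day_of_week_cos"] = np.cos(2 * np.pi * day_of_week / 7)
    out["month_sin"] = np.sin(2 * np.pi * month / 12)
    out["month_cos"] = np.cos(2 * np.pi * month / 12)
    return out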
Lag and Window Features
We captured historical patterns with various lag and window features:
# Recent values
sales_lag7 = sales from 7 days ago for this SKU
sales_lag14 = sales from 14 days ago for this SKU
sales_lag28 = sales from 28 days ago for this SKU
# Moving averages
sales_mean7 = average sales for past 7 days for this SKU
sales_mean28 = average sales for past 28 days for this SKU
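In pandas, these per-SKU lags and trailing means come from groupby, shift, and rolling. A minimal sketch, assuming a long-format frame with columns sku, date, and sales, one row per SKU per day:

import pandas as pd

def add_lag_and_window_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add per-SKU lag and trailing-mean features to a copy of df."""
    out = df.sort_values(["sku", "date"]).copy()
    by_sku = out.groupby("sku")["sales"]
    for lag in (7, 14, 28):
        out[f"sales_lag{lag}"] = by_sku.shift(lag)
    for window in (7, 28):
        # shift(1) keeps the day being predicted out of its own window
        out[f"sales_mean{window}"] = by_sku.transform(
            lambda s, w=window: s.shift(1).rolling(w).mean()
        )
    return out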
Hierarchical Features
This is where the cross-learning magic happens:
# Category-level signals
category_avg_sales = average sales across all SKUs in this category for this date
category_trend = % change in category sales over past 28 days
# Subcategory-level signals
subcategory_avg_sales = average sales across all SKUs in this subcategory for this date
subcategory_seasonality = seasonal index for this subcategory at this time of year
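These signals are again just groupbys over the long-format data. A minimal sketch (column names are placeholders, and in production you would lag these signals so a forecast never peeks at same-day actuals):

import pandas as pd

def add_category_signals(df: pd.DataFrame) -> pd.DataFrame:
    """Add category-level aggregates that every SKU in the category shares.

    Assumes columns sku, category, date, sales with one row per SKU per day;
    subcategory signals work the same way with a subcategory column.
    """
    out = df.copy()
    # Average same-day sales across all SKUs in the category
    out["category_avg_sales"] = out.groupby(["category", "date"])["sales"].transform("mean")
    # % change in total category sales over the past 28 days
    # (28 rows back == 28 days, assuming a continuous daily calendar per category)
    daily = out.groupby(["category", "date"])["sales"].sum().rename("category_total")
    trend = daily.groupby(level="category").pct_change(28).rename("category_trend")
    return out.merge(trend.reset_index(), on=["category", "date"], how="left")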
Statistical Encodings for SKU IDs
Here's a game-changing technique we implemented: instead of treating SKU IDs as mere categorical variables, we encoded each SKU with statistical properties of its historical time series, computed with tsfresh. This made a dramatic difference for our models:
# For each SKU, extract statistical features from its time series
sku_statistical_encodings = {}
for sku in all_skus:
    sku_history = get_sales_history(sku)
    # Extract 40+ statistical properties using tsfresh, for example:
    sku_stats = {
        'mean': ...,                   # mean of historical sales
        'std': ...,                    # standard deviation of sales
        'trend': ...,                  # linear trend coefficient
        'seasonality_strength': ...,   # seasonality test statistic
        'autocorrelation_7': ...,      # autocorrelation at lag 7
        'autocorrelation_14': ...,     # autocorrelation at lag 14
        'peak_frequencies': ...,       # dominant frequencies from FFT
        'quantiles': ...,              # [10%, 25%, 50%, 75%, 90%] of the distribution
        'kurtosis': ...,               # "tailedness" of the sales distribution
        # ... many more statistical features
    }
    # Store these statistical encodings to use as features
    sku_statistical_encodings[sku] = sku_stats
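As a sketch of how that encoding table can be produced with tsfresh (the long-format column names are mine, and the exact set of extracted features depends on the parameter settings you pick):

import pandas as pd
from tsfresh import extract_features
from tsfresh.feature_extraction import EfficientFCParameters

def build_sku_statistical_encodings(sales_long: pd.DataFrame) -> pd.DataFrame:
    """Return one row of statistical descriptors per SKU.

    `sales_long` is assumed to have columns sku, date, sales.
    """
    encodings = extract_features(
        sales_long,
        column_id="sku",        # one output row per SKU
        column_sort="date",     # order each series in time
        column_value="sales",   # the series to describe
        default_fc_parameters=EfficientFCParameters(),
    )
    # Columns come out named like sales__mean, sales__standard_deviation,
    # sales__autocorrelation__lag_7, ... - these get joined back onto the
    # training rows as SKU-level features.
    return encodings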
This approach allowed our models to understand the "character" of each SKU's behavior. Instead of starting from scratch with each SKU ID, the model could immediately understand: "This is a high-volume, low-variability SKU with strong weekly seasonality but minimal monthly seasonality."
These hierarchical features and statistical encodings allow information to flow across related products. If all breakfast cereals show a sales spike during back-to-school season, our models can learn this pattern at the subcategory level and apply it even to newly introduced cereal products!
The Model Selection Journey: Finding the Perfect Algorithm
One of the most challenging aspects of this project was selecting the right algorithm for our hierarchical models. We knew this decision would be critical, so we conducted extensive experiments:
# Pseudocode for our model selection process
algorithms = ['RandomForest', 'XGBoost', 'LightGBM', 'Neural Networks', 'Linear Models']
hyperparameter_grids = {...}  # a different grid for each algorithm

best_performance = float('inf')
best_config = None
for algorithm in algorithms:
    for hyperparameter_set in grid_search(hyperparameter_grids[algorithm]):
        # Train with this algorithm and hyperparameter set
        model = train_with_cross_validation(algorithm, hyperparameter_set)
        performance = evaluate_model(model)  # an error metric, so lower is better
        if performance < best_performance:
            best_performance = performance
            best_config = (algorithm, hyperparameter_set)
This process was incredibly time-consuming! Training neural networks alone took over a week, as we experimented with different architectures, from simple MLPs to more complex temporal fusion transformers.
Random Forests performed well but were slow at inference time once the forests grew large. Linear models were blazing fast but couldn't capture complex interactions.
XGBoost emerged as our champion - offering the best balance of accuracy, training speed, and inference performance. The key hyperparameters we found most impactful were the following (a configuration sketch follows the list):
- A moderate tree depth (6-8)
- A relatively high number of estimators (800-1000)
- A modest learning rate (0.01-0.05)
- Careful subsampling and feature sampling
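To make those ranges concrete, here is roughly how they translate into XGBoost's scikit-learn wrapper. The exact values below are illustrative midpoints of the ranges above, not our production settings:

from xgboost import XGBRegressor

# One model like this is trained per hierarchy level (company, category, subcategory).
model = XGBRegressor(
    max_depth=7,            # moderate tree depth (we searched 6-8)
    n_estimators=900,       # relatively high number of trees (800-1000)
    learning_rate=0.03,     # modest learning rate (0.01-0.05)
    subsample=0.8,          # row subsampling per tree
    colsample_bytree=0.8,   # feature subsampling per tree
    tree_method="hist",     # histogram-based training scales to large datasets
    n_jobs=-1,
)
# model.fit(X_train, y_train), where X_train holds the engineered features described above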
The Results: Mind-Blowing Improvements
When we implemented this approach for our CPG client, the results honestly blew us away:
- Accuracy Boost: Overall forecast error (MAPE) dropped by 15.3% compared to their previous approach
- Speed Gains: What used to take 72+ hours now completed in just 5 hours (93% faster!)
- New Product Handling: Forecast accuracy for products with less than 6 months of history improved by 27%
- Consistent Hierarchies: No more mismatches between SKU, subcategory, and category forecasts
Let me share a real example: For one particular beverage SKU with highly variable sales, the individual model approach yielded a poor MAPE of 42%. Our global approach reduced this to just 23% - simply because the model could leverage learning from similar beverages that experienced the same seasonality and promotion effects!
What We Learned from the M5 Competition
The M5 forecasting competition provided incredible insights that fueled our approach:
- Feature engineering trumps model complexity - The best solutions used relatively simple models with rich features
- Hierarchical learning is essential - No top solution used pure individual SKU modeling
- Ensembles are powerful - Combining predictions from different hierarchical levels provides robustness
- Test setup matters - Using time-based validation that mimics real forecasting scenarios is critical (a minimal example follows this list)
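That last point is easy to get wrong, so here is what a time-based holdout looks like in code (the date column name and the 28-day horizon are placeholders of mine):

import pandas as pd

def time_based_split(df: pd.DataFrame, horizon_days: int = 28):
    """Train on the past, validate on a later untouched window - exactly
    what the model will face in production. Repeat with earlier cutoffs
    for a rolling-origin evaluation."""
    cutoff = df["date"].max() - pd.Timedelta(days=horizon_days)
    train = df[df["date"] <= cutoff]
    valid = df[df["date"] > cutoff]
    return train, valid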
Challenges We Faced (and How We Solved Them)
No implementation this complex comes without challenges. Here's how we tackled the major ones:
Challenge 1: Data Sparsity for New SKUs
For brand new products with no history, we created a "product similarity engine" that identified existing products with similar attributes and borrowed their seasonal patterns. This reduced new product forecast error by nearly 40%!
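At its core, the similarity engine is a nearest-neighbour lookup over product attributes. Here's a simplified sketch with scikit-learn - the attribute table and its columns are placeholders, and the real engine did considerably more:

import pandas as pd
from sklearn.neighbors import NearestNeighbors

def find_similar_skus(attributes: pd.DataFrame, new_sku: str, k: int = 5) -> list:
    """Return the k existing SKUs most similar to a new SKU.

    `attributes` is assumed to be indexed by SKU with numeric columns only
    (price point, pack size, one-hot brand/flavor, and so on).
    """
    existing = attributes.drop(index=new_sku)
    nn = NearestNeighbors(n_neighbors=k).fit(existing.values)
    _, idx = nn.kneighbors(attributes.loc[[new_sku]].values)
    return existing.index[idx[0]].tolist()

The new product then borrows a blend of its neighbours' seasonal indices until it has accumulated enough history of its own.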
Challenge 2: Computational Requirements
Even with fewer models, processing features for billions of SKU-day combinations was intense. We implemented a distributed Spark pipeline that processed features in parallel, cutting computation time dramatically.
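For a flavour of that pipeline, here is a minimal PySpark sketch of computing lag features in parallel across SKUs (the sales DataFrame and its column names are assumptions; the real pipeline had many more stages):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# `sales` is assumed to be a Spark DataFrame with columns sku, date, sales.
w = Window.partitionBy("sku").orderBy("date")

features = (
    sales
    .withColumn("sales_lag7", F.lag("sales", 7).over(w))
    .withColumn("sales_lag28", F.lag("sales", 28).over(w))
    # Trailing 28-day mean that excludes the current day (assumes one row per day)
    .withColumn("sales_mean28", F.avg("sales").over(w.rowsBetween(-28, -1)))
)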
Challenge 3: Handling Promotions
Promotions create huge spikes that can throw off models. We developed a specialized promotional module that estimates promotion lift at the category level, then adjusts it for each SKU based on historical promotion response.
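Here is the shape of that idea in pandas - a deliberately simplified sketch (the column names, the boolean on_promo flag, and the plain ratio estimator are all my assumptions; the real module also controlled for seasonality and price depth):

import pandas as pd

def estimate_promo_lift(df: pd.DataFrame) -> pd.Series:
    """Category-level lift as a prior, overridden by the SKU's own promo
    response wherever it has promo history of its own.

    Assumes columns sku, category, sales, on_promo (boolean).
    """
    def lift(frame: pd.DataFrame, keys: list) -> pd.Series:
        # Ratio of average promo-day sales to average non-promo sales
        means = frame.groupby(keys + ["on_promo"])["sales"].mean().unstack("on_promo")
        return means[True] / means[False]

    category_lift = lift(df, ["category"])   # indexed by category
    sku_lift = lift(df, ["sku"])             # indexed by sku, NaN where no promo history

    sku_to_category = df.drop_duplicates("sku").set_index("sku")["category"]
    prior = sku_to_category.map(category_lift)            # category lift aligned to each SKU
    return sku_lift.reindex(prior.index).fillna(prior)    # fall back to the category prior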
Conclusion: Global is the Future
The shift from local to global modeling represents a fundamental change in how we approach large-scale forecasting problems. By leveraging hierarchical relationships and allowing information to flow across related products, we achieved significant improvements in both accuracy and efficiency.
If you're dealing with large-scale forecasting challenges (whether in retail, CPG, or any industry with many forecasting targets), I strongly encourage you to consider a global approach. The computational savings alone are worth it, but the accuracy improvements will truly transform your planning capabilities.
In future posts, I'll dig deeper into other aspects of this approach:
- How we handle promotional events within the global framework
- Incorporating external factors like weather and economic indicators
- Deploying these models in production environments
- Communicating forecast uncertainty to business stakeholders
Have you experimented with global forecasting approaches? I'd love to hear about your experiences in the comments!