The current repository intentionally starts with Binance Spot daily OHLCV because it is easy to audit, reproducible, and sufficient for a strong practical baseline.
But Binance-only history still leaves several gaps:
- some assets were liquid elsewhere before Binance listed them
- some earlier cycles are only partially visible in Binance history
- Binance-only history can understate how long a coin has really been a mainstream tradable asset
- large-cap / mainstream filtering would eventually benefit from optional market-cap context
This roadmap deliberately keeps the first phase narrow and practical.
Priority order:
- pre-Binance daily history merge
- alternate exchange daily history merge
- optional market-cap metadata
Not first priority:
- news / sentiment
- complex on-chain features
- social data
- derivatives funding or liquidation features
The repository now includes:
- src/external_data.py
- scripts/validate_external_data.py
- external-data config blocks in config/default.yaml
Implemented concepts:
- provider abstraction for local CSV history sources
- provider abstraction for local CSV metadata sources
- merge rules with explicit source priority
- duplicate-date resolution
- source tagging per row
- optional market-cap metadata loader
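As a rough illustration of what a local CSV history provider with per-row source tagging might look like, here is a minimal standard-library sketch. It is not the code in src/external_data.py: the function name `load_csv_history`, the column names, and the provider label `local_csv` are all assumptions made for this example.

```python
import csv
import io

# Canonical daily OHLCV columns; the exact names are an assumption for this sketch.
CANONICAL_COLUMNS = ("date", "open", "high", "low", "close", "volume")


def load_csv_history(csv_text, data_source, data_provider):
    """Parse CSV text into canonical daily rows tagged with their origin."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        # Keep the date as an ISO string; numeric columns become floats.
        row = {"date": record["date"]}
        for column in CANONICAL_COLUMNS[1:]:
            row[column] = float(record[column])
        # Source tagging per row, so a later merge can rank by origin.
        row["data_source"] = data_source
        row["data_provider"] = data_provider
        rows.append(row)
    return rows


# Made-up sample data, only to exercise the loader.
sample = """date,open,high,low,close,volume
2017-07-13,1.9,2.1,1.8,2.0,1000
2017-07-14,2.0,2.3,1.9,2.1,1500
"""
history = load_csv_history(sample, "pre_binance", "local_csv")
```

Tagging each row at load time keeps the merge step independent of where a file came from: it only needs the `data_source` value to apply the configured priority.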
The intended default merge behavior is:
- normalize every source into a canonical daily OHLCV schema
- attach data_source and data_provider
- rank sources by configured priority
- on duplicate dates, keep the higher-priority source
- sort the final frame by date
- ensure the merged frame is monotonic and deduplicated
Default priority, highest first:
- binance
- alternate_exchange
- pre_binance
Practical interpretation:
- if Binance and external overlap, Binance wins by default
- if Binance starts late, earlier external rows can extend the history backward
- if Binance has gaps and an alternate exchange covers them, the alternate exchange can fill those dates
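The merge rules and the three cases above can be sketched in plain Python. This is an illustrative sketch only, not the repository's implementation: the function name `merge_daily_history`, the row layout, and the sample values are assumptions, and a real version would likely operate on data frames rather than lists of dicts.

```python
from datetime import date

# Priority order described in this document: a lower rank wins on duplicate dates.
SOURCE_PRIORITY = {"binance": 0, "alternate_exchange": 1, "pre_binance": 2}


def merge_daily_history(rows, priority=SOURCE_PRIORITY):
    """Merge canonical daily rows, keeping the highest-priority source per date."""
    best = {}
    for row in rows:
        current = best.get(row["date"])
        if current is None or priority[row["data_source"]] < priority[current["data_source"]]:
            best[row["date"]] = row
    # Sorting the unique dates yields a monotonic, deduplicated result.
    return [best[day] for day in sorted(best)]


# Made-up rows covering the three cases: backward extension, overlap, gap fill.
rows = [
    {"date": date(2017, 7, 13), "close": 2.0, "data_source": "pre_binance"},
    {"date": date(2017, 7, 14), "close": 2.1, "data_source": "pre_binance"},
    {"date": date(2017, 7, 14), "close": 2.2, "data_source": "binance"},
    {"date": date(2017, 7, 15), "close": 2.3, "data_source": "alternate_exchange"},
    {"date": date(2017, 7, 16), "close": 2.4, "data_source": "binance"},
]
merged = merge_daily_history(rows)
```

In this sample the pre-Binance row for 13 July extends the history backward, the Binance row wins the overlapping 14 July date, and the alternate-exchange row fills the 15 July gap.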
Before external_data.enabled is turned on in production, the following still need to be validated with real provider data:
- symbol mapping quality across providers
- timestamp normalization and timezone consistency
- split / redenomination / contract migration edge cases
- quote-volume comparability across exchanges
- source coverage stability across major coins
- backtest parity checks between Binance-only and merged-history runs
- live-pool stability comparison between Binance-only and merged-history runs
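For the timestamp item in particular, one check worth having is that every provider's bars land on the same calendar day once converted to a common timezone. The sketch below assumes daily bars are keyed on UTC calendar dates; the helper name `to_utc_date` is hypothetical and the timestamp is made up.

```python
from datetime import datetime, timedelta, timezone


def to_utc_date(ts):
    """Return the UTC calendar date of a timezone-aware timestamp."""
    if ts.tzinfo is None:
        # Refuse to guess: a naive timestamp hides the provider's timezone.
        raise ValueError("naive timestamp: timezone must be explicit")
    return ts.astimezone(timezone.utc).date()


# A bar stamped late in the evening at UTC-5 belongs to the next UTC day.
utc_minus_5 = timezone(timedelta(hours=-5))
late_bar = datetime(2021, 1, 1, 23, 30, tzinfo=utc_minus_5)
utc_day = to_utc_date(late_bar)
```

Rejecting naive timestamps outright is a deliberate choice in this sketch: silently assuming UTC would shift some providers' rows by a day and produce false duplicate-date conflicts during the merge.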
Recommended evaluation flow:
- run the current Binance-only baseline
- enable external data for a controlled symbol subset
- compare:
  - universe size changes
  - first-eligible dates
  - leader capture metrics
  - live pool turnover
  - top-N overlap with the Binance-only build
- only expand external usage after point-in-time behavior is verified
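Two of the comparisons in this flow, top-N overlap and first-eligible dates, are simple enough to sketch directly. The function names, the ranked lists, and the dates below are hypothetical and purely illustrative; they are not results from any real run.

```python
from datetime import date


def top_n_overlap(baseline_ranked, merged_ranked, n):
    """Fraction of the baseline top-N that also appears in the merged top-N."""
    baseline_top = set(baseline_ranked[:n])
    if not baseline_top:
        return 0.0
    return len(baseline_top & set(merged_ranked[:n])) / len(baseline_top)


def first_eligible_shift_days(baseline_dates, merged_dates):
    """Days by which each symbol becomes eligible earlier in the merged build."""
    return {
        symbol: (baseline_dates[symbol] - merged_dates[symbol]).days
        for symbol in baseline_dates
        if symbol in merged_dates
    }


# Made-up rankings: the merged build swaps the third and fourth symbols.
baseline_ranked = ["BTC", "ETH", "XRP", "LTC"]
merged_ranked = ["BTC", "ETH", "LTC", "XRP"]
overlap = top_n_overlap(baseline_ranked, merged_ranked, 3)

# Made-up first-eligible dates: only BTC gains earlier history.
shifts = first_eligible_shift_days(
    {"BTC": date(2018, 1, 1), "ETH": date(2018, 1, 1)},
    {"BTC": date(2017, 10, 1), "ETH": date(2018, 1, 1)},
)
```

A low overlap or a large shift is not necessarily a problem, but either one flags symbols whose merged history should be inspected before external usage is expanded.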
The code structure is ready, but real production providers are not yet wired in.
Typical next candidates:
- Kaiko / Coin Metrics / CCXT-backed daily history
- exchange archival CSV dumps
- internal curated pre-Binance history files
- optional market-cap snapshots from a stable metadata provider
The current validation script uses mock local CSV inputs only.