ProStaff Scraper - Professional Match Data API

FastAPI service that collects and serves League of Legends professional match data. It fetches schedules from the LoL Esports API, enriches them with per-player stats from Leaguepedia, and stores everything in Elasticsearch for fast REST queries.

Features

FastAPI REST API — serve professional match data via HTTP endpoints
Two-phase live pipeline — sync (LoL Esports) + background enrichment (Leaguepedia)
Full player stats — champion, KDA, gold, CS, items (names), runes (names), summoner spells
Leaguepedia integration — only public source for competitive game data (Riot Match-V5 does not expose tournament server games)
Enrichment daemon — background job processes pending games every 30 minutes, respects rate limits
Deduplication — riot_enriched flag prevents re-processing; enrichment_attempts counter abandons after 3 failures
Historical backfill — imports ALL editions of a league from Leaguepedia (CBLOL since 2013, resumable)
Oracle's Elixir ingest — bulk-indexes OE CSV exports (97K+ games, all major leagues); idempotent (op_type=create)
Oracle's Elixir backfill — fills missing stats (damage_taken, wards) on Leaguepedia docs using OE CSV as join source
Multi-league — CBLOL, CBLOL Academy, Circuito Desafiante, LCS, LEC, LCK, LPL, and more
Production ready — Docker Compose with Traefik/SSL for Coolify deployment
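
The deduplication rule in the list above (the riot_enriched flag plus the enrichment_attempts counter) can be sketched in a few lines. This is a minimal illustration rather than the project's actual code: the GameDoc class and both helper functions are hypothetical names, and only the two field names and the limit of 3 come from this README.

```python
from dataclasses import dataclass

MAX_ENRICHMENT_ATTEMPTS = 3  # from the feature list: abandon after 3 failures


@dataclass
class GameDoc:
    """Hypothetical in-memory view of one document in lol_pro_matches."""
    match_id: str
    riot_enriched: bool = False
    enrichment_attempts: int = 0


def needs_enrichment(doc: GameDoc) -> bool:
    """A game is pending if it is not enriched and still has attempts left."""
    return not doc.riot_enriched and doc.enrichment_attempts < MAX_ENRICHMENT_ATTEMPTS


def record_result(doc: GameDoc, success: bool) -> GameDoc:
    """Mark the game enriched on success, otherwise count the failure."""
    if success:
        doc.riot_enriched = True
    else:
        doc.enrichment_attempts += 1
    return doc
```

With this rule a game leaves the pending set either by being enriched or by failing three times, so neither case is processed again on later runs.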

Architecture

The system has four independent pipelines that all write to the same ES index (lol_pro_matches):

Phase 1 — Live Sync (scraper-cron, every 1h)
  LoL Esports API
    └─ getCompletedEvents → series with games + YouTube VOD IDs
         └─ competitive_pipeline.py
              └─ bulk_index → ES (riot_enriched: false)

Phase 2 — Enrichment (enrichment-daemon, every 30min)
  query_unenriched(ES) → pending games
    └─ For each game (2 Leaguepedia requests + 9s sleep each):
         1. ScoreboardGames  → page_name, winner, patch, gamelength
         2. ScoreboardPlayers → 10 players with champion/KDA/items/runes
         └─ update_document(ES, riot_enriched: true, participants: [...])

Phase 3 — Historical Backfill (one-off / daily cron via Rails Sidekiq)
  Leaguepedia Tournaments table → all OverviewPages for a league
    └─ historical_backfill.py (resumable — persists progress to JSON)
         └─ For each tournament not yet completed:
              └─ leaguepedia_pipeline.py → bulk_index → ES

Phase 4 — Oracle's Elixir Ingest (one-off / manual re-run)
  OE CSV files (annual download, 97K+ games, all major leagues)
    └─ oracles_elixir_ingest.py
         └─ op_type='create' → ES (idempotent, skips duplicates)
         └─ oracles_elixir_backfill.py → fills damage_taken/wards on existing docs
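
The resumable behaviour of Phase 3 (progress persisted to JSON, completed tournaments skipped) could look roughly like the sketch below. The function names, the progress file layout (a JSON list of completed OverviewPages) and the import_tournament callback are assumptions made for illustration; historical_backfill.py may store its progress differently.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def load_completed(progress_file: Path) -> set[str]:
    """Read the set of already imported OverviewPages, if a file exists."""
    if not progress_file.exists():
        return set()
    return set(json.loads(progress_file.read_text(encoding="utf-8")))


def mark_completed(progress_file: Path, completed: set[str], page: str) -> None:
    """Persist progress after every tournament so a crash loses little work."""
    completed.add(page)
    progress_file.write_text(json.dumps(sorted(completed)), encoding="utf-8")


def run_backfill(
    overview_pages: Iterable[str],
    progress_file: Path,
    import_tournament: Callable[[str], object],
) -> list[str]:
    """Import every tournament not yet completed; return what was imported."""
    completed = load_completed(progress_file)
    imported = []
    for page in overview_pages:
        if page in completed:
            continue  # finished in an earlier run
        import_tournament(page)  # hypothetical hook for the real pipeline
        mark_completed(progress_file, completed, page)
        imported.append(page)
    return imported
```

Writing the file after each tournament, rather than once at the end, is what makes an interrupted run safe to restart.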

Why Leaguepedia instead of Riot Match-V5: competitive games run on Riot's internal tournament servers and do not appear in the public Match-V5 API. Leaguepedia receives official data from Riot's esports disclosure program and is the only public source for these stats.

Why Oracle's Elixir: broader league coverage and additional stat columns (damage_taken, wards_placed, wards_killed) not always available via Leaguepedia. OE CSVs are a login-gated annual download — re-run ingest when a new year's CSV is available.
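
As a rough illustration of the backfill idea, the sketch below fills only the stats that are missing on an existing participant and never overwrites values that are already present. The function name and the dict shapes are hypothetical, the stat names are taken from the paragraph above and may not match the real ES field names, and the join between a Leaguepedia document and an OE row is assumed to have happened already.

```python
from typing import Any

# Stat columns named in this README as OE additions.
BACKFILL_FIELDS = ("damage_taken", "wards_placed", "wards_killed")


def fill_missing_stats(participant: dict[str, Any], oe_row: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of participant with gaps filled from the matching OE row."""
    merged = dict(participant)
    for field in BACKFILL_FIELDS:
        # Existing values win; only absent or null stats are filled.
        if merged.get(field) is None and oe_row.get(field) is not None:
            merged[field] = oe_row[field]
    return merged
```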

For the full architecture diagram and detailed flow, see docs/Arquitetura.md.

API Endpoints

Public

GET /health                        # Health check (Elasticsearch connectivity)
GET /                              # Service info
GET /api/v1/leagues                # List leagues from LoL Esports
GET /api/v1/matches?league=CBLOL   # Query matches (paginated)
GET /api/v1/matches/{match_id}     # Single match with full participant stats
GET /api/v1/stats/leagues          # Match count per league

Protected (requires `X-API-Key` header)

POST /api/v1/sync?league=CBLOL&limit=50        # Trigger manual sync
POST /api/v1/enrich?batch=10                   # Trigger background enrichment
GET  /api/v1/enrich/status                     # Enrichment progress (pending/enriched counts)
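
Calling a protected endpoint only differs in the X-API-Key header. The sketch below builds the sync request with the standard library; build_sync_request is a hypothetical helper and the default base URL is the local development address.

```python
import urllib.request
from urllib.parse import urlencode


def build_sync_request(
    api_key: str,
    league: str = "CBLOL",
    limit: int = 50,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Prepare POST /api/v1/sync with the API key header attached."""
    query = urlencode({"league": league, "limit": limit})
    return urllib.request.Request(
        f"{base_url}/api/v1/sync?{query}",
        method="POST",
        headers={"X-API-Key": api_key},
    )


# To actually send it:
#   with urllib.request.urlopen(build_sync_request("your-key"), timeout=30) as resp:
#       print(resp.status)
```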

Example — Enriched Match

GET /api/v1/matches/115565621821672075_2

{
  "match_id": "115565621821672075",
  "game_number": 2,
  "league": "CBLOL",
  "patch": "26.02",
  "win_team": "Leviatan",
  "gamelength": "32:43",
  "game_duration_seconds": 1963,
  "riot_enriched": true,
  "participants": [
    {
      "summoner_name": "tinowns",
      "team_name": "paiN Gaming",
      "champion_name": "Ahri",
      "role": "Mid",
      "kills": 4, "deaths": 1, "assists": 3,
      "gold": 14320, "cs": 245, "damage": 22100,
      "win": false,
      "items": ["Rabadon's Deathcap", "Shadowflame", "Void Staff"],
      "keystone": "Electrocute",
      "primary_runes": ["Cheap Shot", "Eyeball Collection", "Treasure Hunter"],
      "secondary_runes": ["Presence of Mind", "Cut Down"],
      "stat_shards": ["Adaptive Force", "Adaptive Force", "Health"],
      "summoner_spells": ["Flash", "Ignite"]
    }
  ]
}
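
The two duration fields in the response carry the same information. A small sketch of the conversion, plus a KDA ratio computed from the participant fields, is shown below; both helpers are illustrative, and the KDA formula (kills plus assists over deaths, with zero deaths treated as one) is a common convention rather than something this API defines.

```python
def gamelength_to_seconds(gamelength: str) -> int:
    """Convert "MM:SS" (or "H:MM:SS") into a number of seconds."""
    seconds = 0
    for part in gamelength.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def kda_ratio(kills: int, deaths: int, assists: int) -> float:
    """(kills + assists) / deaths, treating zero deaths as one."""
    return (kills + assists) / max(deaths, 1)


# gamelength_to_seconds("32:43") -> 1963, matching game_duration_seconds above
# kda_ratio(4, 1, 3) -> 7.0 for the participant shown above
```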

See full Swagger UI at https://scraper.prostaff.gg/docs

Quick Start

# 1. Copy and configure environment
cp .env.example .env
# Edit .env: add RIOT_API_KEY, ESPORTS_API_KEY, SCRAPER_API_KEY

# 2. Start services (Elasticsearch + API + enrichment daemon)
docker compose up -d

# 3. Verify health
curl http://localhost:8000/health

# 4. Sync CBLOL matches
curl -X POST "http://localhost:8000/api/v1/sync?league=CBLOL&limit=20" \
  -H "X-API-Key: your-key"

# 5. Check enrichment progress (daemon runs automatically every 30min)
curl "http://localhost:8000/api/v1/enrich/status" \
  -H "X-API-Key: your-key"

# 6. Query enriched matches
curl "http://localhost:8000/api/v1/matches?league=CBLOL&limit=5"

Production Deployment

Deploy to Coolify: see DEPLOYMENT.md for the full guide.

Summary

Create Docker Compose application in Coolify
Point to repository with docker-compose.production.yml
Configure environment variables (see Environment Variables)
Set domain: scraper.prostaff.gg
Deploy and verify: curl https://scraper.prostaff.gg/health

First deploy — index creation

The lol_pro_matches Elasticsearch index is created automatically on first sync. If deploying over an existing installation with the old schema (pre-Leaguepedia), delete the index first so it is recreated with the updated mapping:

curl -X DELETE https://your-elasticsearch-host:9200/lol_pro_matches

Stack

Component	Technology
Framework	FastAPI 0.115 (async REST API)
Server	Uvicorn (ASGI)
Language	Python 3.11
HTTP client	httpx + tenacity (retry/backoff)
Data validation	Pydantic 2.9
Storage	Elasticsearch 8.x
Deployment	Docker Compose + Traefik (Coolify)
Data sources	LoL Esports Persisted Gateway, Leaguepedia Cargo API, Oracle's Elixir CSV

File Structure

ProStaff-Scraper/
├── api/
│   └── main.py                      # FastAPI: all endpoints
├── providers/
│   ├── esports.py                   # LoL Esports Gateway API client
│   ├── leaguepedia.py               # Leaguepedia Cargo API client
│   │                                #   get_game_scoreboard() + get_game_players()
│   ├── riot.py                      # Riot Account/Match V5 client
│   └── riot_rate_limited.py         # Riot client with rate limit tiers
├── etl/
│   ├── competitive_pipeline.py      # Phase 1: live sync from LoL Esports
│   ├── enrichment_pipeline.py       # Phase 2: enrich from Leaguepedia (daemon)
│   ├── historical_backfill.py       # Phase 3: full league history from Leaguepedia
│   ├── leaguepedia_pipeline.py      # Leaguepedia game + player indexer (used by phases 2 & 3)
│   ├── oracles_elixir_ingest.py     # Phase 4: bulk ingest from Oracle's Elixir CSVs
│   ├── oracles_elixir_backfill.py   # Phase 4b: fill missing stats using OE CSV join
│   └── historical_data_migration.py # One-off migration helper (legacy)
├── indexers/
│   ├── elasticsearch_client.py      # ES helpers (bulk, update, query_unenriched)
│   └── mappings.py                  # Index mappings (participant fields are strings)
├── docs/
│   └── Arquitetura.md               # Full architecture documentation
├── docker-compose.yml               # Development (ES + Kibana + API + enrichment)
├── docker-compose.production.yml    # Production (Coolify + Traefik, 3 services)
├── Dockerfile.production            # Production Docker image
├── DEPLOYMENT.md                    # Coolify deployment guide
├── QUICKSTART.md                    # 5-minute setup guide
├── requirements.txt                 # Python dependencies (elasticsearch==8.13.1)
└── .env.example                     # Environment variables template

Environment Variables

See .env.example for the full template.

Required

Variable	Description
`ESPORTS_API_KEY`	LoL Esports Persisted Gateway key (for sync)
`RIOT_API_KEY`	Riot Games API key (for sync, not needed for enrichment)
`SCRAPER_API_KEY`	Secret key to protect write endpoints (sync, enrich)

Optional

Variable	Default	Description
`ELASTICSEARCH_URL`	`http://elasticsearch:9200`	ES connection URL
`DEFAULT_PLATFORM_REGION`	`BR1`	Default Riot platform region
`API_PORT`	`8000`	FastAPI server port
`CORS_ALLOWED_ORIGINS`	`https://api.prostaff.gg,...`	Comma-separated allowed origins

Scraper cron settings

Variable	Default	Description
`SYNC_LEAGUES`	`CBLOL`	Space-separated leagues to sync
`SYNC_INTERVAL_HOURS`	`1`	Sync interval in hours
`SYNC_LIMIT`	`100`	Match limit per league per run

Note: RIOT_API_KEY is only used by the sync pipeline to call LoL Esports endpoints. The enrichment daemon uses Leaguepedia anonymously — no API key required.

ETL scripts (oracles_elixir_ingest.py, historical_backfill.py) must be run with the project's .venv (Python 3.11, elasticsearch==8.13.1). A system-wide elasticsearch package may be v9, which is incompatible with an ES 8.x server. Create the venv once:
python3.11 -m venv .venv && .venv/bin/pip install -r requirements.txt

Troubleshooting

GET /health returns 503

Elasticsearch is probably still starting. Wait about 30s and retry; if the error persists, check the Elasticsearch container logs:

docker logs prostaff-scraper-elasticsearch-1 | tail -20

GET /api/v1/matches returns empty

Run a sync first:

curl -X POST "http://localhost:8000/api/v1/sync?league=CBLOL&limit=20" \
  -H "X-API-Key: your-key"

Enrichment stuck — all games at enrichment_attempts: 3

Leaguepedia may not have data for these games yet (common for very recent matches). Games that have reached 3 attempts are abandoned by the daemon, so once Leaguepedia has caught up, reset the counter to force a retry:

# Reset attempts for every game that reached the limit (use with care)
curl -X POST http://localhost:9200/lol_pro_matches/_update_by_query \
  -H "Content-Type: application/json" \
  -d '{"query":{"range":{"enrichment_attempts":{"gte":3}}},"script":{"source":"ctx._source.enrichment_attempts=0"}}'
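
The same reset can be issued from Python. The sketch below only builds the request, using the standard library; build_reset_request is a hypothetical helper, and the URL, index name and request body mirror the curl command above.

```python
import json
import urllib.request


def build_reset_request(
    es_url: str = "http://localhost:9200",
    index: str = "lol_pro_matches",
    min_attempts: int = 3,
) -> urllib.request.Request:
    """Prepare an _update_by_query call that zeroes enrichment_attempts."""
    body = {
        "query": {"range": {"enrichment_attempts": {"gte": min_attempts}}},
        "script": {"source": "ctx._source.enrichment_attempts=0"},
    }
    return urllib.request.Request(
        f"{es_url}/{index}/_update_by_query",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# To send it:
#   with urllib.request.urlopen(build_reset_request(), timeout=60) as resp:
#       print(json.load(resp))
```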

Leaguepedia rate limit errors in logs

Expected during rapid testing. The enrichment daemon waits 9s between Leaguepedia requests, and a failed request is retried up to 3 times before enrichment_attempts is incremented.
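
The retry behaviour can be pictured with the standard-library sketch below. It is not the daemon's code: the Stack section lists tenacity for retry/backoff, whereas this example uses a plain loop, and RateLimitError and call_with_retries are hypothetical names. It also assumes "up to 3 times" means three calls in total.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

REQUEST_DELAY_SECONDS = 9  # pause between Leaguepedia requests (from this README)
MAX_TRIES = 3              # assumption: three calls in total


class RateLimitError(Exception):
    """Hypothetical error raised when Leaguepedia rejects a request."""


def call_with_retries(
    request: Callable[[], T],
    tries: int = MAX_TRIES,
    delay: float = REQUEST_DELAY_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Return the first successful result, or None once all tries fail."""
    for _ in range(tries):
        try:
            return request()
        except RateLimitError:
            sleep(delay)  # wait before the next request is sent
    return None  # the caller would now increment enrichment_attempts
```

The sleep function is injectable so the logic can be exercised without actually waiting 9 seconds.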

401 Unauthorized on sync/enrich endpoints

Ensure X-API-Key header matches SCRAPER_API_KEY in your .env.

Elasticsearch mapping conflict after upgrading from old schema

The participant fields changed from integer IDs to string names. Delete the index and let it be recreated:

curl -X DELETE http://localhost:9200/lol_pro_matches
# Restart API and run sync — index is recreated automatically

Integration with ProStaff API

The Rails API (prostaff-api) talks to this scraper in two ways:

Live match sync — ProStaffScraperService calls the scraper's REST API:

POST /api/v1/sync — trigger a sync run
GET /api/v1/enrich/status — poll enrichment progress
Used by SyncScraperMatchesJob and HistoricalBackfillJob

Direct ES queries — ElasticsearchClient queries the shared lol_pro_matches index directly:

GET /competitive/pro-matches/match-preview — per-game picks + stats for a recent series
GET /competitive/pro-matches/es-series — H2H history between two teams
The data lake (97K+ games) is populated by all four pipelines above

Setup:

Set SCRAPER_API_URL=https://scraper.prostaff.gg in the Rails API environment
Set ELASTICSEARCH_URL to the same ES instance in both repos
See PROSTAFF_SCRAPER_INTEGRATION_ANALYSIS.md for the full integration guide

Running Oracle's Elixir ingest (requires .venv — ES client v8):

cd /path/to/ProStaff-Scraper
ELASTICSEARCH_URL=https://user:pass@elastic.example.com \
  .venv/bin/python etl/oracles_elixir_ingest.py --years 2026

# Re-run is safe — duplicate gameids are skipped (op_type=create)
# To add a new year's CSV: download from oracleselixir.com and re-run with --years <year>
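
The reason a re-run is safe is the create operation type: Elasticsearch rejects a create for an ID that already exists instead of overwriting the document. A minimal sketch of how such bulk actions could be generated is shown below. It assumes each input dict already represents one game keyed by gameid; create_actions is a hypothetical helper and oracles_elixir_ingest.py may build its documents differently.

```python
from typing import Any, Iterable, Iterator

INDEX = "lol_pro_matches"


def create_actions(rows: Iterable[dict[str, Any]], index: str = INDEX) -> Iterator[dict[str, Any]]:
    """Yield bulk actions in the dict format used by elasticsearch.helpers.bulk."""
    for row in rows:
        yield {
            "_op_type": "create",  # fail on an existing _id instead of overwriting
            "_index": index,
            "_id": row["gameid"],  # assumption: gameid is used as the document ID
            "_source": row,
        }


# With the v8 client from the project's .venv this could be passed on as:
#   from elasticsearch import Elasticsearch, helpers
#   helpers.bulk(Elasticsearch(url), create_actions(rows), raise_on_error=False)
# so that conflicts for already indexed games are reported rather than raised.
```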

Resources

Full deployment guide: DEPLOYMENT.md
Quick start: QUICKSTART.md
Architecture: docs/Arquitetura.md
API docs (Swagger): https://scraper.prostaff.gg/docs

License

CC BY-NC-SA 4.0 — Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
