diff --git a/README.ko.md b/README.ko.md
new file mode 100644
index 0000000..3d79fa3
--- /dev/null
+++ b/README.ko.md
@@ -0,0 +1,163 @@
+# Lang2SQL
+
+> **An open-source data agent that writes SQL from natural-language questions.**
+> The goal is to work not just on a cleanly organized DB, but on the
+> **messy real-world DB**, where column descriptions are empty and every team uses different terms.
+
+📄 English: [`README.md`](README.md) · 🧭 Full picture: [`docs/PROJECT.md`](docs/PROJECT.md) · 🏗️ Architecture: [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md)
+
+---
+
+## One-line summary
+
+Ask the bot **a question in natural language** on Discord, and it **writes the SQL, runs it, and answers**.
+What sets it apart from other text-to-SQL tools is not "question→SQL" itself but everything **around** it:
+
+- **🧩 Auto-fill empty metadata (enrich)** — even without column descriptions, the agent reads the *actual values* to infer "what this column means / which tables it joins to" and fills that in.
+- **🗂️ Per-team term definitions (federation)** — the same "active customer" can mean different things in Marketing and Finance and still coexist without conflict. Team definitions are layered on top of the company-wide ones, and the **closest definition wins (member > team > company)**.
+- **🛡️ Safety** — every query must pass a check before it runs, and only reads (SELECT) are allowed.
+
+> Discord is only the Phase 1 interface, not the identity. Slack/Web just add adapters on the same core.
+
+---
+
+## Quickstart 1 — offline demo (no token, no DB)
+
+The fastest way to see the core. No Discord token and no real DB needed.
+
+```bash
+uv sync # create the virtualenv + install dependencies
+.venv/bin/python bench/ecommerce_demo.py # federation + safety demo
+```
+
+Shows the federation scene where the same term resolves to a different definition per channel, and the safety gate that blocks risky queries (DROP/INSERT) and lets only SELECT through.
+
+## Quickstart 2 — CLI (for developers)
+
+```bash
+.venv/bin/lang2sql "list the tables"
+```
+
+With `OPENAI_API_KEY` set it runs on `gpt-4.1-mini`; without it, on the offline `FakeLLM` (canned behavior only, no real reasoning).
+
+---
+
+## Discord bot setup (in detail)
+
+### 0. Prerequisites
+- Python **3.10 or later**, [uv](https://docs.astral.sh/uv/)
+- A Discord account + a server (guild) to invite the bot to
+
+### 1. Install
+```bash
+git clone https://github.com/CausalInferenceLab/lang2sql.git
+cd lang2sql
+uv sync
+```
+
+### 2. Create the Discord bot
+1. [Discord Developer Portal](https://discord.com/developers/applications) → **New Application**
+2. **Bot** tab on the left → **Reset Token** → **copy** the token (this is `DISCORD_BOT_TOKEN`)
+3. On the same screen, turn on **Privileged Gateway Intents → MESSAGE CONTENT INTENT** (needed to read mention questions)
+4. **OAuth2 → URL Generator** → check `bot` + `applications.commands` under scopes → select permissions (read/send messages, etc.) → use the generated URL to **invite the bot to your test server**
+
+### 3. Set environment variables
+Copy `.env.example` to create `.env` and fill it in:
+
+```bash
+cp .env.example .env
+```
+
+```ini
+DISCORD_BOT_TOKEN=your_bot_token_here # required
+OPENAI_API_KEY=sk-... # for real answers (falls back to the fake LLM if unset)
+LANG2SQL_SECRET_KEY= # optional — Fernet key for encrypting secrets (e.g. DB password)
+LANG2SQL_DATA_PATH=lang2sql_data.db # optional — persistence file for definitions and sessions (default if unset)
+LANG2SQL_SYNC_COMMANDS=true # register slash commands (/setup, etc.)
+LANG2SQL_DB_URL= # optional — default DB for all channels (see below)
+```
+
+Generate a Fernet key if you need one:
+```bash
+.venv/bin/python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
+```
+
+> ⚠️ The app does not read `.env` automatically. Load it into your shell right before running:
+> `set -a; source .env; set +a`
+
+### 4. Run the bot
+```bash
+set -a; source .env; set +a # load .env as environment variables
+.venv/bin/lang2sql-bot
+```
+If `DISCORD_BOT_TOKEN` is missing, it exits with a clear error. Otherwise it connects to the gateway and starts serving.
+
+### 5. Connect a DB
+Two ways:
+
+**(A) `/setup` in Discord** — a guided form for non-developers. No need to type a DSN by hand.
+- `/setup` → pick the DB type → fill in the form → after a connection test, **stored encrypted**
+- Supported: **PostgreSQL · MySQL · BigQuery · Snowflake · DuckDB · Cloudflare D1**
+- e.g. DuckDB → put `/absolute/path/file.duckdb` in the path field
+
+**(B) The `LANG2SQL_DB_URL` environment variable** — set it before starting the bot and every channel uses that DB.
+```ini
+LANG2SQL_DB_URL=postgresql://user:pw@host:5432/db
+# or
+LANG2SQL_DB_URL=duckdb:////absolute/path/file.duckdb # 4 slashes = absolute path
+```
+
+> In V1, `/connect` is an unfinished command that **only stores the string and does not actually connect**. Use `/setup` for a real connection.
+
+### 6. Usage
+- **Natural-language questions** — mention the bot in a channel or DM it: `@Lang2SQL revenue by country`
+- **`/enrich`** — auto-enrich column meanings and table relationships (greatly improves the quality of answers to questions)
+- **`/term_custom`** — register / look up / delete business terms
+- **`/org_setup`** — auto-extract terms by scanning the DB (`org:` = company-wide, `team:` = this channel)
+
+---
+
+## Slash commands
+
+| Command | Description |
+|---|---|
+| `/setup` | Connect a DB (guided form, no DSN needed) — **the real connection path** |
+| `/enrich` | Auto-enrich column metadata (`clear:true` resets it) |
+| `/term_custom` | Register, show (`action:show`), or remove (`action:remove`) business terms |
+| `/org_setup` | Register an org (`org:`) / team (`team:`) + auto-extract terms by scanning the DB |
+| `/remember` | Remember a fact or preference |
+| `/ingest` | Propose definition candidates from a document |
+| `/audit_me` | Show my recent activity |
+| `/connect` | (unfinished in V1 — stores only, do not use) |
+
+Natural-language questions go through **mentions/DM**, not slash commands. The agent calls the tools above on its own when needed.
+
+---
+
+## What works now / what doesn't yet (honestly)
+
+**Works**
+- 3-layer federation (company / team / member) + closest definition first + registering definitions through conversation
+- Real external DB connections (PostgreSQL/MySQL/DuckDB/BigQuery/Snowflake/D1, SQLAlchemy-based)
+- enrich — automatic inference of column meanings and relationships from samples of real values
+- Safety (read-only, risky queries blocked), 8 tools, encrypted secret storage, SQLite persistence
+
+**Not yet**
+- Large-scale validation on a real in-house production DB (tracked with a benchmark)
+- More advanced automatic metadata enrichment, vector recall, URL/Notion document input, cost gate, etc. (V1.5+)
+
+---
+
+## Contributing
+
+```bash
+uv sync
+.venv/bin/pytest -q # confirm the full test suite passes
+```
+- Add tests (`tests/test_.py`) for new features
+- PRs target `master`; prefix commit messages with `feat:`/`fix:`/`docs:`
+- Where to start: [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md)
+
+## License / Community
+
+In development by the causal inference team at [가짜연구소](https://pseudo-lab.com/). [MIT License](https://opensource.org/licenses/MIT). 💬 [Discord](https://discord.gg/EPurkHVtp2)
diff --git a/README.md b/README.md
index d072219..4be45ae 100644
--- a/README.md
+++ b/README.md
@@ -16,130 +16,156 @@
 ---
-> **A document-learning, read-only SQL analytics agent.**
-> Feed it your company's docs → it learns your business context → it keeps a
-> *separate* set of definitions per team → it answers questions over an
-> incomplete database → it remembers every definition and conversation.
+> **An open-source data agent that turns natural language into SQL.**
+> Not on a clean, well-documented database — on the **messy real world**, where
+> columns have no descriptions and every team means something different by the
+> same word.
-👉 **Project overview (single SSOT)**: [`docs/PROJECT.md`](docs/PROJECT.md) · **Contributor quick guide**: [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md)
-
-This is the **v4.1 rebuild** (background/design intent: [`docs/discord_first_redesign_v4_1.md`](docs/discord_first_redesign_v4_1.md)).
-Where most text-to-SQL projects compete on *"generate better SQL,"* Lang2SQL
-competes on everything *around* the query: business-context learning, per-team
-semantics, robustness to messy databases, and memory. **Discord is the Phase 1
-interface, not the identity** — Slack/Web are adapters on the same core.
+📄 한국어: [`README.ko.md`](README.ko.md) · 🧭 Full picture: [`docs/PROJECT.md`](docs/PROJECT.md) · 🏗️ Architecture: [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md)
 ---
-## The four pillars
-
-| Pillar | What it is |
-|---|---|
-| **① Business-context learning** | Documents are the source of truth. Drop in a doc → the agent extracts metric/dimension/rule candidates → you confirm → they land in the semantic layer. |
-| **② Two-axis robustness** | **(2a) DB robustness** — works even when columns lack descriptions (auto-enrichment, v1.5). **(2b) Semantic robustness** — teams hold *different* definitions of the same term without conflict. This axis is the product/research identity. |
-| **③ Hermes memory** | Conversations, facts, and preferences persist instead of resetting each session. |
-| **④ Multi-interface** | Phase 1 Discord today; Slack/Web are future adapters. No platform lock-in. |
-
-## Extensibility — outlets and appliances (콘센트/가전)
+## In one minute
-V1 ships the **simplest single implementation** of each extension point, but the
-**abstraction (port) is already in place**, so v1.5/v2 add a new implementation
-*without touching existing code*. Like a wall outlet: the V1 socket has one LED
-bulb plugged in, but because the socket is standard, you later plug in a fan or a
-smart light without rewiring the wall.
+Ask the bot a question in Discord → it writes SQL, runs it, and answers.
+What's different from other text-to-SQL tools isn't "question → SQL" itself —
+it's everything *around* it:
-Four ★ extension patterns sit behind `core/ports/`:
+- **🧩 Fill empty metadata (enrich)** — even with no column descriptions, the
+  agent reads the *actual values* to infer what each column means and how tables
+  join, and writes that into the semantic layer.
+- **🗂️ Per-team definitions (federation)** — the same "active customer" can mean
+  different things to Marketing and Finance, with no conflict. A company-wide
+  default sits underneath, and the **closest definition wins (member > team > company)**.
+- **🛡️ Safety** — every query is checked before it runs; only reads (SELECT) are allowed.
-| ★ | Pattern | Port | Grows by |
-|---|---|---|---|
-| ① | **Safety pipeline** | `ports/safety.py` | adding one layer class to the line (zero `run_sql` changes) |
-| ② | **Memory service** | `ports/memory.py` | swapping any of 3 axes — Store / Recall / Extractor — independently |
-| ③ | **Ingestion pipeline** | `ports/ingestion.py` | a Source × Extractor matrix |
-| ④ | **Semantic federation** | `ports/semantic_scope.py` | git-like per-team scope branches |
-
-Everything outside `tenancy/concierge.py` depends only on these Protocols, so the
-concrete classes (OpenAI, Postgres, SQLite) are swappable at the seams.
+> Discord is the Phase 1 interface, not the identity. Slack/Web are adapters on the same core.
 ---
-## Quickstart
+## Quickstart 1 — offline demo (no token, no database)
-Requires Python ≥ 3.10 and [uv](https://docs.astral.sh/uv/).
+The fastest way to see the core. No Discord token, no real DB.
```bash -uv sync # create .venv and install deps +uv sync # create .venv + install deps +.venv/bin/python bench/ecommerce_demo.py # federation + safety demo ``` -### 1. Run the offline demo (no token, no database) +Shows one term resolving to two team definitions with zero conflict, and the +safety gate (DROP/INSERT blocked, SELECT passes). See [`bench/README.md`](bench/README.md). + +## Quickstart 2 — CLI (for developers) ```bash -.venv/bin/python bench/ecommerce_demo.py +.venv/bin/lang2sql "list the tables" ``` -Shows the federation money-shot (one term, two team definitions, no conflict) and -the safety gate (DROP/INSERT blocked, SELECT passes). See [`bench/README.md`](bench/README.md). +With `OPENAI_API_KEY` set it uses `gpt-4.1-mini`; otherwise the offline `FakeLLM` +(canned behavior, no real reasoning). -### 2. Run the CLI (developer driver) +--- + +## Discord bot setup (step by step) + +### 0. Prerequisites +- Python **3.10+**, [uv](https://docs.astral.sh/uv/) +- A Discord account and a server to invite the bot to +### 1. Install ```bash -.venv/bin/lang2sql "list the tables" +git clone https://github.com/CausalInferenceLab/lang2sql.git +cd lang2sql +uv sync ``` -The CLI assembles a real `HarnessContext` and runs one turn through the agent -loop. With `OPENAI_API_KEY` set it calls `gpt-4.1-mini`; otherwise it uses the -offline `FakeLLM`. +### 2. Create the Discord bot +1. [Discord Developer Portal](https://discord.com/developers/applications) → **New Application** +2. **Bot** tab → **Reset Token** → copy it (this is `DISCORD_BOT_TOKEN`) +3. Same screen → enable **Privileged Gateway Intents → MESSAGE CONTENT INTENT** (needed to read mentions) +4. **OAuth2 → URL Generator** → scopes `bot` + `applications.commands` → pick permissions → open the generated URL to **invite the bot** to your test server -### 3. Run the Discord bot +### 3. 
Configure environment +```bash +cp .env.example .env +``` +```ini +DISCORD_BOT_TOKEN=your_bot_token # required +OPENAI_API_KEY=sk-... # for real answers (else FakeLLM) +LANG2SQL_SECRET_KEY= # optional — Fernet key to encrypt secrets +LANG2SQL_DATA_PATH=lang2sql_data.db # optional — persistence file +LANG2SQL_SYNC_COMMANDS=true # register slash commands (/setup, ...) +LANG2SQL_DB_URL= # optional — a default DB for all channels +``` +Generate a Fernet key if you want one: +```bash +.venv/bin/python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())" +``` +> ⚠️ The app does not auto-load `.env`. Load it into your shell first: +> `set -a; source .env; set +a` +### 4. Run the bot ```bash -export DISCORD_BOT_TOKEN=... # required -export OPENAI_API_KEY=... # optional; offline FakeLLM if unset -export LANG2SQL_SECRET_KEY=... # optional; Fernet key for secret encryption +set -a; source .env; set +a # load .env into the environment .venv/bin/lang2sql-bot ``` +It exits with a clear error if `DISCORD_BOT_TOKEN` is unset; otherwise it connects +to the gateway and serves. Full hosting guide: [`docs/DEPLOY.md`](docs/DEPLOY.md). + +### 5. Connect a database +Two ways: -The bot exits loudly if `DISCORD_BOT_TOKEN` is unset. Full setup and hosting: -[`docs/DEPLOY.md`](docs/DEPLOY.md). Copy [`.env.example`](.env.example) to start. +**(A) `/setup` in Discord** — a guided form for non-developers (no DSN typing). +Pick the DB type, fill the form; it tests the connection and stores credentials +encrypted. Supports **PostgreSQL · MySQL · BigQuery · Snowflake · DuckDB · Cloudflare D1**. +(DuckDB: put `/absolute/path/file.duckdb` in the path field.) 
+
+**(B) `LANG2SQL_DB_URL`** — set before launch to point every channel at one DB:
+```ini
+LANG2SQL_DB_URL=postgresql://user:pw@host:5432/db # PostgreSQL
+LANG2SQL_DB_URL=duckdb:////absolute/path/file.duckdb # or DuckDB; 4 slashes = absolute path
+```
+> `/connect` is a V1 stub (stores the string but does not actually connect) — use `/setup`.
+
+### 6. Use it
+- **Ask in natural language** — mention the bot or DM it: `@Lang2SQL revenue by country`
+- **`/enrich`** — auto-fill column meanings & relationships (big quality boost)
+- **`/term_custom`, `/org_setup`** — define team-specific business terms
 ---
-## What V1 does / does NOT do yet (honesty section)
-
-**Does:**
-- 3-scope semantic federation (guild / channel / member) with most-specific-wins
-  resolution; `term_custom` registers definitions per scope (KV-backed).
-- Safety pipeline with the V1 layers (whitelist + timeout), gating every query.
-- Agent loop with eight tools: `run_sql`, `explore_schema`, `enrich_schema`,
-  `term_custom`, `org_setup`, `ingest_doc`, `remember`, `ask_user`.
-- Memory service (in-memory store + inject-all recall + manual `/remember`).
-- Discord frontend (bot, commands, session router, render).
-- Encrypted-at-rest secrets (Fernet) and SQLite-backed persistence.
-
-**Does NOT yet:**
-- **Execute against a real database.** `PostgresExplorer` is a **V1 stub** with
-  canned `orders`/`users` schema and sample rows; real psycopg execution is v1.5.
-- **Reason without a key.** Without `OPENAI_API_KEY`, the `FakeLLM` returns
-  deterministic canned tool cycles — useful for wiring tests, not for answers.
-- DB metadata auto-enrichment, AST-precise SQL validation, function blocklists,
-  cost gating, `/semantic diff` / `/semantic promote`, keyword/vector recall,
-  automatic fact extraction, URL/Notion ingestion — all scoped to v1.5+.
-- Persist across restarts by default: the V1 `SqliteStore` defaults to in-memory;
-  point it at a file for durability.
+## Slash commands
+
+| Command | What it does |
+|---|---|
+| `/setup` | Connect a DB via a guided form (no DSN) — **the real connection path** |
+| `/enrich` | Auto-enrich column metadata (`clear:true` resets) |
+| `/term_custom` | Register / show (`action:show`) / remove (`action:remove`) business terms |
+| `/org_setup` | Register org (`org:`) / team (`team:`) + auto-extract terms by scanning the DB |
+| `/remember` | Remember a fact for later |
+| `/ingest` | Propose definitions from a document |
+| `/audit_me` | Show your recent activity |
+| `/connect` | (V1 stub — stores only, don't use) |
+
+Natural-language questions go through **mentions/DM**, not slash commands — the
+agent calls the tools above itself when needed.
 ---
-## Roadmap at a glance
+## What works / what's next (honest)
+
+**Works**
+- 3-layer federation (company / team / member), closest-definition-wins, plus
+  registering definitions through conversation.
+- Real external DB connections (PostgreSQL / MySQL / DuckDB / BigQuery / Snowflake / D1, via SQLAlchemy).
+- `enrich` — infers column meanings & relationships from sampled real values.
+- Safety pipeline (read-only, blocks risky SQL), eight tools, encrypted secrets, SQLite persistence.
-| Area | V1 | V1.5 | V2 | V2.5 |
-|---|---|---|---|---|
-| **Safety** | whitelist + timeout | + AST validation, function blocklist, auto LIMIT, **metadata enrichment**, rate limit | + cost gate (EXPLAIN), per-engine pipelines | — |
-| **Memory** | in-memory + inject-all + manual | SQLite store + keyword recall + auto-extract | + vector recall + conflict resolution | PostgreSQL + hybrid recall + confidence |
-| **Ingestion** | file upload + LLM extract | + URL fetch + DDL parsing | + Notion/Confluence + hybrid | + GitHub/Drive + chunked RAG |
-| **Federation** | 3-scope resolution, `/semantic show` | `/semantic diff`, `/semantic promote`, conflict alerts | git sync (semantic-as-code) | branch fork/merge UI, per-scope audit |
-| **Interface** | Discord | (Anthropic/NIM eval) | Slack | Web |
+**Not yet**
+- Large-scale validation on a real production DB (tracked via our dirty-data benchmark).
+- Deeper auto-enrichment, vector recall, URL/Notion ingestion, cost gating — scoped to v1.5+.
-See [`docs/discord_first_redesign_v4_1.md`](docs/discord_first_redesign_v4_1.md)
-for the full architecture write-up.
+See [`docs/discord_first_redesign_v4_1.md`](docs/discord_first_redesign_v4_1.md) for the full architecture write-up.
 ---
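
Note on the README text above: the "closest definition wins (member > team > company)" rule and the read-only gate can be pictured with a small Python sketch. This is an illustration under stated assumptions, not Lang2SQL's actual implementation. `SCOPE_ORDER`, `resolve_term`, and `is_read_only` are hypothetical names introduced here, and the project's real safety pipeline is presumably more thorough than a first-keyword check.

```python
from typing import Optional

# Hypothetical scope order, most specific first (member > team > company).
SCOPE_ORDER = ("member", "team", "company")


def resolve_term(term: str, definitions: dict[str, dict[str, str]]) -> Optional[str]:
    """Return the definition of `term` from the closest scope that has one."""
    for scope in SCOPE_ORDER:
        scoped = definitions.get(scope, {})
        if term in scoped:
            return scoped[term]
    return None  # no scope defines the term


def is_read_only(sql: str) -> bool:
    """Rough read-only gate: accept only a single statement starting with SELECT."""
    statement = sql.strip().rstrip(";").strip()
    # Reject empty input and anything with a second statement. This is
    # deliberately conservative: a SELECT containing ';' in a string literal
    # is rejected too.
    if not statement or ";" in statement:
        return False
    return statement.split(None, 1)[0].upper() == "SELECT"
```

With a company-wide and a team-level definition of "active customer" both registered, `resolve_term` returns the team one, and a member-level entry would override both. `is_read_only` rejects `DROP TABLE orders` and stacked statements while letting a plain `SELECT` through.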