(Improvement) reduced memory for metadata schema parsing (main improvement - Select only needed columns from system_schema.column )#745
Conversation
There was a problem hiding this comment.
Pull request overview
This PR aims to reduce CPU time and memory usage during schema metadata refresh by avoiding per-row dict allocations and trimming schema column queries, while also shrinking metadata object overhead.
Changes:
- Introduces an internal
_RowView+_row_factoryand routes schema query result handling through it to reduce per-row allocations. - Adds
__slots__to several metadata model classes and replaces someOrderedDictusages with plaindictto reduce memory overhead. - Narrows the
system_schema.columnsquery inSchemaParserV3to only the fields needed by the parser and refactors_build_table_columnsto classify rows in a single pass.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
3a7f4bc to
4f9e4f5
Compare
There was a problem hiding this comment.
Pull request overview
This PR optimizes schema metadata refresh by reducing per-row allocations during schema parsing and narrowing the system_schema.columns select list to only the fields needed for building table/column metadata.
Changes:
- Introduces an internal lightweight row representation (
_RowView+_row_factory) and uses it in schema parser result handling to reduce time/memory overhead. - Reduces the
system_schema.columnsquery to fetch only required columns and refactors_build_table_columnsto classify rows in a single pass. - Adds unit tests for
_RowViewand_row_factorybehavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| cassandra/metadata.py | Adds _RowView/_row_factory, switches schema parser row handling away from per-row dicts, tightens system_schema.columns query, and updates metadata classes/docstrings/slots. |
| tests/unit/test_metadata.py | Adds unit tests validating _RowView and _row_factory semantics (getitem/get/contains/read-only/shared index map). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
069ce0f to
eda1086
Compare
|
Warning Review limit reached
Next review available in: 58 minutes Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThe change introduces Possibly related issues
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
8b0a29f to
576c932
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (2)
cassandra/metadata.py (2)
61-61: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winSort
__slots__to satisfy Ruff RUF023.Proposed fix
- __slots__ = ("_row", "_index_map") + __slots__ = ("_index_map", "_row")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cassandra/metadata.py` at line 61, The __slots__ definition in the metadata class needs to be sorted to satisfy Ruff RUF023. Update the __slots__ tuple in the class that defines _row and _index_map so the slot names are in the expected sorted order and keep the declaration consistent with the rest of the file.Source: Linters/SAST tools
56-58: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winDrop the custom
values()/items()generatorsMappingalready provides reusable view objects here; these overrides turn them into one-shot iterators and lose the usual size/reuse semantics.cassandra/metadata.py:82-85🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cassandra/metadata.py` around lines 56 - 58, The custom values() and items() generators in the Mapping implementation should be removed so the class can use the default Mapping view behavior. Update the metadata mapping class in cassandra/metadata.py by dropping the overridden values() and items() methods, and rely on the inherited reusable view objects from collections.abc.Mapping instead.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@cassandra/metadata.py`:
- Line 61: The __slots__ definition in the metadata class needs to be sorted to
satisfy Ruff RUF023. Update the __slots__ tuple in the class that defines _row
and _index_map so the slot names are in the expected sorted order and keep the
declaration consistent with the rest of the file.
- Around line 56-58: The custom values() and items() generators in the Mapping
implementation should be removed so the class can use the default Mapping view
behavior. Update the metadata mapping class in cassandra/metadata.py by dropping
the overridden values() and items() methods, and rely on the inherited reusable
view objects from collections.abc.Mapping instead.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: db48b541-ca15-4faa-92a1-9a2e6fd2a764
📒 Files selected for processing (2)
cassandra/metadata.pytests/unit/test_metadata.py
Introduce _RowView, a __slots__-based read-only row wrapper that stores data as tuples with a shared column-name-to-index map, and _row_factory that creates these views. _RowView inherits from collections.abc.Mapping, providing a complete dict-like read interface. This eliminates per-row dict allocation during schema parsing. All rows from the same result set share a single index map object.
Python 3.7+ guarantees dict preserves insertion order, making OrderedDict
unnecessary. Replace OrderedDict() with {} in TableMetadata.columns,
TableMetadata.triggers, and MaterializedViewMetadata.columns. Remove the
now-unused OrderedDict import.
….columns Replace SELECT * with an explicit column list for the system_schema.columns query in SchemaParserV3 (inherited by V4). Only the 7 columns actually consumed by the parser are fetched: keyspace_name, table_name, column_name, clustering_order, kind, position, type. This reduces network transfer and deserialization overhead during schema refresh.
Replace dict_factory in _SchemaParser._handle_results and get_column_from_system_local with _row_factory, eliminating per-row dict allocation during schema parsing. Also refactor SchemaParserV4._build_keyspace_metadata_internal to read from the row without mutating it, since _RowView is read-only. Note: V22-only dict_factory call sites are left unchanged as they do not affect the V3/V4 code path (V3 and V4 fully override _query_all).
Replace three list comprehension passes over col_rows with a single classification loop that sorts columns into partition, clustering, and other buckets. Also use in-place sort() instead of sorted() and reuse the already-built column_meta instead of a redundant dict lookup.
Cover __getitem__, get(), __contains__, __repr__, shared index map, read-only enforcement, empty input, single-column, and multi-row scenarios.
576c932 to
ef455ae
Compare
Reduce the time and memory required to parse schema metadata when refreshing. The key changes are:
system_schema.columns— the biggest network/memory win for large schemasdict_factorywith_RowView— lightweight tuple-backed Mapping that shares a single index map across all rows (5.7x memory reduction per row)__slots__on all metadata classes — no__dict__overhead, saving ~264 bytes perColumnMetadatainstance (~516 KB for 2000-column schemas)_build_table_columns— avoids repeated list-filtering passesOrderedDictwithdict— Python 3.7+ guarantees insertion orderPerformance results (Python 3.14, rebased on master)
_RowViewnow implements the fullcollections.abc.Mappingprotocol (keys(),values(),items(),len(), iteration).Pre-review checklist
./docs/source/.Fixes:annotations to PR description.