fix: block unsafe tar extraction paths #647
📝 Walkthrough
This PR hardens tar extraction in […]
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/test_model_management.py (1)
42-54: ⚡ Quick win
Consider adding a hardlink path traversal test for complete coverage.
The `_validate_tar_member` method handles both symlinks (`issym()`) and hardlinks (`islnk()`) with different resolution logic, but only symlinks are tested. Adding a hardlink test would ensure the hardlink-specific branch is covered.
Suggested additional test
```python
def test_decompress_to_cache_rejects_hardlink_path_traversal(tmp_path):
    archive_path = tmp_path / "model.tar.gz"
    cache_dir = tmp_path / "cache"
    cache_dir.mkdir()

    with tarfile.open(archive_path, "w:gz") as tar:
        link = tarfile.TarInfo(name="model/escape")
        link.type = tarfile.LNKTYPE
        link.linkname = "../outside.txt"
        tar.addfile(link)

    with pytest.raises(ValueError, match="Unsafe tar link target"):
        ModelManagement.decompress_to_cache(str(archive_path), str(cache_dir))
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_model_management.py` around lines 42 - 54, Add a new unit test mirroring test_decompress_to_cache_rejects_symlink_path_traversal but creating a TarInfo with type tarfile.LNKTYPE (hardlink) and a linkname that points outside the cache (e.g., "../outside.txt"), then call ModelManagement.decompress_to_cache and assert it raises ValueError with "Unsafe tar link target"; this ensures the hardlink branch in _validate_tar_member / ModelManagement.decompress_to_cache is exercised.
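The "different resolution logic" mentioned in this nitpick can be illustrated with a small standalone sketch. It assumes the usual tar convention: a symlink target is interpreted relative to the directory containing the link, while a hardlink target names another archive member relative to the extraction root. The helpers `link_target_path` and `is_inside` are hypothetical names for illustration and are not taken from the PR's `_validate_tar_member`.

```python
import os
import tarfile


def link_target_path(member: tarfile.TarInfo, dest: str) -> str:
    """Return the absolute path a link member would refer to under dest.

    Hypothetical helper, not the PR's implementation.
    """
    if member.issym():
        # Symlink targets are relative to the directory that holds the link.
        base = os.path.join(dest, os.path.dirname(member.name))
    else:
        # Hardlink targets name another member, relative to the archive root.
        base = dest
    return os.path.realpath(os.path.join(base, member.linkname))


def is_inside(path: str, dest: str) -> bool:
    """Check that an absolute path stays within the extraction directory."""
    dest_real = os.path.realpath(dest)
    return os.path.commonpath([dest_real, path]) == dest_real


def make_link(name: str, linkname: str, link_type: bytes) -> tarfile.TarInfo:
    """Build an in-memory link member without touching the filesystem."""
    info = tarfile.TarInfo(name=name)
    info.type = link_type
    info.linkname = linkname
    return info
```

Under these assumptions the same `linkname` can be safe for one link type and unsafe for the other: `"../weights.bin"` on a symlink named `model/a/link` resolves to `model/weights.bin` inside the destination, whereas on a hardlink it resolves to a path one level above the destination. That is one reason a separate hardlink test is useful.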
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@fastembed/common/model_management.py`:
- Around line 317-323: The exception handler currently catches
(tarfile.TarError, ValueError) as e and re-raises a new ValueError, losing the
original traceback; update the except block in model_management.py (the block
that checks "tmp" in cache_dir and removes cache_dir) to re-raise the new
ValueError using exception chaining (i.e., use "raise ValueError(f'An error
occurred while decompressing {targz_path}: {e}') from e") so the original
exception `e` is preserved for debugging.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 10847843-20ac-49f6-9e60-4453391d48a0
📒 Files selected for processing (2)
- fastembed/common/model_management.py
- tests/test_model_management.py
```diff
 except (tarfile.TarError, ValueError) as e:
     # If any error occurs while opening or extracting the tar.gz file,
     # delete the cache directory (if it was created in this function)
     # and raise the error again
-    if "tmp" in cache_dir:
+    if "tmp" in cache_dir and os.path.exists(cache_dir):
         shutil.rmtree(cache_dir)
     raise ValueError(f"An error occurred while decompressing {targz_path}: {e}")
```
Chain the original exception for better debugging.
The raised ValueError loses the original exception context. Use raise ... from e to preserve the full traceback chain.
Proposed fix
```diff
- raise ValueError(f"An error occurred while decompressing {targz_path}: {e}")
+ raise ValueError(f"An error occurred while decompressing {targz_path}: {e}") from e
```
📝 Committable suggestion
```python
except (tarfile.TarError, ValueError) as e:
    # If any error occurs while opening or extracting the tar.gz file,
    # delete the cache directory (if it was created in this function)
    # and raise the error again
    if "tmp" in cache_dir and os.path.exists(cache_dir):
        shutil.rmtree(cache_dir)
    raise ValueError(f"An error occurred while decompressing {targz_path}: {e}") from e
```
🧰 Tools
🪛 Ruff (0.15.15)
[warning] 323-323: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
Source: Linters/SAST tools
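To make the B904 finding above concrete, here is a minimal sketch of what `from e` changes on the raised exception. The function `reraise` and its messages are hypothetical and only mimic the shape of the handler under review. Python still records the original error in `__context__` when `from e` is omitted; the explicit form additionally sets `__cause__`, which marks the new error as a deliberate wrapper instead of a failure inside the handler.

```python
import tarfile


def reraise(chain: bool) -> ValueError:
    """Raise a wrapped error the way the handler does, and return it for inspection."""
    try:
        try:
            raise tarfile.ReadError("truncated archive")
        except tarfile.TarError as e:
            if chain:
                # Explicit chaining: sets __cause__ on the new exception.
                raise ValueError(f"decompress failed: {e}") from e
            # Implicit chaining only: __cause__ stays None, __context__ is set.
            raise ValueError(f"decompress failed: {e}")
    except ValueError as wrapped:
        return wrapped


chained = reraise(chain=True)
unchained = reraise(chain=False)
```

In a traceback, the chained variant is reported as "The above exception was the direct cause of the following exception", while the unchained variant is reported as "During handling of the above exception, another exception occurred".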
Summary
Fixes #626.
`decompress_to_cache()` used `tar.extractall()` directly, so a crafted archive member such as `../outside.txt` could write outside the target cache directory. This validates every tar member before extraction and keeps Python 3.12+'s `filter="data"` protection when available.

The guard covers both member paths and tar links:
- […] `cache_dir`
- […] `cache_dir`

Validation
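One way to sanity-check the behaviour described in the summary is a small standalone sketch of the "validate every member, then extract" pattern. This is not the PR's code: `safe_extract` is a hypothetical name, the error message is illustrative, and link-target checks are omitted for brevity, so only member paths are validated here. The `hasattr(tarfile, "data_filter")` probe is the feature check the Python documentation suggests for extraction filters.

```python
import os
import tarfile


def safe_extract(archive_path: str, dest: str) -> None:
    """Reject members that would land outside dest, then extract (illustrative sketch)."""
    dest_real = os.path.realpath(dest)
    with tarfile.open(archive_path, "r:*") as tar:
        # First pass: validate every member before anything is written.
        for member in tar.getmembers():
            target = os.path.realpath(os.path.join(dest_real, member.name))
            if os.path.commonpath([dest_real, target]) != dest_real:
                raise ValueError(f"Unsafe tar member path: {member.name}")
        # Second pass: extract, keeping the stdlib "data" filter when it exists.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest_real, filter="data")
        else:
            tar.extractall(dest_real)
```

Validating the whole archive before the first write means a rejected archive leaves the destination untouched, which is a design choice of this sketch and may differ from how the PR orders its checks.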