Skip to content

fix: block unsafe tar extraction paths#647

Open
he-yufeng wants to merge 1 commit into
qdrant:mainfrom
he-yufeng:fix/safe-tar-extract
Open

fix: block unsafe tar extraction paths#647
he-yufeng wants to merge 1 commit into
qdrant:mainfrom
he-yufeng:fix/safe-tar-extract

Conversation

@he-yufeng

Copy link
Copy Markdown

Summary

Fixes #626.

decompress_to_cache() used tar.extractall() directly, so a crafted archive member such as ../outside.txt could write outside the target cache directory. This validates every tar member before extraction and keeps Python 3.12+'s filter="data" protection when available.

The guard covers both member paths and tar links:

  • file/directory members must resolve inside cache_dir
  • symlink and hardlink targets must also resolve inside cache_dir
  • unsafe archives fail before anything is extracted

Validation

python -m pytest tests/test_model_management.py -q
python -m ruff check fastembed/common/model_management.py tests/test_model_management.py
python -m ruff format --check fastembed/common/model_management.py tests/test_model_management.py
python -m py_compile fastembed/common/model_management.py tests/test_model_management.py
git diff --check

@coderabbitai

coderabbitai Bot commented Jun 7, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR hardens tar extraction in ModelManagement.decompress_to_cache against path traversal attacks (Zip Slip / Tar Slip). A new _validate_tar_member helper validates that extracted member paths and symlink targets remain within the cache directory. The main method now enumerates members, validates each, attempts Python 3.12+ filter='data' with fallback for older versions, and improves error handling. Comprehensive tests verify safe extraction and reject path traversal via both member names and symlink targets.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix: block unsafe tar extraction paths' clearly and concisely summarizes the main security fix: preventing path traversal attacks in tar extraction.
Description check ✅ Passed The description is directly related to the changeset, explaining the tar path traversal vulnerability fix, validation approach, and including verification steps.
Linked Issues check ✅ Passed The PR comprehensively addresses issue #626 by implementing path traversal validation for tar members and symlink targets, with fallback for older Python versions and extensive test coverage.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the tar path traversal vulnerability: validation logic, error handling improvements, and comprehensive security tests with no extraneous modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/test_model_management.py (1)

42-54: ⚡ Quick win

Consider adding a hardlink path traversal test for complete coverage.

The _validate_tar_member method handles both symlinks (issym()) and hardlinks (islnk()) with different resolution logic, but only symlinks are tested. Adding a hardlink test would ensure the hardlink-specific branch is covered.

Suggested additional test
def test_decompress_to_cache_rejects_hardlink_path_traversal(tmp_path):
    archive_path = tmp_path / "model.tar.gz"
    cache_dir = tmp_path / "cache"
    cache_dir.mkdir()

    with tarfile.open(archive_path, "w:gz") as tar:
        link = tarfile.TarInfo(name="model/escape")
        link.type = tarfile.LNKTYPE
        link.linkname = "../outside.txt"
        tar.addfile(link)

    with pytest.raises(ValueError, match="Unsafe tar link target"):
        ModelManagement.decompress_to_cache(str(archive_path), str(cache_dir))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_model_management.py` around lines 42 - 54, Add a new unit test
mirroring test_decompress_to_cache_rejects_symlink_path_traversal but creating a
TarInfo with type tarfile.LNKTYPE (hardlink) and a linkname that points outside
the cache (e.g., "../outside.txt"), then call
ModelManagement.decompress_to_cache and assert it raises ValueError with "Unsafe
tar link target"; this ensures the hardlink branch in _validate_tar_member /
ModelManagement.decompress_to_cache is exercised.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@fastembed/common/model_management.py`:
- Around line 317-323: The exception handler currently catches
(tarfile.TarError, ValueError) as e and re-raises a new ValueError, losing the
original traceback; update the except block in model_management.py (the block
that checks "tmp" in cache_dir and removes cache_dir) to re-raise the new
ValueError using exception chaining (i.e., use "raise ValueError(f'An error
occurred while decompressing {targz_path}: {e}') from e") so the original
exception `e` is preserved for debugging.

---

Nitpick comments:
In `@tests/test_model_management.py`:
- Around line 42-54: Add a new unit test mirroring
test_decompress_to_cache_rejects_symlink_path_traversal but creating a TarInfo
with type tarfile.LNKTYPE (hardlink) and a linkname that points outside the
cache (e.g., "../outside.txt"), then call ModelManagement.decompress_to_cache
and assert it raises ValueError with "Unsafe tar link target"; this ensures the
hardlink branch in _validate_tar_member / ModelManagement.decompress_to_cache is
exercised.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 10847843-20ac-49f6-9e60-4453391d48a0

📥 Commits

Reviewing files that changed from the base of the PR and between 8a8ea4f and c539bcc.

📒 Files selected for processing (2)
  • fastembed/common/model_management.py
  • tests/test_model_management.py

Comment on lines +317 to 323
except (tarfile.TarError, ValueError) as e:
# If any error occurs while opening or extracting the tar.gz file,
# delete the cache directory (if it was created in this function)
# and raise the error again
if "tmp" in cache_dir:
if "tmp" in cache_dir and os.path.exists(cache_dir):
shutil.rmtree(cache_dir)
raise ValueError(f"An error occurred while decompressing {targz_path}: {e}")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Chain the original exception for better debugging.

The raised ValueError loses the original exception context. Use raise ... from e to preserve the full traceback chain.

Proposed fix
-            raise ValueError(f"An error occurred while decompressing {targz_path}: {e}")
+            raise ValueError(f"An error occurred while decompressing {targz_path}: {e}") from e
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
except (tarfile.TarError, ValueError) as e:
# If any error occurs while opening or extracting the tar.gz file,
# delete the cache directory (if it was created in this function)
# and raise the error again
if "tmp" in cache_dir:
if "tmp" in cache_dir and os.path.exists(cache_dir):
shutil.rmtree(cache_dir)
raise ValueError(f"An error occurred while decompressing {targz_path}: {e}")
except (tarfile.TarError, ValueError) as e:
# If any error occurs while opening or extracting the tar.gz file,
# delete the cache directory (if it was created in this function)
# and raise the error again
if "tmp" in cache_dir and os.path.exists(cache_dir):
shutil.rmtree(cache_dir)
raise ValueError(f"An error occurred while decompressing {targz_path}: {e}") from e
🧰 Tools
🪛 Ruff (0.15.15)

[warning] 323-323: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fastembed/common/model_management.py` around lines 317 - 323, The exception
handler currently catches (tarfile.TarError, ValueError) as e and re-raises a
new ValueError, losing the original traceback; update the except block in
model_management.py (the block that checks "tmp" in cache_dir and removes
cache_dir) to re-raise the new ValueError using exception chaining (i.e., use
"raise ValueError(f'An error occurred while decompressing {targz_path}: {e}')
from e") so the original exception `e` is preserved for debugging.

Source: Linters/SAST tools

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Tar path traversal (Zip Slip) in decompress_to_cache — arbitrary file write outside cache directory

1 participant