Skip to content

FEAT: Add ActiveDirectoryMSI support for bulk copy#573

Open
bewithgaurav wants to merge 11 commits into
mainfrom
bewithgaurav/gh534-bulkcopy-msi
Open

FEAT: Add ActiveDirectoryMSI support for bulk copy#573
bewithgaurav wants to merge 11 commits into
mainfrom
bewithgaurav/gh534-bulkcopy-msi

Conversation

@bewithgaurav
Copy link
Copy Markdown
Collaborator

@bewithgaurav bewithgaurav commented May 12, 2026

Work Item / Issue Reference

AB#45137

GitHub Issue: #534


Summary

Adds Authentication=ActiveDirectoryMSI support to bulk copy. Partial fix for #534.

  • Zero-arg ManagedIdentityCredential() for system-assigned MSI.
  • ManagedIdentityCredential(client_id=UID) for user-assigned MSI. Matches ODBC convention where UID carries the identity's client_id under MSI.
  • Threads optional credential_kwargs through get_auth_token / get_raw_token / _acquire_token so future auth methods that need constructor args plug in via the same channel.
  • Credential cache key stays a plain string for zero-arg auth types and becomes a tuple (auth_type, sorted_kwargs) when kwargs are present, so different client_ids get separate cached credentials.

mssql-py-core (the Rust path used by bulk copy) doesn't itself acquire Entra tokens. Python pre-acquires a JWT and passes it as access_token. Today this works for Default, DeviceCode, and Interactive. MSI was missing, the most common Azure-hosted-service auth method.

ServicePrincipal and Password are explicitly out of scope for this PR. They need different design work and ship separately.

How user-assigned MSI survives the bulkcopy fresh-token path

Connection.__init__ calls process_connection_string, which sanitizes self.connection_str (strips UID=, Authentication=, PWD=). Bulkcopy needs a fresh token at copy time. Re-parsing self.connection_str at that point would lose UID and silently degrade user-assigned MSI to system-assigned (wrong-identity bug).

Fix shape: process_connection_string returns a 4-tuple including the captured credential_kwargs. Connection.__init__ persists _credential_kwargs alongside _auth_type. cursor.bulkcopy() reads self.connection._credential_kwargs directly. No re-parse, no chance for the sanitized string to drop the client_id.

Connection string examples

# System-assigned MSI
"Server=tcp:foo.database.windows.net;Authentication=ActiveDirectoryMSI;Database=mydb;"

# User-assigned MSI
"Server=tcp:foo.database.windows.net;Authentication=ActiveDirectoryMSI;UID=<client_id_guid>;Database=mydb;"

Adds Authentication=ActiveDirectoryMSI to the auth pipeline:

- Zero-arg ManagedIdentityCredential() for system-assigned MSI.
- ManagedIdentityCredential(client_id=UID) for user-assigned MSI,
  matching ODBC's convention where UID carries the identity's
  client_id under MSI.
- Threads optional credential_kwargs through get_auth_token /
  get_raw_token / _acquire_token so future auth methods that need
  constructor args (e.g. ClientSecretCredential) can plug in via
  the same channel.
- Cache key remains a plain string for zero-arg auth types and
  becomes a tuple when kwargs are present, so different client_ids
  get separate cached credentials.

Partial fix for #534. ServicePrincipal and
Password to follow as separate PRs.
Copilot AI review requested due to automatic review settings May 12, 2026 12:09
@bewithgaurav bewithgaurav changed the title Add ActiveDirectoryMSI support for bulk copy FIX: Add ActiveDirectoryMSI support for bulk copy May 12, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds support for Authentication=ActiveDirectoryMSI (managed identity) token acquisition for the bulk copy (mssql-py-core) path, including user-assigned MSI via UID=<client_id>.

Changes:

  • Add AuthType.MSI (activedirectorymsi) and map it to ManagedIdentityCredential.
  • Thread optional credential_kwargs through token acquisition and update the credential-instance cache key to incorporate kwargs.
  • Attempt to extract MSI client_id from the connection string and pass it into bulk copy token acquisition; add tests and a changelog entry.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
mssql_python/constants.py Adds AuthType.MSI enum value.
mssql_python/auth.py Adds MSI credential support, credential kwargs plumbing, and cache key changes; adds extract_credential_kwargs.
mssql_python/cursor.py Bulk copy now tries to extract MSI credential kwargs and pass them to get_raw_token.
tests/test_008_auth.py Adds tests for MSI auth type parsing, credential construction, cache isolation, and connection-string processing.
CHANGELOG.md Documents MSI support for bulk copy.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mssql_python/cursor.py Outdated
Comment thread mssql_python/auth.py Outdated
Comment thread tests/test_008_auth.py
@bewithgaurav bewithgaurav changed the title FIX: Add ActiveDirectoryMSI support for bulk copy FEAT: Add ActiveDirectoryMSI support for bulk copy May 13, 2026
@bewithgaurav bewithgaurav marked this pull request as draft May 13, 2026 12:15
Connection.__init__ overwrites self.connection_str with the sanitized
(UID-stripped) string returned by process_connection_string. The original
implementation re-parsed self.connection_str at bulkcopy time via
extract_credential_kwargs, which silently dropped the user-assigned MSI
client_id and degraded to system-assigned  a wrong-identity bug.MSI

Changes:

- process_connection_string now returns a 4-tuple including the captured
  credential_kwargs so callers can persist them.
- Connection.__init__ stores _credential_kwargs alongside _auth_type.
- cursor.bulkcopy() reads self.connection._credential_kwargs instead of
  re-parsing self.connection_str.
- The public extract_credential_kwargs helper is removed (it only existed
  to support the broken re-parse path; nothing else needs it).
- black --line-length=100 reformats (CI was red).

Tests:

- test_bulkcopy_path_preserves_user_assigned_msi_client_id: invokes
  cursor.bulkcopy() with a mocked mssql_py_core, patches
  AADAuth.get_raw_token to capture the args it receives, and asserts
  the captured credential_kwargs match Connection._credential_kwargs.
  Fails if cursor reverts to re-parsing self.connection.connection_str.
- test_credential_kwargs_persisted_for_user_assigned_msi: asserts
  Connection.__init__ stores _credential_kwargs from the 4-tuple.
- test_credential_kwargs_none_for_system_assigned_msi.
- test_credential_kwargs_none_for_non_msi_auth.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bewithgaurav bewithgaurav marked this pull request as ready for review May 14, 2026 12:31
@github-actions github-actions Bot added the pr-size: medium Moderate update size label May 14, 2026
Comment thread mssql_python/auth.py Outdated
bewithgaurav and others added 3 commits May 14, 2026 18:40
Per @saurabh500's review: braced ODBC values like UID={hello=world} need
the canonical parser, not naive partition('='). Without this, the helper
returns '{hello=world}' verbatim and ManagedIdentityCredential rejects it.
Worse, a UID containing a literal ';' would be truncated.

_extract_msi_client_id now delegates to _ConnectionStringParser, which
handles braces, escaped '}}' inside braces, and '=' inside braced values
correctly. validate_keywords=False so the helper never raises on keys
the auth flow doesn't care about.

Tests:
- test_msi_braced_uid_value_is_unwrapped: UID={hello=world} -> 'hello=world'
- test_msi_braced_uid_with_semicolon_is_preserved: UID={abc;def;ghi}

Note: process_connection_string and extract_auth_type still use naive
split(';') for Authentication= detection across all Entra ID auth types.
That's pre-existing and tracked separately for a parser-wide refactor.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread mssql_python/auth.py
Comment thread mssql_python/auth.py
- Debug log distinguishing user-assigned vs system-assigned MSI when the
  user passes Authentication=ActiveDirectoryMSI. Helps diagnose which
  branch was taken when token acquisition fails. Logs client_id length,
  not value (still identity material).
- Comment above _credential_cache explains the cache key shape so the
  unbounded growth is understood as a deliberate choice rather than an
  oversight.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 14, 2026

📊 Code Coverage Report

🔥 Diff Coverage

100%


🎯 Overall Coverage

25%


📈 Total Lines Covered: 7039 out of 27155
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/auth.py (100%)
  • mssql_python/connection.py (100%)
  • mssql_python/constants.py (100%)

Summary

  • Total: 34 lines
  • Missing: 0 lines
  • Coverage: 100%

📋 Files Needing Attention

📉 Files with overall lowest coverage (click to expand)
mssql_python.pybind.build._deps.simdutf-src.src.haswell.implementation.cpp: 0.4%
mssql_python.pybind.build._deps.simdutf-src.src.implementation.cpp: 6.7%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.implementation.h: 10.4%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.scalar.utf16_to_utf8.utf16_to_utf8.h: 25.3%
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 59.7%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.internal.isadetection.h: 65.3%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 74.2%

🔗 Quick Links

⚙️ Build Summary 📋 Coverage Details

View Azure DevOps Build

Browse Full Coverage Report

Connection.__init__ already parses the same connection string through
_ConnectionStringParser via _construct_connection_string (connection.py
line 253) before process_connection_string is ever called. By the time
_extract_msi_client_id runs, the input is guaranteed parseable.

The try/except was dead code. A real parse failure here would indicate
an upstream bug and should propagate, not silently degrade user-assigned
MSI to system-assigned (which is the wrong-identity failure mode this PR
exists to prevent).

Brings diff coverage to 100%.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
saurabh500
saurabh500 previously approved these changes May 14, 2026
Comment thread mssql_python/auth.py
# time we get here the input is guaranteed parseable. No defensive
# try/except: a parse failure now means a real bug upstream and should
# propagate, not silently degrade user-assigned MSI to system-assigned.
parsed = _ConnectionStringParser(validate_keywords=False)._parse(connection_string)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functionally correct but a tiny perf hit due to double parsing of connections string.
Perfect solution would require parsing done earlier in the connect api and passing around the map to this function. That's a bigger change. I recommend a follow up PR and tracking with a GH issue.

An adhoc fix is to maintain a hashmap of connection string and uid. But that's prone to other problems esp concurrency.

bewithgaurav added a commit that referenced this pull request May 15, 2026
…tructure

Brings in the ActiveDirectoryMSI feature (PR #573, mostly approved) so SP
can use the shared module-level credential cache that PR #573 introduces,
instead of the closure-scoped cache the SP commit shipped with.

Conflicts resolved (kept both contributions):
- mssql_python/constants.py: AuthType.MSI + AuthType.SERVICE_PRINCIPAL
- mssql_python/auth.py: _parse_tenant_id + ServicePrincipalAuth +
  _extract_msi_client_id; both elif branches in process_auth_parameters;
  both keys in extract_auth_type's auth_map
- mssql_python/cursor.py: SP factory branch on _auth_type=='serviceprincipal',
  Model A branch threads credential_kwargs through AADAuth.get_raw_token
  (MSI's contract change)
- tests/test_008_auth.py: MockClientSecretCredential + MockManagedIdentityCredential
  in setup_azure_identity, both sets of TestProcessAuthParameters /
  TestExtractAuthType cases

ServicePrincipalAuth.make_token_factory now uses the shared
_credential_cache + _credential_cache_key (introduced for MSI), keyed by
('serviceprincipal', tenant_id, client_id). Closure-scoped cache and the
TODO(#573-followup) note removed. client_secret is intentionally NOT in
the cache key because credentials are looked up by identity, not secret;
a rotated secret will surface as ClientAuthenticationError, not a silent
hit on the wrong credential.

Tests: 86 passed in tests/test_008_auth.py (was 75 SP-only). Black clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
# On Windows Interactive, process_connection_string returns None
# (DDBC handles auth natively), so fall back to the connection string.
self._auth_type = connection_result[2] or extract_auth_type(self.connection_str)
self._credential_kwargs = connection_result[3]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

process_connection_string changed from 3-tuple to 4-tuple return. This is internal so it's fine, but worth noting in the docstring that callers must unpack all 4. If a third-party or test is doing a, b, c = process_connection_string(...), they'll get a ValueError with no obvious cause. The existing tests were all updated - just flagging for awareness.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call. the 4-tuple is interim. I have filed #580 to refactor process_connection_string to take the parsed connection-string map instead of the raw string (also avoids the 3x parse on the connect path). once that lands, the 4-tuple goes away and callers move to dict access.

for now, added a short docstring note about the 4-tuple to cover anyone reaching in before #580 lands.

Tracking refactor (parse-once, thread the parsed map through the auth
path) is a separate follow-up; this docstring helps anyone reaching
into the function before that lands.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-size: medium Moderate update size

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants