feat: discover and parse project-maintained affiliations (CM-361)#4280

Open
skwowet wants to merge 23 commits into main from feat/CM-361-part-1

skwowet (Collaborator) commented Jun 30, 2026

Summary

This PR adds processing for project-maintained contributor-to-employer files inside git_integration.

Many LF projects publish who works for which company in repo files (gitdm, .organizationmap, SIG YAML, etc.). Formats and filenames vary, so we reuse the existing clone pipeline: find the file on disk, parse it with AI into a normalized snapshot, track per-repo state, and resolve rows against CDP members and organizations.

The worker runs this on the first clone batch, alongside maintainer processing. Each repo gets one registry row (git.repoAffiliationRegistry) plus a serviceExecutions record with operation type Affiliation.

Run flow

Interval elapsed?
  no  → skip run
  yes ↓

Resolve file path
  saved path still on disk? → use it
  else discover:
    list text-like files at repo root → AI picks from candidates (path + 400 char preview per file)
  ↓

Parse
  file hash unchanged and snapshot valid? → reuse snapshot (no AI)
  else → AI extract flat rows, group by contributor into snapshot
  ↓

Fork? → drop stints already on parent repo's cached snapshot

Apply (dry run in this PR)
  for each stint: lookup member + org, run guards, build MO/MSA inserts
  actual INSERT calls are commented out
  ↓

Update registry (filePath, fileHash, status, snapshot) + serviceExecutions

Parsed snapshot shape

Each snapshot entry is one contributor with their organization stints over time:

  • Contributor: email or GitHub username required (email preferred for grouping and lookup). Name is optional.
  • Organizations: each stint has domain (required for CDP lookup), optional name, optional dateStart/dateEnd, and isUnaffiliated for independent/unaffiliated rows (normalized to individual-noaccount.com). AI returns flat contributor–org rows; code groups by contributor and dedupes stints. Rows missing email/github or domain are dropped during normalization.

Registry status: success (including zero-row parses), not_found, unusable (parsed but nothing applyable; stable until file content changes), error (unexpected parse failure; retry on next interval).

Apply rules (implemented, writes disabled for now)

  • Lookup only: no find-or-create for members or orgs. Unresolved rows are skipped and retried on the next run using the same snapshot.
  • Member identity: email first (git username identity), else GitHub username.
  • Org identity: verified primary-domain only (no display-name lookup).
  • Guards: skip MO/MSA insert when the same member + org + date range already exists, or an undated/open-ended row already covers an undated insert (MSA scoped to the repo segment).
  • MO source would be project-registry when inserts are enabled.

Repos without a segment skip apply. Unexpected parse failures set registry status to error, clear fileHash, and keep the previous snapshot so the next interval run retries parsing.

Changes

AffiliationService (services/affiliation/affiliation_service.py)

  • Root-level text file discovery; AI file picker on all candidates (no known-filename list or ripgrep). Batches candidates (20 per call) with short file previews; rejects scripts/governance/credits in the picker prompt.
  • AI extraction returns flat rows with optional dates and unaffiliated markers; group_parse_rows merges into contributor + organizations[] snapshot shape.
  • AI extraction with chunked parsing for files over 5k chars.
  • SHA-gated snapshot caching to avoid re-parsing unchanged files.
  • Separate update interval (24h when a file is found) and retry interval (30d when not found).
  • Fork repos: filter stints already present on parent repo snapshot before apply.
  • Full apply path: bulk identity resolution, dedupe, stint-level guard checks, insert list building. MO/MSA executemany calls commented out with a TODO.
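
The batching, chunking, and fork-filter steps listed above can be sketched as below. The function names are hypothetical, splitting on line boundaries is an assumed chunking strategy (the description only states the 5k-char threshold), and stints are modeled as plain tuples for brevity.

```python
from collections.abc import Iterator

PICKER_BATCH_SIZE = 20  # candidates per AI picker call (from the description)
CHUNK_CHARS = 5_000     # files above this size are parsed in chunks


def batch_candidates(
    paths: list[str], size: int = PICKER_BATCH_SIZE
) -> Iterator[list[str]]:
    """Yield candidate file paths in fixed-size batches for the picker."""
    for start in range(0, len(paths), size):
        yield paths[start:start + size]


def chunk_by_lines(text: str, limit: int = CHUNK_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` chars, cutting only at line ends.

    A single line longer than `limit` becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    length = 0
    for line in text.splitlines(keepends=True):
        if current and length + len(line) > limit:
            chunks.append("".join(current))
            current, length = [], 0
        current.append(line)
        length += len(line)
    if current:
        chunks.append("".join(current))
    return chunks


def drop_parent_stints(fork_stints: list[tuple], parent_stints: list[tuple]) -> list[tuple]:
    """Keep only fork stints that are not already on the parent's snapshot."""
    parent = set(parent_stints)
    return [stint for stint in fork_stints if stint not in parent]
```

Cutting at line ends keeps each contributor row intact within a chunk, which matters for formats such as gitdm where one line is one mapping.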

Worker wiring (repository_worker.py, server.py)

  • AffiliationService registered and invoked on is_first_batch after maintainer processing.

Database layer (crud.py)

  • Registry get/upsert and snapshot JSON serialize/deserialize via Pydantic.
  • Bulk member/org identity lookups (single query per entity type, index-aligned results).
  • Fetch existing MO and segment-scoped MSA for guards.
  • Insert helpers for MO (ON CONFLICT DO NOTHING aligned to partial unique indexes) and MSA.

Shared infrastructure

  • Bedrock client moved from maintainer/bedrock.py to services/llm/bedrock.py; maintainer import updated.
  • New Pydantic models (affiliation_info.py), error types, AffiliationRegistryStatus, OperationType.AFFILIATION, and AFFILIATION_* settings in settings.py / conftest.py.

Signed-off-by: Yeganathan S <63534555+skwowet@users.noreply.github.com>
skwowet requested a review from mbani01 June 30, 2026 09:59
skwowet self-assigned this Jun 30, 2026
Copilot AI review requested due to automatic review settings June 30, 2026 09:59

Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces a new AffiliationService to the git_integration worker pipeline to discover and AI-parse project-maintained contributor→employer/organization mapping files, cache parsed snapshots per repo, and (currently) dry-run the apply/guard logic while recording service executions and registry state.

Changes:

  • Added AffiliationService and wired it into RepositoryWorker/server lifespan for first-clone batches.
  • Added Bedrock LLM client module under services/llm/ and updated maintainer processing to import it.
  • Extended the git_integration DB CRUD layer with affiliation registry read/upsert plus bulk identity lookup helpers.

Reviewed changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `services/apps/git_integration/src/test/conftest.py` | Adds default env vars for affiliation retry/update intervals in tests. |
| `services/apps/git_integration/src/crowdgit/worker/repository_worker.py` | Injects and invokes AffiliationService in the repository processing pipeline. |
| `services/apps/git_integration/src/crowdgit/settings.py` | Adds AFFILIATION_* interval settings. |
| `services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py` | Switches Bedrock import to the new shared LLM module. |
| `services/apps/git_integration/src/crowdgit/services/llm/bedrock.py` | New shared Bedrock invocation helper (retry, JSON extraction, pydantic validation, cost calc). |
| `services/apps/git_integration/src/crowdgit/services/llm/__init__.py` | Initializes the new llm service package. |
| `services/apps/git_integration/src/crowdgit/services/affiliation/affiliation_service.py` | Implements discovery, AI parsing (including chunking), snapshot caching, and dry-run apply/guards. |
| `services/apps/git_integration/src/crowdgit/services/__init__.py` | Exports AffiliationService for DI/wiring. |
| `services/apps/git_integration/src/crowdgit/server.py` | Instantiates and passes AffiliationService into RepositoryWorker. |
| `services/apps/git_integration/src/crowdgit/models/affiliation_info.py` | Adds pydantic models for file-picker and parsed affiliation snapshot rows. |
| `services/apps/git_integration/src/crowdgit/errors.py` | Adds new affiliation-specific error types. |
| `services/apps/git_integration/src/crowdgit/enums.py` | Adds new error codes, registry status enum, and OperationType.REPO_AFFILIATION. |
| `services/apps/git_integration/src/crowdgit/database/crud.py` | Adds registry CRUD + snapshot (de)serialization + bulk identity lookup helpers + insert helpers. |


Comment thread services/apps/git_integration/src/crowdgit/errors.py
Comment thread services/apps/git_integration/src/crowdgit/errors.py Outdated
skwowet changed the title from "feat: discover and parse project-maintained affiliations" to "feat: discover and parse project-maintained affiliations (CM-361)" Jun 30, 2026
skwowet added 2 commits June 30, 2026 16:05
Signed-off-by: Yeganathan S <63534555+skwowet@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 30, 2026 10:35

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated 1 comment.

Copilot AI review requested due to automatic review settings June 30, 2026 10:39

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated 4 comments.

Comment thread services/apps/git_integration/src/crowdgit/database/crud.py Outdated
skwowet added 2 commits June 30, 2026 16:50
Signed-off-by: Yeganathan S <63534555+skwowet@users.noreply.github.com>
Signed-off-by: Yeganathan S <63534555+skwowet@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 30, 2026 12:25

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated 2 comments.

Comment thread services/apps/git_integration/src/crowdgit/database/crud.py Outdated
Comment thread services/apps/git_integration/src/crowdgit/database/crud.py Outdated
Comment thread services/apps/git_integration/src/crowdgit/database/crud.py Outdated
Comment thread services/apps/git_integration/src/crowdgit/database/crud.py Outdated
Comment thread services/apps/git_integration/src/crowdgit/database/crud.py Outdated
skwowet added 2 commits July 1, 2026 00:01
Signed-off-by: Yeganathan S <63534555+skwowet@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 30, 2026 18:31
skwowet added 2 commits July 1, 2026 00:07
…de text file extensions

Signed-off-by: Yeganathan S <63534555+skwowet@users.noreply.github.com>
Signed-off-by: Yeganathan S <63534555+skwowet@users.noreply.github.com>
skwowet requested a review from mbani01 June 30, 2026 18:47

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated 4 comments.

Comment thread services/apps/git_integration/src/crowdgit/database/crud.py
Comment thread services/apps/git_integration/src/crowdgit/database/crud.py Outdated
skwowet added 2 commits July 1, 2026 21:51
Signed-off-by: Yeganathan S <63534555+skwowet@users.noreply.github.com>
Signed-off-by: Yeganathan S <63534555+skwowet@users.noreply.github.com>
Copilot AI review requested due to automatic review settings July 1, 2026 17:26

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated 2 comments.

skwowet added 2 commits July 1, 2026 23:16
Signed-off-by: Yeganathan S <63534555+skwowet@users.noreply.github.com>
Signed-off-by: Yeganathan S <63534555+skwowet@users.noreply.github.com>
Copilot AI review requested due to automatic review settings July 1, 2026 17:48
Signed-off-by: Yeganathan S <63534555+skwowet@users.noreply.github.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated 1 comment.

Copilot AI review requested due to automatic review settings July 1, 2026 18:00

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated 1 comment.

Copilot AI review requested due to automatic review settings July 2, 2026 09:45

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated 1 comment.

Comment thread services/apps/git_integration/src/crowdgit/database/crud.py Outdated
skwowet added 2 commits July 2, 2026 22:30
Signed-off-by: Yeganathan S <63534555+skwowet@users.noreply.github.com>
…ntity

Signed-off-by: Yeganathan S <63534555+skwowet@users.noreply.github.com>
Copilot AI review requested due to automatic review settings July 2, 2026 17:35

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated 2 comments.

Comment thread services/apps/git_integration/src/crowdgit/models/affiliation_info.py Outdated
skwowet added 2 commits July 2, 2026 23:23
…pe and add date parsing util

Signed-off-by: Yeganathan S <63534555+skwowet@users.noreply.github.com>
Copilot AI review requested due to automatic review settings July 2, 2026 17:54

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated 2 comments.

Comment on lines +352 to +356
def _parse_optional_date(value: str | None) -> date | None:
    stripped = AffiliationService._strip(value)
    if not stripped:
        return None
    return date.fromisoformat(stripped)
Comment on lines +605 to +619
for idx, identity in enumerate(identities):
    values_parts.append(
        f"(${param_index}::int, ${param_index + 1}::text, ${param_index + 2}::boolean,"
        f" ${param_index + 3}::text, ${param_index + 4}::text)"
    )
    params.extend(
        [
            idx,
            identity["type"],
            identity.get("verified", True),
            identity.get("platform"),
            identity["value"],
        ]
    )
    param_index += 5