feat(scraper): add Azure Foundry scraper for OpenAI and Anthropic models #1

Open
pranav671 wants to merge 14 commits into quora:master from pranav671:master

@pranav671

Summary

This PR adds a new scraper for the Azure Foundry model deprecation schedule, specifically tracking Azure OpenAI and Anthropic (partner/community) models.

It scrapes the official Azure Foundry OpenAI Model Retirement Schedule and extracts model names, versions, lifecycle statuses, retirement/shutdown dates, and replacement recommendations.


Motivation

I wanted to have model deprecations tracked for my application, and this repository has been excellent for that purpose. However, it was missing Azure Foundry (OpenAI & Anthropic) tracking, so I implemented this scraper to bridge the gap.


Key Implementation Details

  • Target Filtering: Processes only the Azure OpenAI (h3#azure-openai) and Anthropic (h3#anthropic) sections. All other provider tables (Cohere, Mistral, Meta, etc.) are intentionally ignored.
  • Exclusions: Explicitly excludes fine-tuned models, OSS models, and community models from other providers.
  • Section Termination: Sibling iteration stops immediately at any heading tag to prevent nested tables (like h4#fine-tuned-models) from leaking into standard model results.
  • Model Standardisation: Combines the Model and Version columns into the standard Model (Version) format when a version is present.
  • Garbled Encoding & Empty-Value Protection: Implements an _is_empty_value() helper that treats all dash variants (ASCII hyphen, en dash U+2013, em dash U+2014) as empty and strips artifacts of UTF-8 text decoded as Latin-1 (e.g. â followed by control characters, produced when an em dash is mis-decoded). This prevents garbage strings from being appended to model names or appearing in the replacement column.
  • Status Mapping: Maps Azure's lifecycle terms to statuses (GA/Preview → active, Deprecated → deprecated, Retired → retired) and promotes any entry to retired if its shutdown date is already in the past.
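
The empty-value guard and the Model (Version) formatting described above can be sketched roughly as follows. This is a minimal illustration, not the code in scraper/azure_foundry_scraper.py: the names `is_empty_value` and `format_model_name` are hypothetical stand-ins, and the exact set of mis-decoding artifacts handled is an assumption.

```python
import re

# Dash characters treated as "no value": ASCII hyphen, en dash, em dash.
_DASHES = "-\u2013\u2014"

# Assumption: an en/em dash whose UTF-8 bytes were decoded as Latin-1 shows up
# as U+00E2 followed by C1 control characters; decoded as cp1252 it shows up as
# U+00E2 followed by U+20AC and a curly quote. Both shapes are matched here.
_MOJIBAKE = re.compile("\u00e2[\u0080-\u009f\u20ac\u201c\u201d]*")


def is_empty_value(text):
    """Return True when a table cell carries no real value (hypothetical helper)."""
    if text is None:
        return True
    cleaned = _MOJIBAKE.sub("", text)
    # Drop dashes and whitespace; anything left over counts as real content.
    cleaned = "".join(
        ch for ch in cleaned if ch not in _DASHES and not ch.isspace()
    )
    return cleaned == ""


def format_model_name(model, version):
    """Combine the Model and Version cells as 'Model (Version)' when a version exists."""
    model = model.strip()
    if is_empty_value(version):
        return model
    return f"{model} ({version.strip()})"


if __name__ == "__main__":
    # Simulate an em dash that was UTF-8 encoded and then decoded as Latin-1.
    garbled = "\u2014".encode("utf-8").decode("latin-1")
    print(format_model_name("gpt-4o", "2024-05-13"))
    print(format_model_name("gpt-4o", garbled))
```

Keeping the emptiness check separate from the formatting step means the same guard can also be applied to the replacement column.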

Verification & Tests

  • Workflow Run: Successfully executed the automatic README update workflow in GitHub Actions: Workflow Run #27218233737.
  • Workflow Commit: Verified that the generated table aligns perfectly with the repository's format: Commit ad94712.
  • Scraper Integration Tests: Added TestAzureFoundryScraper in tests/test_scrapers.py covering 8 integration scenarios (including standard OpenAI parsing, Anthropic parsing, section bounds, status mapping, and exclusions) using a local HTML fixture (tests/fixtures/azure_foundry.html).
  • Encoding & Helper Tests: Added TestIsEmptyValue with 9 unit tests to explicitly verify that various empty, dash, and encoding-corrupted values (like â artifacts) are handled and stripped correctly.
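
To illustrate what the section-bounds scenario is guarding against, here is a small self-contained sketch. It uses the standard library's `html.parser` and walks the document in order rather than iterating siblings, so it is a stand-in for the scraper's actual traversal, not a copy of it. The class name `SectionRows`, the helper `rows_for`, and the inline HTML are all hypothetical; the real tests use tests/fixtures/azure_foundry.html.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SectionRows(HTMLParser):
    """Collect table rows after <h3 id=target_id> until the next heading of any level."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.in_section = False
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            # Any heading ends the current section; only the target h3 starts one.
            self.in_section = tag == "h3" and dict(attrs).get("id") == self.target_id
        elif self.in_section and tag == "tr":
            self._row = []
        elif self._row is not None and tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Made-up fixture: an h4 sub-section sits between the two target sections.
SAMPLE = """
<h3 id="azure-openai">Azure OpenAI</h3>
<table><tr><th>Model</th><th>Version</th></tr>
<tr><td>model-a</td><td>1</td></tr></table>
<h4 id="fine-tuned-models">Fine-tuned models</h4>
<table><tr><td>model-a-ft</td><td>1</td></tr></table>
<h3 id="anthropic">Anthropic</h3>
<table><tr><td>model-b</td><td>2</td></tr></table>
"""


def rows_for(section_id, html=SAMPLE):
    parser = SectionRows(section_id)
    parser.feed(html)
    parser.close()
    return parser.rows
```

With this fixture, `rows_for("azure-openai")` yields only the header row and the `model-a` row; the fine-tuned table under the h4 is left out because the h4 closes the section.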

Future Thoughts

  • Refactoring Scrapers: Planning a future PR to refactor existing scrapers by moving shared helper functions (like date parsing, HTML table column matching, and empty-cell normalization) to a common utilities module under the scraper/ folder to clean up duplication.

Scrapes the Microsoft Foundry model retirement schedule page and extracts
deprecation/retirement entries for two model families:
- Azure OpenAI  (h3#azure-openai)
- Anthropic     (h3#anthropic, partner/community section)

All other families (Cohere, Meta, Mistral, xAI, fine-tuned, OSS, etc.)
are intentionally excluded.

Key implementation details:
- Combines Model + Version columns into model_name as 'Model (Version)'
  when a real version is present; omits version when cell is empty/dash
- Maps Lifecycle column (GA/Preview → active, Deprecated → deprecated,
  Retired → retired) and auto-promotes status to 'retired' when the
  retirement date is already in the past
- Stops sibling iteration at ANY heading tag to prevent sub-section
  tables (e.g. h4#fine-tuned-models) from leaking into results
- _is_empty_value() helper normalises all dash variants (ASCII hyphen,
  en/em dash U+2013/U+2014) and UTF-8→Latin-1 encoding artefacts
  (e.g. 'â' + control chars from mis-decoded em dash) so that empty
  version/replacement cells are never appended to model names
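
The Lifecycle mapping and past-date promotion listed above could look roughly like the sketch below. The function name `map_status`, the fallback to 'active' for unrecognised lifecycle values, and the strict "earlier than today" comparison are assumptions made for illustration; the commit does not spell them out.

```python
from datetime import date

# Lifecycle column value (lower-cased) -> status.
LIFECYCLE_TO_STATUS = {
    "ga": "active",
    "preview": "active",
    "deprecated": "deprecated",
    "retired": "retired",
}


def map_status(lifecycle, retirement_date=None, today=None):
    """Map a Lifecycle cell to a status, promoting to 'retired' for past dates."""
    today = today or date.today()
    # Assumption: unrecognised lifecycle values fall back to 'active'.
    status = LIFECYCLE_TO_STATUS.get(lifecycle.strip().lower(), "active")
    # Assumption: a retirement date equal to today is not yet "in the past".
    if retirement_date is not None and retirement_date < today:
        status = "retired"
    return status


if __name__ == "__main__":
    fixed_today = date(2025, 1, 1)
    print(map_status("GA", date(2030, 1, 1), today=fixed_today))
    print(map_status("Preview", date(2024, 1, 1), today=fixed_today))
```

Passing `today` explicitly keeps the promotion rule deterministic in tests.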

Adds:
- scraper/azure_foundry_scraper.py
- tests/fixtures/azure_foundry.html
- TestIsEmptyValue (9 unit tests for the encoding-artefact guard)
- TestAzureFoundryScraper (8 integration tests)