feat(scraper): add Azure Foundry scraper for OpenAI and Anthropic models #1
Open
pranav671 wants to merge 14 commits into
Scrapes the Microsoft Foundry model retirement schedule page and extracts deprecation/retirement entries for two model families:
- Azure OpenAI (h3#azure-openai)
- Anthropic (h3#anthropic, partner/community section)

All other families (Cohere, Meta, Mistral, xAI, fine-tuned, OSS, etc.) are intentionally excluded.

Key implementation details:
- Combines Model + Version columns into model_name as 'Model (Version)' when a real version is present; omits version when cell is empty/dash
- Maps Lifecycle column (GA/Preview → active, Deprecated → deprecated, Retired → retired) and auto-promotes status to 'retired' when the retirement date is already in the past
- Stops sibling iteration at ANY heading tag to prevent sub-section tables (e.g. h4#fine-tuned-models) from leaking into results
- _is_empty_value() helper normalises all dash variants (ASCII hyphen, en/em dash U+2013/U+2014) and UTF-8→Latin-1 encoding artefacts (e.g. 'â' + control chars from mis-decoded em dash) so that empty version/replacement cells are never appended to model names

Adds:
- scraper/azure_foundry_scraper.py
- tests/fixtures/azure_foundry.html
- TestIsEmptyValue (9 unit tests for the encoding-artefact guard)
- TestAzureFoundryScraper (8 integration tests)
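For reviewers who want a concrete picture of the empty-cell guard, here is a minimal sketch of how such a check and the 'Model (Version)' combination could work. It is not the code from scraper/azure_foundry_scraper.py: the names `is_empty_value` and `format_model_name`, the regex, and the choice to treat a bare 'â' as empty are all assumptions made for illustration.

```python
import re

# Dash-like placeholders: ASCII hyphen, en dash (U+2013), em dash (U+2014).
_DASHES = {"-", "\u2013", "\u2014"}

# An em dash is encoded in UTF-8 as bytes E2 80 94. Decoding those bytes as
# Latin-1 yields 'â' (U+00E2) followed by two C1 control characters.
# Assumption: a bare 'â' (controls already stripped) also counts as empty.
_MOJIBAKE_DASH = re.compile(r"\u00e2[\u0080-\u009f]*")


def is_empty_value(text):
    """Return True when a table cell holds no real value (illustrative only)."""
    if text is None:
        return True
    cleaned = text.strip()
    if not cleaned or cleaned in _DASHES:
        return True
    return _MOJIBAKE_DASH.fullmatch(cleaned) is not None


def format_model_name(model, version):
    """Combine Model and Version cells as 'Model (Version)' (hypothetical helper)."""
    model = model.strip()
    if is_empty_value(version):
        # Empty or dash-only version cell: keep the bare model name.
        return model
    return f"{model} ({version.strip()})"
```

With this sketch, a version cell containing only an em dash, or its mis-decoded form, leaves the model name unchanged, while a real version string is appended in parentheses.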
Summary
This PR adds a new scraper for the Azure Foundry model deprecation schedule, specifically tracking Azure OpenAI and Anthropic (partner/community) models.
It scrapes the official Azure Foundry OpenAI Model Retirement Schedule and extracts model names, versions, lifecycle statuses, retirement/shutdown dates, and replacement recommendations.
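As a rough illustration of how the lifecycle status and retirement date could be folded into a single status value, the sketch below maps the Lifecycle cell and promotes the result to retired once the date has passed. The function name `resolve_status`, the `today` parameter, and the fallback to `active` for unrecognised lifecycle values are assumptions for this example and may differ from the scraper's actual code.

```python
from datetime import date

# Lifecycle column value (lower-cased) -> tracked status.
LIFECYCLE_TO_STATUS = {
    "ga": "active",
    "preview": "active",
    "deprecated": "deprecated",
    "retired": "retired",
}


def resolve_status(lifecycle, retirement_date=None, today=None):
    """Map a Lifecycle cell to a status, promoting past-dated entries to 'retired'.

    Illustrative only: `today` is injectable so the behaviour can be tested
    deterministically, and unknown lifecycle values fall back to 'active'
    (an assumption, not confirmed by the PR).
    """
    if today is None:
        today = date.today()
    status = LIFECYCLE_TO_STATUS.get(lifecycle.strip().lower(), "active")
    # A retirement date strictly before today overrides the Lifecycle column.
    if retirement_date is not None and retirement_date < today:
        return "retired"
    return status
```

For example, an entry still marked Preview whose retirement date is last year would resolve to retired, while the same entry with a future date stays active.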
Motivation
I wanted to have model deprecations tracked for my application, and this repository has been excellent for that purpose. However, it was missing Azure Foundry (OpenAI & Anthropic) tracking, so I implemented this scraper to bridge the gap.
Key Implementation Details
- Targets only the Azure OpenAI (`h3#azure-openai`) and Anthropic (`h3#anthropic`) models. All other provider tables (Cohere, Mistral, Meta, etc.) are intentionally ignored.
- Stops sibling iteration at any heading tag to prevent sub-section tables (e.g. `h4#fine-tuned-models`) from leaking into standard model results.
- Combines the `Model` and `Version` columns into the standard `Model (Version)` format when a version is present.
- Adds an `_is_empty_value()` helper that normalizes all dash variants (ASCII hyphen, en dash –, em dash —) and strips garbled UTF-8 to Latin-1 decoding artifacts (e.g. â + control characters arising from em dash mis-decoding). This prevents garbage strings from being appended to model names or appearing in the replacement column.
- Maps the Lifecycle column (`GA`/`Preview` to `active`, `Deprecated` to `deprecated`, `Retired` to `retired`) and auto-promotes any entry's status to `retired` if the shutdown date is already in the past.
Verification & Tests
- Adds `TestAzureFoundryScraper` in `tests/test_scrapers.py`, covering 8 integration scenarios (standard OpenAI parsing, Anthropic parsing, section bounds, status mapping, and exclusions) using a local HTML fixture (`tests/fixtures/azure_foundry.html`).
- Adds `TestIsEmptyValue` with 9 unit tests that verify empty, dash, and encoding-corrupted values (like â artifacts) are detected and stripped correctly.
Future Thoughts
`scraper/` folder to clean up duplication.