Description
When a CLAMS app wraps a single off-the-shelf tool, we use the `analyzer_version` field in the app metadata to specify the version of that tool. This works well when the app wraps one library, application, or model.
However, this approach is insufficient for apps designed to wrap entire families of models from the Hugging Face Hub. Models on the hub are version-controlled with git, so the most precise version identifier is a commit hash. A single CLAMS app, however, can wrap multiple models from a family like openai/whisper-*. Each model in such a family (whisper-large-v3, whisper-small, etc.) resides in its own git repository and has an independent commit history. Consequently, no single version string or commit hash can accurately represent the "family" as a whole.
For reference, the openai/whisper family, as of writing, consists of the following 12 models, each with its own repository and latest commit:
| Model Name | Last Update Date | Commit |
|---|---|---|
| openai/whisper-large-v3-turbo | Oct 4, 2024 | 41f01f3fe87f28c78e2fbf8b568835947dd65ed9 |
| openai/whisper-large-v3 | Aug 12, 2024 | 06f233fe06e710322aca913c1bc4249a0d71fce1 |
| openai/whisper-large-v2 | Feb 29, 2024 | ae4642769ce2ad8fc292556ccea8e901f1530655 |
| openai/whisper-large | Feb 29, 2024 | 4ef9b41f0d4fe232daafdb5f76bb1dd8b23e01d7 |
| openai/whisper-medium | Feb 29, 2024 | abdf7c39ab9d0397620ccaea8974cc764cd0953e |
| openai/whisper-small | Feb 29, 2024 | 973afd24965f72e36ca33b3055d56a652f456b4d |
| openai/whisper-tiny | Feb 29, 2024 | 169d4a4341b33bc18d8881c4b69c2e104e1cc0af |
| openai/whisper-base | Feb 29, 2024 | e37978b90ca9030d5170a5c07aadb050351a65bb |
| openai/whisper-medium.en | Jan 22, 2024 | 2e98eb6279edf5095af0c8dedb36bdec0acd172b |
| openai/whisper-small.en | Jan 22, 2024 | e8727524f962ee844a7319d92be39ac1bd25655a |
| openai/whisper-tiny.en | Jan 22, 2024 | 87c7102498dcde7456f24cfd30239ca606ed9063 |
| openai/whisper-base.en | Jan 22, 2024 | 911407f4214e0e1d82085af863093ec0b66f9cd6 |
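Because each repository has its own history, a table like the one above has to be assembled by querying every repository separately. A minimal sketch of that step follows, assuming the `huggingface_hub` package is installed; `collect_latest_commits` and `latest_commit_from_hub` are hypothetical helper names, and the lookup is injectable so the logic can be exercised without network access.

```python
from typing import Callable, Dict, Iterable


def latest_commit_from_hub(repo_id: str) -> str:
    """Return the latest commit hash on the default branch of a Hub model repo."""
    # Imported lazily so the rest of this sketch works without the package
    # installed; calling this function requires network access.
    from huggingface_hub import HfApi

    return HfApi().model_info(repo_id).sha


def collect_latest_commits(
    repo_ids: Iterable[str],
    lookup: Callable[[str], str] = latest_commit_from_hub,
) -> Dict[str, str]:
    """Query each repository independently and map repo id -> commit hash."""
    return {repo_id: lookup(repo_id) for repo_id in repo_ids}
```

Calling `collect_latest_commits(["openai/whisper-tiny", "openai/whisper-base"])` would return one hash per repository, which is exactly why a single hash cannot stand in for the family.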
Proposed Solution
I propose a two-part solution that combines developer best practices with a new metadata convention:
- To ensure reproducibility, app developers should explicitly pin model versions in their code. For example, when calling `from_pretrained()`, the `revision` parameter should always be set to a specific commit hash. This should apply whenever a HF model is used, regardless of whether the app is a simple wrapper or uses the HF model as a "backbone". For example,
  ```python
  from clams import ClamsApp
  # Illustrative model class; other `from_pretrained()` loaders take `revision` the same way
  from transformers import WhisperForConditionalGeneration

  # Commit hash pinned for each supported model
  model_pins = {
      "openai/whisper-large-v3": "06f233fe06e710322aca913c1bc4249a0d71fce1",
      "openai/whisper-large-v3-turbo": "41f01f3fe87f28c78e2fbf8b568835947dd65ed9",
      # ...
  }

  class SomeClamsWhisperApp(ClamsApp):
      ...
      def _annotate(self, mmif, params):
          model_name = params.get("model_name")
          # Load exactly the pinned commit instead of the repository's current head
          whisper = WhisperForConditionalGeneration.from_pretrained(
              model_name, revision=model_pins[model_name])
  ```
- Then, in the app metadata for apps that wrap a family of models, developers should use a date-based version string with a special prefix to indicate that the app uses the latest model versions available as of that date. For example: `"analyzer_version": "hf-250619"`. This convention clearly indicates:
  - The use of Hugging Face models (via the `hf-` prefix).
  - A "snapshot" date (`YYMMDD`) that establishes a reproducible baseline for all models in the family.
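To make the convention concrete, here is a minimal sketch of how an `hf-YYMMDD` string could be produced, parsed, and used to select pins. All function names are hypothetical, and the commit history is passed in as plain data (repository name mapped to `(commit date, commit hash)` pairs), so nothing here depends on the Hub API.

```python
from datetime import date, datetime
from typing import Dict, List, Tuple

HF_PREFIX = "hf-"


def format_analyzer_version(snapshot: date) -> str:
    """Render a snapshot date as an `hf-YYMMDD` version string."""
    return f"{HF_PREFIX}{snapshot:%y%m%d}"


def parse_analyzer_version(version: str) -> date:
    """Recover the snapshot date from an `hf-YYMMDD` version string."""
    stamp = version[len(HF_PREFIX):]
    if not version.startswith(HF_PREFIX) or len(stamp) != 6 or not stamp.isdigit():
        raise ValueError(f"not an hf-YYMMDD version string: {version!r}")
    return datetime.strptime(stamp, "%y%m%d").date()


def pins_as_of(
    history: Dict[str, List[Tuple[date, str]]], snapshot: date
) -> Dict[str, str]:
    """For each repo, pick the newest commit dated on or before the snapshot."""
    pins = {}
    for repo, commits in history.items():
        eligible = [c for c in commits if c[0] <= snapshot]
        if eligible:  # repos with no commit by the snapshot date are skipped
            pins[repo] = max(eligible, key=lambda c: c[0])[1]
    return pins


# Usage with two rows from the table above
history = {
    "openai/whisper-large-v3": [
        (date(2024, 8, 12), "06f233fe06e710322aca913c1bc4249a0d71fce1"),
    ],
    "openai/whisper-large-v3-turbo": [
        (date(2024, 10, 4), "41f01f3fe87f28c78e2fbf8b568835947dd65ed9"),
    ],
}
snapshot = parse_analyzer_version("hf-250619")
pins = pins_as_of(history, snapshot)
```

With the `hf-250619` snapshot, both models resolve to the hashes listed in the table. An earlier snapshot such as September 1, 2024 would leave out whisper-large-v3-turbo, whose only listed commit is dated later.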