`analyzer_version` when using HF model family

When a CLAMS app wraps a single off-the-shelf tool, we use the `analyzer_version` field in the app metadata to specify the version of that component. This works well when the app wraps one library, tool, or model.

However, this approach is insufficient for apps designed to wrap an entire family of models from the Hugging Face Hub. Models on the Hub are version-controlled with git, so the most precise version identifier is a commit hash. A single CLAMS app, though, can wrap multiple models from a family like `openai/whisper-*`. Each model in such a family (`whisper-large-v3`, `whisper-small`, etc.) resides in its own git repository and has an independent commit history. Consequently, no single version string or commit hash can accurately represent the whole family.

For reference, the `openai/whisper` family, as of writing, consists of the following 12 models, each with a unique repository and latest commit:

| Model Name | Last Update Date | Commit |
| :--- | :--- | :--- |
| openai/whisper-large-v3-turbo | Oct 4, 2024 | `41f01f3fe87f28c78e2fbf8b568835947dd65ed9` |
| openai/whisper-large-v3 | Aug 12, 2024 | `06f233fe06e710322aca913c1bc4249a0d71fce1` |
| openai/whisper-large-v2 | Feb 29, 2024 | `ae4642769ce2ad8fc292556ccea8e901f1530655` |
| openai/whisper-large | Feb 29, 2024 | `4ef9b41f0d4fe232daafdb5f76bb1dd8b23e01d7` |
| openai/whisper-medium | Feb 29, 2024 | `abdf7c39ab9d0397620ccaea8974cc764cd0953e` |
| openai/whisper-small | Feb 29, 2024 | `973afd24965f72e36ca33b3055d56a652f456b4d` |
| openai/whisper-tiny | Feb 29, 2024 | `169d4a4341b33bc18d8881c4b69c2e104e1cc0af` |
| openai/whisper-base | Feb 29, 2024 | `e37978b90ca9030d5170a5c07aadb050351a65bb` |
| openai/whisper-medium.en | Jan 22, 2024 | `2e98eb6279edf5095af0c8dedb36bdec0acd172b` |
| openai/whisper-small.en | Jan 22, 2024 | `e8727524f962ee844a7319d92be39ac1bd25655a` |
| openai/whisper-tiny.en | Jan 22, 2024 | `87c7102498dcde7456f24cfd30239ca606ed9063` |
| openai/whisper-base.en | Jan 22, 2024 | `911407f4214e0e1d82085af863093ec0b66f9cd6` |
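
The hashes in a table like this can be gathered programmatically instead of by hand. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and the Hub is reachable. `hub_sha`, `latest_commits`, and the `get_sha` parameter are hypothetical names introduced here for illustration; `get_sha` exists only so the lookup can be swapped out offline.

```python
from typing import Callable, Dict, Iterable


def hub_sha(repo_id: str) -> str:
    """Return the commit hash at the head of a Hub model repo's default branch."""
    # Imported lazily so the rest of this sketch works without the package
    # or network access when a different lookup is supplied.
    from huggingface_hub import HfApi

    return HfApi().model_info(repo_id).sha


def latest_commits(repo_ids: Iterable[str],
                   get_sha: Callable[[str], str] = hub_sha) -> Dict[str, str]:
    """Map each repo id to the commit hash reported by `get_sha`."""
    return {repo_id: get_sha(repo_id) for repo_id in repo_ids}
```

Calling `latest_commits(["openai/whisper-tiny", "openai/whisper-base"])` would query the Hub once per repository; the result reflects whatever the repositories contain at call time, so it may differ from the table above.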

### Proposed Solution

I propose a two-part solution that combines developer best practices with a new metadata convention:

1. To ensure reproducibility, app developers should explicitly pin model versions in their code. When calling `from_pretrained()`, the `revision` parameter should always be set to a specific commit hash. This applies whenever an HF model is used, regardless of whether the app is a simple wrapper or uses the HF model as a "backbone". For example:

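```python
from clams import ClamsApp
# Any `from_pretrained` entry point accepts `revision`; this class is one example.
from transformers import WhisperForConditionalGeneration

# Commit hash pinned for each supported model in the family.
model_pins = {
    "openai/whisper-large-v3": "06f233fe06e710322aca913c1bc4249a0d71fce1",
    "openai/whisper-large-v3-turbo": "41f01f3fe87f28c78e2fbf8b568835947dd65ed9",
    # ... pins for the other supported models
}

class SomeClamsWhisperApp(ClamsApp):
    ...

    def _annotate(self, mmif, params):
        model_name = params.get("model_name")
        # Load the exact pinned commit instead of the branch head.
        whisper = WhisperForConditionalGeneration.from_pretrained(
            model_name, revision=model_pins[model_name])
```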
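
Because `revision` also accepts branch and tag names, a pin table such as `model_pins` can silently stop being reproducible if someone enters `"main"`. A minimal sketch of a startup check follows; `check_pins` and `FULL_SHA` are hypothetical names, not part of any CLAMS or Hugging Face API.

```python
import re

# A full git commit hash: exactly 40 lowercase hexadecimal characters.
FULL_SHA = re.compile(r"[0-9a-f]{40}")


def check_pins(model_pins, supported_models):
    """Raise ValueError if a supported model lacks a pin or a pin is not a full hash."""
    missing = sorted(set(supported_models) - set(model_pins))
    if missing:
        raise ValueError(f"no pinned revision for: {missing}")
    bad = sorted(name for name, rev in model_pins.items()
                 if not FULL_SHA.fullmatch(rev))
    if bad:
        raise ValueError(f"revision is not a full commit hash for: {bad}")
```

Running this once when the app starts, with the list of model names the app accepts, would surface an unpinned or loosely pinned model before any annotation request is served.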

2. Then, in the metadata of apps that wrap a family of models, developers should use a date-based version string with a special prefix to indicate that the app uses the latest model versions available *as of that date*. For example: `"analyzer_version": "hf-250619"`. This convention indicates:
   * The use of Hugging Face models (via the `hf-` prefix).
   * A "snapshot" date (`YYMMDD`) that establishes a reproducible baseline for all models in the family.
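
To keep the string format consistent across apps, the convention could be generated and parsed by small helpers. This is a sketch under the assumption that the format is exactly `hf-` followed by `YYMMDD`; `make_analyzer_version` and `parse_analyzer_version` are hypothetical names, not existing CLAMS functions.

```python
import re
from datetime import date, datetime

# "hf-" prefix followed by a six-digit YYMMDD snapshot date.
_HF_VERSION = re.compile(r"hf-(\d{6})")


def make_analyzer_version(snapshot: date) -> str:
    """Format a snapshot date as an `hf-YYMMDD` version string."""
    return "hf-" + snapshot.strftime("%y%m%d")


def parse_analyzer_version(version: str) -> date:
    """Recover the snapshot date from an `hf-YYMMDD` version string."""
    match = _HF_VERSION.fullmatch(version)
    if match is None:
        raise ValueError(f"not an hf-YYMMDD version string: {version!r}")
    return datetime.strptime(match.group(1), "%y%m%d").date()
```

For instance, `make_analyzer_version(date(2025, 6, 19))` yields `"hf-250619"`, and parsing that string returns the same date.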

