
analyzer_version when using HF model family #251

@keighrim

When a CLAMS app wraps a single off-the-shelf tool, we use the analyzer_version field in the app metadata to specify the version of that component. This works well when the app wraps one individual library, app, or model.

However, this approach is insufficient for apps designed to wrap entire families of models from the Hugging Face Hub. Models on the hub are version-controlled with git, so the most precise version identifier is a commit hash. A single CLAMS app, though, can wrap multiple models from a family like openai/whisper-*. Each model in such a family (whisper-large-v3, whisper-small, etc.) resides in its own git repository and has an independent commit history. Consequently, no single version string or commit hash can accurately represent the collective "family".

For reference, the openai/whisper family, as of this writing, consists of the following 12 models, each with its own repository and latest commit:

| Model Name | Last Update Date | Commit |
| --- | --- | --- |
| openai/whisper-large-v3-turbo | Oct 4, 2024 | 41f01f3fe87f28c78e2fbf8b568835947dd65ed9 |
| openai/whisper-large-v3 | Aug 12, 2024 | 06f233fe06e710322aca913c1bc4249a0d71fce1 |
| openai/whisper-large-v2 | Feb 29, 2024 | ae4642769ce2ad8fc292556ccea8e901f1530655 |
| openai/whisper-large | Feb 29, 2024 | 4ef9b41f0d4fe232daafdb5f76bb1dd8b23e01d7 |
| openai/whisper-medium | Feb 29, 2024 | abdf7c39ab9d0397620ccaea8974cc764cd0953e |
| openai/whisper-small | Feb 29, 2024 | 973afd24965f72e36ca33b3055d56a652f456b4d |
| openai/whisper-tiny | Feb 29, 2024 | 169d4a4341b33bc18d8881c4b69c2e104e1cc0af |
| openai/whisper-base | Feb 29, 2024 | e37978b90ca9030d5170a5c07aadb050351a65bb |
| openai/whisper-medium.en | Jan 22, 2024 | 2e98eb6279edf5095af0c8dedb36bdec0acd172b |
| openai/whisper-small.en | Jan 22, 2024 | e8727524f962ee844a7319d92be39ac1bd25655a |
| openai/whisper-tiny.en | Jan 22, 2024 | 87c7102498dcde7456f24cfd30239ca606ed9063 |
| openai/whisper-base.en | Jan 22, 2024 | 911407f4214e0e1d82085af863093ec0b66f9cd6 |

Proposed Solution

I propose a two-part solution that combines developer best practices with a new metadata convention:

  1. To ensure reproducibility, app developers should explicitly pin model versions in their code: when calling from_pretrained(), always set the revision parameter to a specific commit hash. This applies whenever an HF model is used, whether the app is a simple wrapper or uses the HF model as a "backbone". For example:
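from clams import ClamsApp
from transformers import AutoModelForSpeechSeq2Seq  # one possible loader class

# Map each supported model to the exact commit the app was built against.
model_pins = {
    "openai/whisper-large-v3": "06f233fe06e710322aca913c1bc4249a0d71fce1",
    "openai/whisper-large-v3-turbo": "41f01f3fe87f28c78e2fbf8b568835947dd65ed9",
    # ... pins for the other models in the family
}

class SomeClamsWhisperApp(ClamsApp):
    ...

    def _annotate(self, mmif, params):
        model_name = params.get("model_name")
        # revision pins the download to one commit instead of the branch head
        whisper = AutoModelForSpeechSeq2Seq.from_pretrained(
            model_name, revision=model_pins[model_name])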
  2. Then, in the metadata of an app that wraps a family of models, developers should use a date-based version string with a special marker indicating that the app uses the latest model versions available as of that date, for example "analyzer_version": "hf-250619". This convention indicates:
    • The use of Hugging Face models (via the hf- prefix).
    • A "snapshot" date (YYMMDD) that establishes a reproducible baseline for all models in the family.
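
The two parts of the proposal could be tied together roughly as follows. This is only a sketch, not an agreed implementation: the helper names (format_analyzer_version, parse_analyzer_version, pin_as_of) are hypothetical, and the commit history is passed in as plain (hash, date) pairs. Fetching that history from the hub is left out; it could presumably come from the hub client (for example huggingface_hub's HfApi.list_repo_commits, assuming it is available in the installed version).

```python
from datetime import date, datetime

HF_PREFIX = "hf-"  # prefix from the proposed convention


def format_analyzer_version(snapshot: date) -> str:
    """Build an 'hf-YYMMDD' string for the given snapshot date."""
    return HF_PREFIX + snapshot.strftime("%y%m%d")


def parse_analyzer_version(version: str) -> date:
    """Recover the snapshot date from an 'hf-YYMMDD' string."""
    if not version.startswith(HF_PREFIX):
        raise ValueError(f"not an HF snapshot version: {version!r}")
    return datetime.strptime(version[len(HF_PREFIX):], "%y%m%d").date()


def pin_as_of(commits, snapshot: date) -> str:
    """Return the hash of the newest commit made on or before `snapshot`.

    `commits` is an iterable of (commit_hash, commit_date) pairs.
    Several commits on the same day are not disambiguated here;
    full timestamps would be needed for that.
    """
    eligible = [(h, d) for h, d in commits if d <= snapshot]
    if not eligible:
        raise LookupError("no commit on or before the snapshot date")
    return max(eligible, key=lambda pair: pair[1])[0]


if __name__ == "__main__":
    snapshot = parse_analyzer_version("hf-250619")
    # The first entry is a made-up placeholder; the second is taken
    # from the table above (openai/whisper-large-v3, Aug 12, 2024).
    history = [
        ("<hypothetical-older-commit>", date(2024, 1, 1)),
        ("06f233fe06e710322aca913c1bc4249a0d71fce1", date(2024, 8, 12)),
    ]
    print(format_analyzer_version(snapshot), pin_as_of(history, snapshot))
```

A check like pin_as_of could run in a test to confirm that every hash in model_pins is the newest commit as of the date advertised in analyzer_version, so that the metadata and the pinned code do not drift apart.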
