
Make sure inline links are fully qualified URLs during scrape process #12

@MasterOdin

When scraping documentation pages from the web, we should make sure that any links are converted to their fully qualified form, e.g. going from something like:

[migrate your entire database at once](/self-hosted/latest/migration/entire-database/)

to

[migrate your entire database at once](https://docs.tigerdata.com/self-hosted/latest/migration/entire-database/)

Right now the LLM likes to quote the returned markdown chunks, and relative links like the former end up showing as weird broken text, whereas the latter would not. While we could maybe fix this via prompting as well, I think it's better to just eat the extra tokens in the embeddings and make the chunks easier for the LLMs to use.

It'll probably be easier/better, though, to do this manipulation against the HTML source rather than after we convert it to markdown.
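
To make the HTML-side approach concrete, here is a minimal sketch using only the Python standard library. It is not the project's actual scraper code: `LinkAbsolutizer` and `absolutize_links` are hypothetical names, and it assumes the scraper knows the URL of the page it fetched so that URL can serve as the base. It only touches `href` on `<a>` tags and re-emits everything else, so the output may differ from the input in incidental ways (attribute quoting, for example).

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAbsolutizer(HTMLParser):
    """Hypothetical helper: re-emit HTML with <a href> values made absolute."""

    def __init__(self, base_url: str) -> None:
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.out: list[str] = []

    def _render_tag(self, tag, attrs, self_closing=False):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # bare attribute, e.g. <input disabled>
                parts.append(name)
                continue
            if tag == "a" and name == "href":
                # urljoin leaves already-absolute URLs unchanged.
                value = urljoin(self.base_url, value)
            # The parser unescapes attribute values, so escape them again.
            parts.append(f'{name}="{escape(value, quote=True)}"')
        return "<" + " ".join(parts) + (" />" if self_closing else ">")

    def handle_starttag(self, tag, attrs):
        self.out.append(self._render_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._render_tag(tag, attrs, self_closing=True))

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def absolutize_links(html: str, page_url: str) -> str:
    """Return html with every <a href> resolved against page_url."""
    parser = LinkAbsolutizer(page_url)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Run against the example link above, with the page URL (an assumed value here) as the base:

```python
html = '<a href="/self-hosted/latest/migration/entire-database/">migrate your entire database at once</a>'
print(absolutize_links(html, "https://docs.tigerdata.com/self-hosted/latest/"))
```

Doing the rewrite at this stage means the HTML-to-markdown converter only ever sees absolute URLs, so no markdown link syntax has to be parsed afterwards.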
