25 changes: 25 additions & 0 deletions .github/styles/README.md
@@ -0,0 +1,25 @@
# Vale Styles

This directory contains Vale linting configuration for ipfs-docs.

## Spelling Rules

There are two spelling systems:

1. **`Vocab/ipfs-docs-vocab/accept.txt`** - General Vale vocabulary
2. **`pln-ignore.txt`** - Custom ignore file for `docs/PLNSpelling.yml`

### Fixing PLNSpelling Errors

When CI fails with `[docs.PLNSpelling] Did you really mean 'word'?`:

1. Add the word to **`pln-ignore.txt`** (lowercase)
2. Do NOT add to `Vocab/ipfs-docs-vocab/accept.txt` - that file is for other Vale rules

The `PLNSpelling.yml` rule explicitly references `pln-ignore.txt`:

```yaml
extends: spelling
ignore:
- pln-ignore.txt
```
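As a rough sketch of step 1 above, the helper below lowercases a word and adds it to an in-memory copy of `pln-ignore.txt`. The function name `add_ignored_word` is hypothetical, and keeping the list in case-insensitive alphabetical order is an assumption based on the order of the existing entries, not a documented requirement.

```python
def add_ignored_word(lines, word):
    """Return a copy of the ignore list with `word` added in lowercase."""
    entry = word.strip().lower()
    if entry in lines:
        return list(lines)  # already present, nothing to add
    # Assumption: entries are kept in case-insensitive alphabetical order.
    return sorted([*lines, entry], key=str.lower)


existing = ["boxo", "browserify", "caddy"]
print(add_ignored_word(existing, "Buzhash"))
```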
4 changes: 4 additions & 0 deletions .github/styles/pln-ignore.txt
@@ -27,6 +27,7 @@ boolean
Bootstrappers
boxo
browserify
buzhash
caddy
Caddyfile
callout
@@ -41,6 +42,7 @@ clis
cmds
cnames
codec
codecs
codecov
coinlist
composable
@@ -56,6 +58,7 @@ dapps
data('s)
datastore
deduplicate
deduplication
Denylist
denylist
dep
@@ -200,6 +203,7 @@ philz
pinset
pipeable
plaintext
PLNSpelling
pluggable
powergate
powershell
41 changes: 41 additions & 0 deletions docs/concepts/content-addressing.md
@@ -87,6 +87,47 @@ shasum: WARNING: 1 computed checksum did NOT match

As we can see, the hash included in the CID does NOT match the hash of the input file `ubuntu-20.04.1-desktop-amd64.iso`.

### Why the hashes differ
Member
@mishmosh @aschmahmann does this extra explainer sound good?

The main point I want to make here is that this is a feature, not a limitation. And at the end we mention community-provided profiles for cases where a specific preset is required.

Collaborator Author

Yes, thanks for the add.


The example above shows that the [Multihash](glossary.md#multihash) inside a CID does not match a simple file checksum. This is because the Multihash is the hash of the [root block](glossary.md#root), not a direct hash of the file's bytes.

When you add a file to IPFS, the data goes through several transformations:

1. **Chunking**: Large files are split into smaller [blocks](glossary.md#block) (typically 256KiB-1MiB each)
2. **Structuring**: These blocks are organized into a [DAG](glossary.md#dag) (directed acyclic graph)
3. **Encoding**: A [codec](glossary.md#codec) wraps the data with metadata describing its structure

The root block contains links to all the other blocks, and it's this root block that gets hashed to produce the Multihash in your CID.
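The steps above can be sketched with a deliberately simplified model. This is not real UnixFS or dag-pb encoding: the function name `toy_root_digest` is hypothetical, the "root block" is just the concatenated chunk digests, and the chunk sizes are far smaller than the real defaults. It only illustrates why hashing a root block gives a different result from hashing the file's bytes directly.

```python
import hashlib


def toy_root_digest(data: bytes, chunk_size: int) -> bytes:
    """Hash each fixed-size chunk, then hash a stand-in 'root block' of those hashes."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    leaf_digests = [hashlib.sha256(chunk).digest() for chunk in chunks]
    # Assumption: a real root block holds encoded links plus metadata;
    # here it is simply the leaf digests joined together.
    root_block = b"".join(leaf_digests)
    return hashlib.sha256(root_block).digest()


data = b"hello ipfs " * 1000
file_digest = hashlib.sha256(data).digest()   # what a plain checksum tool reports
root_small = toy_root_digest(data, 256)       # root digest with 256-byte chunks
root_large = toy_root_digest(data, 1024)      # root digest with 1024-byte chunks

print(file_digest.hex())
print(root_small.hex())
print(root_large.hex())
```

All three digests differ: the two root digests differ from the file checksum because they hash a structure rather than the file itself, and they differ from each other because the chunk size changes that structure.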

#### When CID hash equals file hash

There is one case where the Multihash does equal the file's hash: when the CID uses the `raw` [codec](glossary.md#codec) and the file fits in a single block. The `raw` codec stores bytes without any wrapper, so for small files added with `--raw-leaves`, the digest inside the Multihash is a direct hash of the file contents, computed with the hash function named in the CID.
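A minimal sketch of this single-block case, assuming CIDv1 with the `raw` codec (multicodec `0x55`), sha2-256 (multihash code `0x12`), and base32 multibase encoding. The function name `raw_cidv1` is hypothetical, and the sketch only applies to data that fits in one block.

```python
import base64
import hashlib

RAW_CODEC = 0x55   # multicodec code for raw bytes
SHA2_256 = 0x12    # multihash code for sha2-256


def raw_cidv1(data: bytes) -> str:
    """Build a CIDv1 string for a single raw block hashed with sha2-256."""
    digest = hashlib.sha256(data).digest()
    multihash = bytes([SHA2_256, len(digest)]) + digest
    cid_bytes = bytes([0x01, RAW_CODEC]) + multihash  # version, codec, multihash
    # "b" is the multibase prefix for lowercase base32 without padding.
    return "b" + base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")


data = b"hello world\n"
cid = raw_cidv1(data)
print(cid)
print(hashlib.sha256(data).hexdigest())  # same digest that is embedded in the CID
```

Because nothing wraps the bytes, the last 32 bytes of the decoded CID are exactly the sha2-256 checksum of the file.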

#### Same file, different CIDs

Two identical files can produce different CIDs. The CID depends on both the content *and* how that content is structured:

- **Chunk size**: Different chunking strategies produce different block trees
- **DAG layout**: Balanced trees vs. trickle DAGs organize blocks differently
- **Codec**: [UnixFS](glossary.md#unixfs) ([dag-pb](glossary.md#dag-pb)), [dag-cbor](glossary.md#dag-cbor), `raw`, and others each encode data differently
- **CID version**: [CIDv0](glossary.md#cid-v0) vs [CIDv1](glossary.md#cid-v1) use different formats
- **Hash algorithm**: sha2-256, blake3, and others produce different hashes
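To make the CID version item concrete, the sketch below renders one and the same multihash as a CIDv0 string and as a CIDv1 string. It assumes sha2-256 and the dag-pb codec (multicodec `0x70`). The bytes being hashed are only a stand-in, not a valid dag-pb block, and the helper names `base58btc` and `cid_strings` are hypothetical.

```python
import base64
import hashlib

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58btc(data: bytes) -> str:
    """Encode bytes with the Bitcoin base58 alphabet used by CIDv0."""
    number = int.from_bytes(data, "big")
    encoded = ""
    while number > 0:
        number, remainder = divmod(number, 58)
        encoded = B58_ALPHABET[remainder] + encoded
    # Each leading zero byte is represented by a leading "1".
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + encoded


def cid_strings(block: bytes):
    """Return (CIDv0, CIDv1) strings that carry the same sha2-256 multihash."""
    digest = hashlib.sha256(block).digest()
    multihash = bytes([0x12, len(digest)]) + digest
    cid_v0 = base58btc(multihash)                  # CIDv0 is the bare multihash
    cid_v1_bytes = bytes([0x01, 0x70]) + multihash  # version 1, dag-pb codec
    cid_v1 = "b" + base64.b32encode(cid_v1_bytes).decode("ascii").lower().rstrip("=")
    return cid_v0, cid_v1


v0, v1 = cid_strings(b"stand-in for an encoded dag-pb root block")
print(v0)
print(v1)
```

The two strings look unrelated (`Qm...` versus `bafybei...`) even though they point at the same block with the same hash. Changing the chunk size, DAG layout, codec, or hash algorithm goes further and changes the multihash itself.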

#### Why this flexibility matters

This is a feature, not a limitation. Different structures optimize for different use cases:

- **DAG layout** trades off seeking against appending: balanced DAGs enable fast random access in large files like videos, while trickle DAGs optimize for sequential, append-only data like logs
- **Chunking strategy** balances retrieval overhead against sync efficiency: large chunks mean fewer blocks for bulk downloads, while small chunks mean less data to transfer when syncing deltas. Strategies range from simple fixed-size chunking to content-defined algorithms like Rabin or Buzhash that fine-tune deduplication based on dataset characteristics
- **Hash function** varies by system: legacy decisions, regulatory requirements, or interoperability needs may dictate which algorithm to use
- **Directory sharding** threshold, in systems like [UnixFS](glossary.md#unixfs), determines when directories switch from flat listings to [HAMT](glossary.md#hamt-sharding) to seamlessly support huge directories with millions of files. This threshold also affects how much of the DAG needs to be recreated when a single file in the directory is modified

[UnixFS](glossary.md#unixfs) is the default format for files and directories, but you can use other codecs or create custom ones for specialized needs.

When you need reproducible CIDs across different tools, the community documents common parameter sets called [CID profiles](https://github.com/ipfs/specs/pull/499). These define standard combinations of chunking, DAG layout, and codec settings.

To explore how a CID is structured, use the [CID Inspector](https://cid.ipfs.tech/#bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4). To see the DAG behind a CID, use the [DAG Explorer](https://explore.ipld.io/#/explore/bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4).


## CID versions

8 changes: 4 additions & 4 deletions docs/concepts/how-ipfs-works.md
@@ -45,13 +45,13 @@ IPFS represents data as content-addressed <VueCustomTooltip label="The term for

In IPFS, data is chunked into <VueCustomTooltip label="The term for a single unit of data in IPFS." underlined multiline is-medium>blocks</VueCustomTooltip>, which are assigned a unique identifier called a <VueCustomTooltip label="An address used to point to data in IPFS, based on the content itself, as opposed to the location." underlined multiline is-medium>Content Identifier (CID)</VueCustomTooltip>. In general, the CID is computed by combining the hash of the data with its <VueCustomTooltip label="Software capable of encoding and/or decoding data." underlined multiline is-medium>codec</VueCustomTooltip>. The codec is generated using <VueCustomTooltip label="A collection of interoperable, extensible protocols for making data self-describable." underlined multiline is-medium>Multiformats</VueCustomTooltip>.

CIDs are unique to the data from which they were computed, which provides IPFS with the following benefits:
- Data can be fetched based on its content, rather than its location.
- The CID of the data received can be computed and compared to the CID requested, to verify that the data is what was requested.
Because CIDs are based on content, not location:
- You can fetch data by *what it is*, not where it's stored.
- You can verify data by recomputing the CID and comparing it to what you requested.
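The verification step can be sketched for the simplest case: one block, CIDv1, `raw` codec, sha2-256, base32 encoding. The function names `encode_raw_cid` and `verify_raw_block` are hypothetical. Content made of many blocks would be verified block by block while walking the DAG, which this sketch does not attempt.

```python
import base64
import hashlib


def encode_raw_cid(data: bytes) -> str:
    """Build a CIDv1 (raw codec, sha2-256) string for a single block of bytes."""
    digest = hashlib.sha256(data).digest()
    cid_bytes = bytes([0x01, 0x55, 0x12, len(digest)]) + digest
    return "b" + base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")


def verify_raw_block(cid: str, data: bytes) -> bool:
    """Recompute the digest of `data` and compare it with the one inside `cid`."""
    if not cid.startswith("b"):
        raise ValueError("only base32 CIDv1 strings are handled in this sketch")
    body = cid[1:].upper()
    body += "=" * (-len(body) % 8)  # restore the padding that multibase strips
    cid_bytes = base64.b32decode(body)
    if tuple(cid_bytes[:4]) != (0x01, 0x55, 0x12, 0x20):
        raise ValueError("only CIDv1 + raw + sha2-256 is handled in this sketch")
    return cid_bytes[4:] == hashlib.sha256(data).digest()


requested = encode_raw_cid(b"hello")
print(verify_raw_block(requested, b"hello"))     # True: data matches the CID
print(verify_raw_block(requested, b"tampered"))  # False: data was altered
```

The check needs nothing but the CID and the received bytes, so it does not matter which peer or gateway served the data.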

:::callout
**Learn more**
Learn more about the concepts behind CIDs described here with the [the CID deep dive](../concepts/content-addressing.md#cid-versions).
Learn more about CIDs in the [CID deep dive](../concepts/content-addressing.md#cid-versions).
:::


16 changes: 9 additions & 7 deletions docs/quickstart/pin-cli.md
@@ -122,22 +122,22 @@ Each method will return a **CID** (Content Identifier) for your uploaded file. S

## CIDs explained

In IPFS, every file and directory is identified with a Content Identifier ([CID](../concepts/content-addressing.md)). The CID serves as the **permanent address** of the file and can be used by anyone to find it on the IPFS network.
In IPFS, every file and directory is identified with a Content Identifier ([CID](../concepts/content-addressing.md)), a unique hash derived from the file's contents. The CID serves as the **permanent address** of the file and can be used by anyone to find it on any IPFS network or system.

When a file is first added to an IPFS node (like the image used in this guide), it's first transformed into a content-addressable representation in which the file is split into smaller chunks (if above ~1MB) which are linked together and hashed to produce the CID.
When you add a file to IPFS, the system generates its CID by hashing the contents. Larger files (above ~1MB) are split into smaller chunks, linked together, and hashed.

For example, a CID might look like:
The resulting CID might look like this:

```plaintext
bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4
```

You can now share the CID with anyone and they can fetch the file using IPFS.
Once you have a CID, you can share it with anyone and they can fetch the file using IPFS.

To dive deeper into the anatomy of the CID, check out the [CID inspector](https://cid.ipfs.tech/#bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4).
To explore the anatomy of a CID, check out the [CID Inspector](https://cid.ipfs.tech/#bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4). To explore the anatomy of the DAG behind a CID, check out the [DAG Explorer](https://explore.ipld.io/#/explore/bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4).

:::callout
The transformation into a content-addressable representation is a local operation that doesn't require any network connectivity. Many CLI tools perform this transformation locally before uploading.
**Important caveat:** Two identical files can produce different CIDs. The CID reflects the contents *and* how the file is processed: chunk size, DAG layout, hash algorithm, CID version, and other [UnixFS](https://specs.ipfs.tech/unixfs/) parameters. The same file processed with different parameters will produce different CIDs. See [CIDs are not file hashes](../concepts/content-addressing.md#cids-are-not-file-hashes) for details.
Member
ℹ️ we already had "CIDs are not file hashes" content elsewhere, so adjusted this to link there + expanded there

:::

## Retrieving with a gateway
@@ -167,6 +167,8 @@ curl https://[BUCKET_NAME].ipfs.filebase.io/ipfs/[CID]

### Using public gateways

You can also use [public IPFS gateways](../concepts/public-utilities.md#public-ipfs-gateways):

```shell
curl https://ipfs.io/ipfs/[CID]
# or
@@ -192,4 +194,4 @@ Possible next steps include:
- [Storacha CLI documentation](https://docs.storacha.network/cli/)
- [Pinata API documentation](https://docs.pinata.cloud/)
- [Filebase S3 API guide](https://docs.filebase.com/api-documentation/s3-compatible-api)
- [Filebase IPFS RPC API](https://docs.filebase.com/api-documentation/ipfs-rpc-api)
- [Filebase IPFS RPC API](https://docs.filebase.com/api-documentation/ipfs-rpc-api)