diff --git a/.github/styles/README.md b/.github/styles/README.md
new file mode 100644
index 000000000..3c72b9212
--- /dev/null
+++ b/.github/styles/README.md
@@ -0,0 +1,25 @@
+# Vale Styles
+
+This directory contains the Vale linting configuration for ipfs-docs.
+
+## Spelling Rules
+
+There are two spelling systems:
+
+1. **`Vocab/ipfs-docs-vocab/accept.txt`** - General Vale vocabulary
+2. **`pln-ignore.txt`** - Custom ignore file for `docs/PLNSpelling.yml`
+
+### Fixing PLNSpelling Errors
+
+When CI fails with `[docs.PLNSpelling] Did you really mean 'word'?`:
+
+1. Add the word to **`pln-ignore.txt`** (lowercase)
+2. Do NOT add it to `Vocab/ipfs-docs-vocab/accept.txt` - that file is for other Vale rules
+
+The `PLNSpelling.yml` rule explicitly references `pln-ignore.txt`:
+
+```yaml
+extends: spelling
+ignore:
+  - pln-ignore.txt
+```
diff --git a/.github/styles/pln-ignore.txt b/.github/styles/pln-ignore.txt
index 981301004..34cedd32a 100644
--- a/.github/styles/pln-ignore.txt
+++ b/.github/styles/pln-ignore.txt
@@ -27,6 +27,7 @@ boolean
 Bootstrappers
 boxo
 browserify
+buzhash
 caddy
 Caddyfile
 callout
@@ -41,6 +42,7 @@ clis
 cmds
 cnames
 codec
+codecs
 codecov
 coinlist
 composable
@@ -56,6 +58,7 @@ dapps
 data('s)
 datastore
 deduplicate
+deduplication
 Denylist
 denylist
 dep
@@ -200,6 +203,7 @@ philz
 pinset
 pipeable
 plaintext
+PLNSpelling
 pluggable
 powergate
 powershell
diff --git a/docs/concepts/content-addressing.md b/docs/concepts/content-addressing.md
index 3edeea661..f3f23ce4e 100644
--- a/docs/concepts/content-addressing.md
+++ b/docs/concepts/content-addressing.md
@@ -87,6 +87,47 @@ shasum: WARNING: 1 computed checksum did NOT match
 
 As we can see, the hash included in the CID does NOT match the hash of the input file `ubuntu-20.04.1-desktop-amd64.iso`.
 
+### Why the hashes differ
+
+The example above shows that the [Multihash](glossary.md#multihash) inside a CID does not match a simple file checksum. This is because the Multihash is the hash of the [root block](glossary.md#root), not a direct hash of the file's bytes.
+
+When you add a file to IPFS, the data goes through several transformations:
+
+1. **Chunking**: Large files are split into smaller [blocks](glossary.md#block) (typically 256 KiB to 1 MiB each)
+2. **Structuring**: These blocks are organized into a [DAG](glossary.md#dag) (directed acyclic graph)
+3. **Encoding**: A [codec](glossary.md#codec) wraps the data with metadata describing its structure
+
+The root block contains links to all the other blocks, and it is this root block that gets hashed to produce the Multihash in your CID.
+
+#### When CID hash equals file hash
+
+There is one case where the Multihash does equal the file's hash: when the CID uses the `raw` [codec](glossary.md#codec) and the file fits in a single block. The `raw` codec stores bytes without any wrapper, so for small files added with `--raw-leaves`, the Multihash is a direct hash of the file contents.
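+
+As a quick sanity check, you can verify this yourself. The sketch below assumes a recent Kubo CLI: `ipfs add -n` (`--only-hash`) computes the CID without storing anything, and the `%D` specifier of `ipfs cid format` prints the CID's hash digest as hex:
+
+```shell
+# A file small enough to fit in a single block
+echo "hello ipfs" > hello.txt
+
+# Compute a CIDv1 where the root (and only) block uses the raw codec
+CID=$(ipfs add -nq --cid-version=1 --raw-leaves hello.txt)
+
+# Print the hash digest inside the CID as hex...
+ipfs cid format -f '%D' "$CID"
+
+# ...and compare it to a plain sha2-256 of the file: the digests match
+sha256sum hello.txt
+```
+
+Repeat the same commands without `--raw-leaves` and the digests no longer match, because the hashed bytes now include the [UnixFS](glossary.md#unixfs) wrapper.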
+
+#### Same file, different CIDs
+
+Two identical files can produce different CIDs. The CID depends on both the content *and* how that content is structured:
+
+- **Chunk size**: Different chunking strategies produce different block trees
+- **DAG layout**: Balanced trees vs. trickle DAGs organize blocks differently
+- **Codec**: [UnixFS](glossary.md#unixfs) ([dag-pb](glossary.md#dag-pb)), [dag-cbor](glossary.md#dag-cbor), `raw`, and others each encode data differently
+- **CID version**: [CIDv0](glossary.md#cid-v0) and [CIDv1](glossary.md#cid-v1) use different formats
+- **Hash algorithm**: sha2-256, blake3, and others produce different hashes
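+
+One way to see this in action is with Kubo's `ipfs add` (flag names may differ in other implementations; `-n`, i.e. `--only-hash`, computes the CID without storing anything). Each variation below prints a different CID for the same file:
+
+```shell
+# Default settings: CIDv0, 256 KiB fixed-size chunks, balanced DAG, sha2-256
+ipfs add -nq ubuntu-20.04.1-desktop-amd64.iso
+
+# Smaller fixed-size chunks
+ipfs add -nq --chunker=size-65536 ubuntu-20.04.1-desktop-amd64.iso
+
+# Content-defined chunking
+ipfs add -nq --chunker=buzhash ubuntu-20.04.1-desktop-amd64.iso
+
+# Trickle DAG layout instead of balanced
+ipfs add -nq --trickle ubuntu-20.04.1-desktop-amd64.iso
+
+# CIDv1, raw leaves, and a different hash function
+ipfs add -nq --cid-version=1 --raw-leaves --hash=blake2b-256 ubuntu-20.04.1-desktop-amd64.iso
+```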
+
+#### Why this flexibility matters
+
+This is a feature, not a limitation. Different structures optimize for different use cases:
+
+- **DAG layout** trades off seeking against appending: balanced DAGs enable fast random access in large files like videos, while trickle DAGs optimize for sequential, append-only data like logs
+- **Chunking strategy** balances retrieval overhead against sync efficiency: large chunks mean fewer blocks for bulk downloads, while small chunks mean less data to transfer when syncing deltas. Strategies range from simple fixed-size chunking to content-defined algorithms like Rabin or Buzhash that fine-tune deduplication based on dataset characteristics
+- **Hash function** varies by system: legacy decisions, regulatory requirements, or interoperability needs may dictate which algorithm to use
+- **Directory sharding threshold**, in systems like [UnixFS](glossary.md#unixfs), determines when directories switch from flat listings to [HAMT sharding](glossary.md#hamt-sharding), which seamlessly supports huge directories with millions of files. This threshold also affects how much of the DAG must be recreated when a single file in the directory is modified
+
+[UnixFS](glossary.md#unixfs) is the default format for files and directories, but you can use other codecs or create custom ones for specialized needs.
+
+When you need reproducible CIDs across different tools, the community documents common parameter sets called [CID profiles](https://github.com/ipfs/specs/pull/499). These define standard combinations of chunking, DAG layout, and codec settings.
+
+To explore how a CID is structured, use the [CID Inspector](https://cid.ipfs.tech/#bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4). To see the DAG behind a CID, use the [DAG Explorer](https://explore.ipld.io/#/explore/bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4).
+
 ## CID versions
diff --git a/docs/concepts/how-ipfs-works.md b/docs/concepts/how-ipfs-works.md
index ebf1ce267..3edb5d0fb 100644
--- a/docs/concepts/how-ipfs-works.md
+++ b/docs/concepts/how-ipfs-works.md
@@ -45,13 +45,13 @@ IPFS represents data as content-addressed blocks, which are assigned a unique
 identifier called a Content Identifier (CID). In general, the CID is computed
 by combining the hash of the data with its codec. The codec is generated using
 Multiformats.
 
-CIDs are unique to the data from which they were computed, which provides IPFS with the following benefits:
-- Data can be fetched based on its content, rather than its location.
-- The CID of the data received can be computed and compared to the CID requested, to verify that the data is what was requested.
+Because CIDs are based on content, not location:
+- You can fetch data by *what it is*, not where it's stored.
+- You can verify data by recomputing the CID and comparing it to what you requested.
 
 :::callout
 **Learn more**
-Learn more about the concepts behind CIDs described here with the [the CID deep dive](../concepts/content-addressing.md#cid-versions).
+Learn more about CIDs in the [CID deep dive](../concepts/content-addressing.md#cid-versions).
 :::
diff --git a/docs/quickstart/pin-cli.md b/docs/quickstart/pin-cli.md
index cc2f8aa5b..7e6490429 100644
--- a/docs/quickstart/pin-cli.md
+++ b/docs/quickstart/pin-cli.md
@@ -122,22 +122,22 @@ Each method will return a **CID** (Content Identifier) for your uploaded file. S
 
 ## CIDs explained
 
-In IPFS, every file and directory is identified with a Content Identifier ([CID](../concepts/content-addressing.md)). The CID serves as the **permanent address** of the file and can be used by anyone to find it on the IPFS network.
+In IPFS, every file and directory is identified by a Content Identifier ([CID](../concepts/content-addressing.md)), a unique hash derived from the file's contents. The CID serves as the **permanent address** of the file and can be used by anyone to find it on any IPFS network or system.
 
-When a file is first added to an IPFS node (like the image used in this guide), it's first transformed into a content-addressable representation in which the file is split into smaller chunks (if above ~1MB) which are linked together and hashed to produce the CID.
+When you add a file to IPFS, the system generates its CID by hashing the contents. Larger files (above ~1MB) are split into smaller chunks, linked together, and hashed.
 
-For example, a CID might look like:
+The resulting CID might look like this:
 
 ```plaintext
 bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4
 ```
 
-You can now share the CID with anyone and they can fetch the file using IPFS.
+Once you have a CID, you can share it with anyone, and they can fetch the file using IPFS.
 
-To dive deeper into the anatomy of the CID, check out the [CID inspector](https://cid.ipfs.tech/#bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4).
+To explore the anatomy of a CID, check out the [CID Inspector](https://cid.ipfs.tech/#bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4), and to see the DAG behind it, check out the [DAG Explorer](https://explore.ipld.io/#/explore/bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4).
 
 :::callout
-The transformation into a content-addressable representation is a local operation that doesn't require any network connectivity. Many CLI tools perform this transformation locally before uploading.
+**Important caveat:** Two identical files can produce different CIDs if they are processed with different parameters: chunk size, DAG layout, hash algorithm, CID version, and other [UnixFS](https://specs.ipfs.tech/unixfs/) settings all feed into the resulting CID. See [Why the hashes differ](../concepts/content-addressing.md#why-the-hashes-differ) for details.
 :::
 
 ## Retrieving with a gateway
@@ -167,6 +167,8 @@ curl https://[BUCKET_NAME].ipfs.filebase.io/ipfs/[CID]
 
 ### Using public gateways
 
+You can also use [public IPFS gateways](../concepts/public-utilities.md#public-ipfs-gateways):
+
 ```shell
 curl https://ipfs.io/ipfs/[CID]
 # or
@@ -192,4 +194,4 @@ Possible next steps include:
 - [Storacha CLI documentation](https://docs.storacha.network/cli/)
 - [Pinata API documentation](https://docs.pinata.cloud/)
 - [Filebase S3 API guide](https://docs.filebase.com/api-documentation/s3-compatible-api)
-- [Filebase IPFS RPC API](https://docs.filebase.com/api-documentation/ipfs-rpc-api)
\ No newline at end of file
+- [Filebase IPFS RPC API](https://docs.filebase.com/api-documentation/ipfs-rpc-api)