thepriben/StatsWiki

StatsWiki

Most-read articles on English Wikipedia — daily rankings from July 1, 2015 to yesterday.

Live site: https://statswiki.info/

X: https://x.com/statswiki

Bluesky: https://bsky.app/profile/statswiki.bsky.social

Preprint (Wikirace): From Events to Encyclopedic Attention

MIT license — fork for another language or project → ADAPT.md.


At a glance

| Aspect | Details |
| --- | --- |
| Data | Wikimedia Pageviews API → Parquet → static JSON |
| Site | Vue 3 SPA on GitHub Pages (no runtime API calls) |
| Updates | Daily cron + manual backfill for history |
| Enrichment | Wikidata QID, label, description, image |
| Rankings | Top 50 per day, month, year, all-time |

What the site shows

Home

Three live panels (top 50 each), with fallback to the latest available period when yesterday / this month are not yet ingested:

  • Yesterday (or latest day)
  • This month (or latest month)
  • This year
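
The fallback rule for the day panel can be sketched as follows. This is only an illustration in Python (the site itself is a Vue app), and `pick_home_day` and the `ingested` set are hypothetical names, not taken from the repository.

```python
from datetime import date, timedelta
from typing import Optional


def pick_home_day(today: date, ingested: set[date]) -> Optional[date]:
    """Return yesterday if it is ingested, else the latest earlier ingested day."""
    yesterday = today - timedelta(days=1)
    if yesterday in ingested:
        return yesterday
    # Fallback: most recent day that is already available (assumed behaviour).
    earlier = [d for d in ingested if d < yesterday]
    return max(earlier) if earlier else None
```

The same idea would apply to the month panel, with months in place of days.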

Period pages

| View | URL | Content |
| --- | --- | --- |
| Day | /2026/05/31 | Top 50 that day |
| Month | /2026/05 | Top 50 aggregated over the month |
| Year | /2026 | Top 50 aggregated over the year |
| All time | /alltime | Top 50 since July 2015 |

Browse via Year / Month / Day dropdowns in the header (no date in the page title).

Article stats (QID)

Click a Wikidata QID in any table → /q/Q22686 with monthly / yearly view charts, total views, peak period.

Each row: rank, Wikipedia link, QID, description, thumbnail (links to Wikimedia Commons), view count.

Wikirace

Compare daily Wikipedia pageviews for a group of articles over any date range. Methodology in the Wikirace preprint above.

| View | URL | Content |
| --- | --- | --- |
| Builder | /wikirace | Search catalog, pick articles, set dates |
| Race | /wikirace/Q1+Q2/YYYY-MM-DD/YYYY-MM-DD | Chart, Race% table, shareable link |
| Help | /wikirace/help | Public guide (from docs/wikirace-help.md) |

Race% = one article’s views as a % of the group total (area under the curve). Data is fetched live from the Wikimedia Pageviews API.
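
As a minimal sketch of the Race% definition above, assuming "area under the curve" means the plain sum of daily views over the selected range (the function name `race_percent` and the input shape are illustrative, not the site's actual code):

```python
def race_percent(daily_views: dict[str, list[int]]) -> dict[str, float]:
    """Map each QID to its share (in %) of the group's total views.

    daily_views maps a QID to its daily view counts over the date range.
    """
    totals = {qid: sum(views) for qid, views in daily_views.items()}
    group_total = sum(totals.values())
    if group_total == 0:
        # No views at all in the range: avoid dividing by zero.
        return {qid: 0.0 for qid in totals}
    return {qid: 100.0 * total / group_total for qid, total in totals.items()}
```

For example, an article with 60 views in a group that totals 100 views gets a Race% of 60.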

Docs: docs/wikirace.md (maintainer README) · docs/wikirace-help.md (public help → npm run build:help)


Architecture

Wikimedia Pageviews API     one HTTP request per day
         │
         ▼
data/pageviews/             Parquet (date, article, views, rank)
data/articles.parquet       Wikidata catalog
         │
         ▼  aggregate + merge by QID
web/public/data/            static JSON (top 50 per period)
         │
         ▼
Vue 3 SPA                   GitHub Pages CDN

Day → month → year: months and years are sums of daily rows, never fetched separately. See consolidation below.

Redirects: old article titles that share a Wikidata item have views merged before ranking.
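
The redirect merge can be sketched like this. It assumes a title-to-QID mapping is available from the articles catalog and that titles without a QID are ranked under their own title; `merge_by_qid` and `top_n` are hypothetical helper names.

```python
from collections import defaultdict


def merge_by_qid(views_by_title: dict[str, int],
                 title_to_qid: dict[str, str]) -> dict[str, int]:
    """Sum views of all titles that resolve to the same Wikidata item."""
    merged: dict[str, int] = defaultdict(int)
    for title, views in views_by_title.items():
        # Assumption: unmapped titles keep their own key.
        key = title_to_qid.get(title, title)
        merged[key] += views
    return dict(merged)


def top_n(merged: dict[str, int], n: int = 50) -> list[tuple[str, int]]:
    """Rank by views (descending), breaking ties by key for stable output."""
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```

Merging before ranking matters: an article renamed mid-period would otherwise appear as two weaker entries instead of one.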


Quick start (local)

# Pipeline
cd pipeline && python3 -m venv .venv && source .venv/bin/activate
pip install -e .

sw-fetch --date 2026-05-01          # one day
sw-backfill --year 2026               # full year
sw-daily                              # yesterday + export
sw-export-qids                        # QID time-series JSON

# Frontend
cd web && npm ci && npm run dev
# → http://localhost:5173/

Deployment (GitHub Pages)

Custom domain: statswiki.info — DNS at the registrar, web/public/CNAME, and Settings → Pages → Custom domain on thepriben/StatsWiki.

  1. Settings → Pages → Source: GitHub Actions (one-time).
  2. Push to main → Deploy Pages runs when web/ or data/ changes.
  3. Backfill and daily workflows commit data, then deploy in the same run.

| Workflow | Trigger | Role |
| --- | --- | --- |
| Deploy Pages | Push or manual | Build Vue → publish |
| Daily update | 08:00 & 14:00 UTC or manual | Yesterday → daily top 5 + period posts → commit → deploy |
| Backfill | Manual (pick year) | One year of history |
| Backfill sequence | Manual | 2025 → 2016 in one job |

Backfill order (recommended)

  1. Current year first — homepage needs recent data.
  2. Backfill sequence (or year-by-year) down to 2015; for 2015 the data starts on July 1.
  3. Leave Daily update enabled.

~5–10 minutes per year on GitHub Actions.

Daily fetch schedule

Wikimedia publishes top/day pageviews roughly 24 hours after UTC midnight. The workflow runs twice:

| Run | UTC | Purpose |
| --- | --- | --- |
| Primary | 08:00 | Fetch yesterday, enrich, export |
| Retry | 14:00 | Same pipeline if morning data was not ready |

If data is not available yet: the fetch retries up to 3× per attempt (with backoff), then the job exits without commit or deploy. The 14:00 run tries again automatically.
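
A minimal sketch of this retry behaviour, assuming three tries in total and an exponential backoff starting at 60 seconds (the actual delays are not documented here). `DataNotReady` and `fetch_with_retry` are hypothetical names.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class DataNotReady(Exception):
    """Hypothetical error raised when the requested day is not published yet."""


def fetch_with_retry(
    fetch: Callable[[], T],
    tries: int = 3,
    base_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call fetch() up to `tries` times; return None if the data never appears."""
    for attempt in range(tries):
        try:
            return fetch()
        except DataNotReady:
            if attempt < tries - 1:
                # Assumed backoff schedule: 60 s, 120 s, ...
                sleep(base_delay * 2 ** attempt)
    return None
```

A caller would treat `None` as "exit without commit or deploy" and leave the next attempt to the 14:00 run.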

If yesterday is already in the database (e.g. after a successful morning run), the fetch is skipped but enrich/export still run — useful if Wikidata mapping changed.

Social posts (@statswiki on X and Bluesky)

After each successful daily run:

| Trigger | When | Post |
| --- | --- | --- |
| Day | Every run | Top 5 for yesterday |
| Week | Yesterday was Sunday | Top 5 for Mon–Sun (e.g. Mon 25 May – Sun 31 May 2026) |
| Month | Yesterday was the last day of the month | Top 5 for that month |
| Year | Yesterday was 31 December | Top 5 for that year |

Manual dry-run: sw-period-posts --dry-run --date YYYY-MM-DD --force
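
The trigger rules in the table can be expressed as a small date check. This is a sketch only; `due_posts` is a hypothetical name and the real sw-period-posts logic may differ.

```python
import calendar
from datetime import date


def due_posts(yesterday: date) -> list[str]:
    """Return which posts are due after a successful run for `yesterday`."""
    due = ["day"]  # the daily top 5 is posted on every run
    if yesterday.weekday() == 6:  # Monday is 0, so 6 is Sunday
        due.append("week")
    last_day = calendar.monthrange(yesterday.year, yesterday.month)[1]
    if yesterday.day == last_day:
        due.append("month")
    if (yesterday.month, yesterday.day) == (12, 31):
        due.append("year")
    return due
```

31 May 2026 is both a Sunday and the last day of the month, so that date would trigger the day, week, and month posts together.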


Repository layout

StatsWiki/
├── web/                         # Vue 3 frontend
│   ├── src/
│   │   ├── App.vue              # routing, header, home
│   │   ├── QidPage.vue          # article stats + chart
│   │   ├── RankingTable.vue
│   │   ├── wikirace/            # Wikirace feature
│   │   └── lib.js
│   ├── public/wikirace/         # groups.json, catalog.json, help.json
│   └── public/data/             # generated JSON (+ q/Q*.json)
├── docs/
│   ├── wikirace.md              # Wikirace maintainer README
│   └── wikirace-help.md         # Wikirace public help (English)
├── data/                        # Parquet source of truth
│   ├── pageviews/year=Y/month=M/
│   ├── articles.parquet
│   └── manifest.json
├── pipeline/src/statswiki/      # Python ETL
└── .github/workflows/

Pipeline commands

| Command | Purpose |
| --- | --- |
| sw-fetch --date YYYY-MM-DD | Ingest one day |
| sw-backfill --year YYYY | Ingest year + Wikidata top 1000 + export |
| sw-daily | Yesterday + enrich + export recent |
| sw-enrich --top 500 | Re-enrich top articles by total views |
| sw-enrich --refresh-shadows 100 | Retry unresolved QIDs |
| sw-export --recent | Rebuild yesterday / month / year / alltime JSON |
| sw-export --year YYYY | Export all periods for one year |
| sw-export-qids | Export data/q/Q*.json time series for charts |
| sw-wikirace-catalog | Export web/public/wikirace/catalog.json for autocomplete |
| sw-period-posts | Post week/month/year top 5 to X and Bluesky when due |

| npm (in web/) | Purpose |
| --- | --- |
| npm run build:help | docs/wikirace-help.md → web/public/wikirace/help.json |

All ingest is idempotent — existing days are skipped.
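
Idempotent ingest can be sketched as below, assuming the set of already-ingested days can be derived from the Parquet partitions. `ingest_missing` and `fetch_day` are hypothetical names used for illustration.

```python
from datetime import date
from typing import Callable, Iterable


def ingest_missing(
    days: Iterable[date],
    existing: set[date],
    fetch_day: Callable[[date], None],
) -> list[date]:
    """Fetch only days that are not stored yet; return the days fetched."""
    fetched = []
    for day in days:
        if day in existing:
            continue  # already ingested: skip without calling the API
        fetch_day(day)
        existing.add(day)
        fetched.append(day)
    return fetched
```

Running the same range twice fetches nothing the second time, which is what makes re-running a backfill safe.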


Data model

Pageviews (data/pageviews/)

| Column | Description |
| --- | --- |
| date | Day |
| article | Title with underscores (as in API) |
| views | View count |
| rank | Position in daily top ~1000 |
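
To show where these columns come from, here is a sketch that builds the URL of the public Wikimedia "top" pageviews endpoint for one day and flattens a response into rows. The `all-access` setting and the helper names `top_url` and `to_rows` are assumptions; the pipeline may use different options.

```python
from datetime import date

# Public Wikimedia REST endpoint for the most-viewed articles of one day.
TOP_URL = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/top/"
    "{project}/{access}/{y}/{m:02d}/{d:02d}"
)


def top_url(day: date, project: str = "en.wikipedia",
            access: str = "all-access") -> str:
    """Build the request URL for one day (month and day zero-padded)."""
    return TOP_URL.format(project=project, access=access,
                          y=day.year, m=day.month, d=day.day)


def to_rows(day: date, payload: dict) -> list[dict]:
    """Flatten the API response into (date, article, views, rank) rows."""
    articles = payload["items"][0]["articles"]
    return [
        {"date": day.isoformat(), "article": a["article"],
         "views": a["views"], "rank": a["rank"]}
        for a in articles
    ]
```

Each returned row matches one Parquet row in the table above.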

Articles catalog (data/articles.parquet)

| Column | Description |
| --- | --- |
| article | Pageview title |
| qid | Wikidata QID (e.g. Q22686) |
| resolved_title | Canonical title after Wikipedia redirects |
| label, description, image | From Wikidata |
| updated_at | Last enrichment |

Export JSON (web/public/data/)

Each file has period, lines (array of ranked articles), and optionally nav (sub-links on year/month views).

| Field | Description |
| --- | --- |
| rank | 1–50 |
| title | Wikipedia title (Article_Name) |
| label | Display name from Wikidata |
| description | Short Wikidata description |
| views | View count for the period |
| qid | Wikidata ID (e.g. Q12345) |
| image | Commons thumbnail URL |

manifest.json: start, end, updated, language.
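
A sketch of how one export file could be assembled from an already-sorted list of articles, using the field names listed above. The `build_export` helper and the format of the `period` string are assumptions, and the optional `nav` key is left out.

```python
import json

LINE_FIELDS = ("title", "label", "description", "views", "qid", "image")


def build_export(period: str, ranked: list[dict], limit: int = 50) -> dict:
    """Return a dict with `period` and at most `limit` ranked `lines`."""
    lines = [
        # Missing optional fields (e.g. no image) become None / JSON null.
        {"rank": rank, **{field: item.get(field) for field in LINE_FIELDS}}
        for rank, item in enumerate(ranked[:limit], start=1)
    ]
    return {"period": period, "lines": lines}


if __name__ == "__main__":
    demo = build_export("2026-05-31", [{"title": "Main_Page", "views": 100}])
    print(json.dumps(demo, indent=2))
```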


Day → month → year

1 API call / day  →  Parquet row per (date, article)
                         │
                         ├─ SUM(days in month)  →  month/YYYY/MM.json
                         ├─ SUM(days in year)   →  year/YYYY.json
                         └─ SUM(all days)       →  alltime.json
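
The consolidation above amounts to three running sums over the daily rows. A minimal sketch with the standard library, assuming rows are dicts with ISO `date`, `article`, and `views` keys (the real pipeline reads Parquet and may use a different method):

```python
from collections import Counter, defaultdict


def aggregate(rows: list[dict]):
    """Sum daily rows into per-month, per-year, and all-time totals."""
    month_totals: dict[str, Counter] = defaultdict(Counter)
    year_totals: dict[str, Counter] = defaultdict(Counter)
    alltime: Counter = Counter()
    for row in rows:
        year, month, _day = row["date"].split("-")
        month_totals[f"{year}/{month}"][row["article"]] += row["views"]
        year_totals[year][row["article"]] += row["views"]
        alltime[row["article"]] += row["views"]
    return month_totals, year_totals, alltime
```

`alltime.most_common(50)` would then give the candidates for the all-time ranking, with no extra API calls.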

Wikidata

Batched enrichment (50 titles / request):

  1. QID — Wikipedia pageprops, follows redirects
  2. Fallbacks — Wikidata search + opensearch
  3. Entity — label, description, image (P18 / P154)
  4. Export — merge views by QID before top-50 ranking

Manual overrides in filters.py for edge cases. Shadow QIDs (Q_en_…) retried on high-traffic articles.

Modules: wikidata.py, mapping.py, qid_export.py.
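
The batching step can be sketched as follows: titles are split into groups of 50 and each group becomes one MediaWiki Action API query for the `wikibase_item` page property, with redirects followed. The helper names are hypothetical, and the exact parameters used in wikidata.py are an assumption.

```python
from typing import Iterator


def batches(titles: list[str], size: int = 50) -> Iterator[list[str]]:
    """Yield consecutive chunks of at most `size` titles."""
    for start in range(0, len(titles), size):
        yield titles[start:start + size]


def pageprops_params(batch: list[str]) -> dict[str, str]:
    """Query parameters for https://en.wikipedia.org/w/api.php (one batch)."""
    return {
        "action": "query",
        "format": "json",
        "prop": "pageprops",
        "ppprop": "wikibase_item",  # the page property that holds the QID
        "redirects": "1",           # resolve redirects before reading pageprops
        "titles": "|".join(batch),  # multiple titles are pipe-separated
    }
```

With 1000 titles this gives 20 requests instead of 1000.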


Fork for another language

This repo tracks English Wikipedia only. To run StatsWiki for French, German, Japanese, etc.:

ADAPT.md — step-by-step fork guide (config, Pages URL, Wikidata language, backfill).

Multi-language in a single site is not implemented. One fork per language is the intended model. Pull requests to this repo are not accepted — fork under MIT and maintain your own copy.


License

Code: MIT

Data (Wikipedia / Wikidata content shown on the site): Wikimedia Terms of Use, Wikidata CC0 (Commons images retain their own licenses).

About

Lightweight, forkable Wikipedia pageview rankings powered by Wikidata and Parquet (this is the English version).
