Skip to content

zkrebbekx/pgparse

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

43 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

pgparse

pgparse

ci codecov Go Reference

A pure-Go PostgreSQL SQL parser. No cgo. No WebAssembly runtime. No native library to link. It compiles into any Go program as ordinary source and imposes zero per-process memory overhead at startup.

Type any PostgreSQL and watch it parse instantly — AST tree, read/write classification, and real µs/parse timing, running 100% client-side (no server, your SQL never leaves the tab).

The library (github.com/zkrebbekx/pgparse) is pure Go with no WebAssembly — nothing you import pulls in a wasm runtime. WebAssembly is used only by the separate playground/ module, which compiles the pure-Go library to a GOOS=js build so it can run in the browser. Importing pgparse into your Go program involves no wasm and no cgo.

It exists to fill the gap between the two common options:

Approach Cost
pg_query_go Wraps the real Postgres parser via cgo — needs a C toolchain, cross-compilation pain, and a non-trivial binary.
WebAssembly builds of libpg_query Pure Go to call, but the wasm runtime allocates a large linear memory region on startup for every instance.
pgparse Hand-written scanner + recursive-descent parser. Pure Go, zero deps, no startup allocation, low per-parse allocation.

pgparse targets a pragmatic, high-coverage subset of PostgreSQL DML rather than the entire grammar — see Coverage. If you need every Postgres node exactly, use pg_query_go. If you need to parse the SQL that real apps actually write, fast, with no cgo, this is for you.

Requires Go 1.20+. Zero runtime dependencies (goconvey is a test-only dep).

Install

go get github.com/zkrebbekx/pgparse

Usage

package main

import (
	"fmt"
	"log"

	"github.com/zkrebbekx/pgparse"
)

func main() {
	res, err := pgparse.Parse(`
		SELECT u.id, u.name, count(o.id) AS orders
		FROM users u
		LEFT JOIN orders o ON o.user_id = u.id
		WHERE u.active = true
		GROUP BY u.id, u.name
		HAVING count(o.id) > 0
		ORDER BY orders DESC
		LIMIT 10
	`)
	if err != nil {
		log.Fatal(err)
	}

	sel := res.Stmts[0].(*pgparse.SelectStmt)
	fmt.Println("columns:", len(sel.Columns))
	fmt.Println("from:", sel.From[0].(*pgparse.JoinExpr).Kind)
}

Single-statement helper:

stmt, err := pgparse.ParseOne("DELETE FROM logs WHERE created_at < $1")
del := stmt.(*pgparse.DeleteStmt)

Token stream only (e.g. for highlighting):

toks, _ := pgparse.Tokenize("SELECT 1 + 2")

Coverage

Statements

  • SELECT — projection with aliases, DISTINCT / DISTINCT ON, FROM with comma and explicit joins (INNER, LEFT, RIGHT, FULL, CROSS, OUTER, ON / USING, LATERAL subqueries), WHERE, GROUP BY, HAVING, ORDER BY (ASC/DESC, NULLS FIRST/LAST), LIMIT, OFFSET
  • WITH CTEs (incl. RECURSIVE) and subqueries in FROM / expressions
  • Set operations: UNION / INTERSECT / EXCEPT (ALL), with correct precedence (INTERSECT binds tighter) and a tail that binds to the whole expression
  • Window functions: OVER (PARTITION BY ... ORDER BY ...), frame clauses (ROWS/RANGE/GROUPS ... BETWEEN ... AND ..., EXCLUDE), named windows (OVER w), FILTER (WHERE ...), WITHIN GROUP (ORDER BY ...), and aggregate ORDER BY (array_agg(x ORDER BY y))
  • INSERT — column lists, multi-row VALUES, INSERT ... SELECT, ON CONFLICT (...) DO NOTHING | DO UPDATE SET ..., RETURNING
  • UPDATE — single and multi-column SET (a, b) = (v1, v2) / = (SELECT ...), FROM, WHERE, RETURNING
  • DELETEUSING, WHERE, RETURNING
  • DDLCREATE TABLE (column + table constraints: PRIMARY KEY, NOT NULL, UNIQUE, DEFAULT, CHECK, REFERENCES/FOREIGN KEY, IF NOT EXISTS, TEMP), CREATE [OR REPLACE] VIEW, CREATE [UNIQUE] INDEX (USING, partial WHERE, DESC), DROP TABLE/VIEW/INDEX (IF EXISTS, CASCADE), ALTER TABLE (ADD/DROP COLUMN, ADD/DROP CONSTRAINT, ALTER COLUMN … TYPE/SET/DROP DEFAULT/NOT NULL, RENAME)
  • Query features — VALUES lists (standalone, in FROM, as set-op operands), set-returning functions in FROM (WITH ORDINALITY), LATERAL and NATURAL joins, GROUP BY ROLLUP/CUBE/GROUPING SETS, ORDER BY … USING, FETCH FIRST … ROWS ONLY/WITH TIES, FOR UPDATE/SHARE locking
  • Expression extras — quantified op ANY/ALL (array | subquery), ARRAY[…] and ARRAY(subquery), IS [NOT] DISTINCT FROM, row/tuple constructors, the full open operator class (geometric, range, text-search operators)
  • Utility & admin statements (SET, SHOW, COPY, GRANT/REVOKE, ANALYZE, VACUUM, EXPLAIN, transaction control, TRUNCATE, COMMENT, CREATE TYPE/SEQUENCE/FUNCTION/…, and DDL options pgparse does not model) are recognised and validated as RawStmt — leading keyword plus verbatim text

Expressions

  • Logical (AND / OR / NOT), comparison, arithmetic, || concatenation — all with correct precedence and associativity
  • JSON/array operators (-> ->> #> #>> @> <@ ? ?| ?&), bitwise (& | # << >> ~), and regex match (~ ~* !~ !~*); array subscript and slice (a[1], a[1:3], a[:2])
  • CASE (simple + searched), CAST(x AS t) and x::t, typed literals (date '...'), INTERVAL '90' day
  • IN (list | subquery), BETWEEN, LIKE / ILIKE, IS [NOT] NULL/TRUE/FALSE
  • EXISTS (...), scalar subqueries, function calls (DISTINCT, count(*)), SQL special functions (extract, substring, position, trim, overlay)
  • Literals (string with '' escapes, dollar-quoted, int/float, bool, NULL), positional parameters $n, qualified names schema.table.column, table.*

Out of scope (use pg_query_go): MERGE, GRANT/REVOKE, PL/pgSQL bodies, COPY, multi-word type names (timestamp with time zone, double precision — use timestamptz etc.), the long tail of ALTER sub-commands, and exact pg_query node-tree compatibility. Operators are lexed by maximal munch, so adjacent operator characters without a space (e.g. a=-1) are read as one operator (=-); write a = -1.

The AST is idiomatic typed Go (see ast.go) — ergonomic to walk and pattern-match, not a protobuf mirror. Walk traverses it in pre-order:

stmt, _ := pgparse.ParseOne("SELECT * FROM a JOIN b ON a.id = b.a")
var tables []string
pgparse.Walk(stmt, func(n pgparse.Node) bool {
        if t, ok := n.(*pgparse.TableName); ok {
                tables = append(tables, t.Name)
        }
        return true // return false to skip a node's children
})
// tables == ["a", "b"]

Read vs. write classification

Mutates answers a common question in one call: does this SQL change data or schema? Useful as a routing or observability hint.

res, _ := pgparse.Parse("UPDATE accounts SET balance = balance - 10 WHERE id = $1")
res.Mutates()   // true

r2, _ := pgparse.Parse("SELECT * FROM accounts WHERE id = $1")
r2.Mutates()    // false  (r2.ReadOnly() == true)

Not a security boundary. Classification is purely syntactic, so it cannot see side effects inside functions — SELECT nextval('s') and SELECT writes_rows() are reported read-only because they look like reads. pgparse also parses a subset of PostgreSQL and is not the authoritative parser. Use Mutates as a hint; gate real writes with server-side controls (read-only roles, default_transaction_read_only, a read-replica connection).

Classify(stmt) gives the finer category — ClassReadOnly, ClassWrite (INSERT/UPDATE/DELETE, TRUNCATE, MERGE, data-modifying CTEs, SELECT … INTO), ClassDDL (CREATE/ALTER/DROP, GRANT, …), ClassTransaction (BEGIN/COMMIT/ROLLBACK/SAVEPOINT — non-mutating), or ClassUtility (recognised admin statements whose effect is not modelled, treated as possibly-writing). Data-modifying CTEs (WITH x AS (UPDATE …) SELECT …) are detected as writes, and EXPLAIN is treated as possibly-mutating because EXPLAIN ANALYZE executes its argument. SELECT … FOR UPDATE is classified read-only — it locks rows but does not change data (note it still requires a primary, not a read replica).

Deparse (AST → SQL)

Deparse renders any AST node back to SQL. It is deterministic and idempotent (deparse∘parse∘deparse == deparse∘parse), which makes it both a formatting utility and a round-trip test oracle — every query in the test corpus is verified to survive parse → print → re-parse unchanged.

stmt, _ := pgparse.ParseOne("select a,b from t where a>1")
fmt.Println(pgparse.Deparse(stmt))
// SELECT a, b FROM t WHERE a > 1

Robustness

Parse never panics: any internal panic is recovered and returned as an error, and a Go fuzz target (FuzzParse) drives the unguarded parser to prove it. Millions of executions over arbitrary and malformed input produce errors, never crashes or hangs.

go test -fuzz=FuzzParse -fuzztime=30s

The suite is BDD (goconvey), race-tested for concurrency, round-trips every parse through the deparser, and drives the full PostgreSQL regression corpus through parse → classify → deparse → re-parse — ~89% statement coverage (go test -cover).

Completeness vs other Go parsers

A feature-by-feature acceptance matrix (in comparison/) over 23 representative constructs, parsed by each engine. pg_query_go (the real PostgreSQL parser) is the fidelity baseline; pgparse and GoSQLX are the pure-Go contenders.

Versions benchmarked (latest releases at time of writing): pg_query_go v6.2.2 · GoSQLX v1.14.0 · go-pgquery (wasm, latest, wazero v1.12.0) · pgparse — this repo. Reproduce with make compare / make memcompare.

pgparse pg_query_go v6 GoSQLX v1.14
features accepted 23 / 23 23 / 23 20 / 23
multi-column UPDATE SET (a,b)=(…)
extract / substring keyword syntax
typed literal + INTERVAL '90' day
LATERAL join
DDL CREATE TABLE

pgparse matches the real-PostgreSQL baseline across all 23 representative constructs, and parses the multi-column UPDATE form GoSQLX rejects.

Breadth: the PostgreSQL regression suite

A 23-feature matrix only proves the happy path. For a breadth measure, pgparse is run against a subset of PostgreSQL's own regression test suite (src/test/regress/sql) — the SQL that libpg_query/pg_query_go is validated against. Each file is split into statements; a statement counts only when pg_query_go (the real parser) accepts it, which discards psql meta-commands and the suite's deliberately-invalid cases. The score is the share of those genuinely-valid statements each pure-Go parser also accepts:

statements accepted
pgparse 7,985 99.4%
GoSQLX v1.14 7,985 48.4%

pgparse accepts 99.4% of the real PostgreSQL statements pg_query_go accepts, far ahead of GoSQLX. Of those, DML, queries, and core DDL are parsed into a full typed AST; utility and administrative statements (SET, COPY, GRANT, ANALYZE, CREATE TYPE/SEQUENCE/FUNCTION, DROP ROLE, …) and the DDL options pgparse does not model (partitioning, storage parameters, …) are recognised as a RawStmt: the leading keyword plus the verbatim, delimiter-validated statement text. So "accepted" means recognised and lexically validated, with full structure for the DML/DDL core. The remaining ~2% are deep edge cases (subqueries nested inside aggregate ORDER BY, array slices as INSERT targets, …). For an exhaustive node tree of every statement, use pg_query_go.

vs pg_query_go, in one place

pg_query_go is the PostgreSQL parser (via cgo), so on coverage it is the 100% baseline. The trade is everything else. Measured over the ~5,300 regression statements both engines accept:

pgparse pg_query_go v6.2.2
statements accepted (valid PG) 98% 100%
latency / statement ~3.0 µs ~55 µs
speedup ~18×
memory / statement ~2.3 KB ~2.9 KB
allocations / statement ~18 ~57
cgo / C toolchain none required
startup memory cost none C runtime

Latency, memory, and allocations are measured over the statements pgparse parses into a full AST (RawStmt-recognised statements are excluded so both sides do real parsing work). In short: pg_query_go for exhaustive fidelity; pgparse when you want most of the SQL real apps write, ~18× faster, with no cgo.

Reproduce all of the above with make compare.

Memory, CPU & concurrency vs cgo / wasm

The original motivation for pgparse was a WebAssembly build of pg_query (wasilibs/go-pgquery, which runs libpg_query as a wasm module via wazero) using too much memory. That is real and measurable. Each engine here parsed the regression corpus repeatedly; RSS is the process peak (cgo and wasm memory does not appear in Go's MemStats). Apple M-series, one process per engine (make memcompare):

Single-threaded

engine ns / statement startup RSS peak RSS
pgparse ~2,600 3 MB 19 MB
GoSQLX v1.14 ~7,700 5 MB 21 MB
pg_query_go v6 (cgo) ~51,000 12 MB 26 MB
go-pgquery (wasm) ~95,000 255 MB 273 MB

8 concurrent workers

engine ns / statement peak RSS
pgparse ~1,240 26 MB
GoSQLX v1.14 ~4,500 28 MB
pg_query_go v6 (cgo) ~7,500 31 MB
go-pgquery (wasm) ~28,300 2,250 MB

The wasm memory issue, and why pgparse avoids it. WebAssembly linear memory can only growmemory.grow never shrinks, and freed blocks are not returned to the OS. A wasm parser therefore ratchets up to the high-water mark of the largest/most queries it has seen and stays there; because a wazero instance is single-threaded, concurrency needs a pool of instances, each holding its own grown memory. In the table above that compounds to ~2.4 GB under 8 workers, on top of a 255 MB fixed startup cost.

pgparse has none of this: it is pure Go with no linear memory, so each parse allocates only its AST (tokens alias the input string — zero copy) and the garbage collector reclaims it and returns pages to the OS. Memory stays flat (~26 MB) under 8× concurrency, ~95× less than the wasm build and below the cgo binding, while running ~5–20× faster.

Concurrency safety. pgparse holds no shared mutable state — the keyword and word tables are read-only, and each Parse builds its own lexer and parser — so it is safe to call from any number of goroutines with no pool and no lock. This is enforced by a -race test (concurrent_test.go) running tens of thousands of overlapping parses.

Performance

Hand-written byte scanner (no regex) producing tokens whose values alias the source string, feeding a single-pass recursive-descent parser. No reflection, no intermediate string copies in the hot path.

Measured on Go 1.26 / arm64 (go test -bench=. -benchmem):

BenchmarkParse/simple-10        774518     1564 ns/op   25.58 MB/s    1488 B/op    19 allocs/op
BenchmarkParse/join-10          194725     6043 ns/op   41.04 MB/s    5168 B/op    65 allocs/op
BenchmarkParse/cte_window-10    195241     6195 ns/op   49.56 MB/s    5752 B/op    64 allocs/op
BenchmarkParse/expr_heavy-10    127866     9399 ns/op   27.66 MB/s   11984 B/op    90 allocs/op
BenchmarkParse/insert-10        268135     4464 ns/op   43.23 MB/s    3448 B/op    41 allocs/op
BenchmarkTokenize-10            399202     2997 ns/op   82.74 MB/s    3176 B/op    14 allocs/op

Unlike cgo/wasm approaches, there is no fixed startup memory cost — a process that never parses pays nothing, and each parse allocates only its own AST.

Reproduce:

make bench

Head-to-head vs pg_query_go

Benchmarked over the TPC-H query corpus (all 22 queries parse). Both engines parse the identical queries; pg_query_go v6.2.2 is the cgo binding around the real PostgreSQL parser. Apple M-series, go test -bench=Corpus -benchmem:

Engine ns / query B / query allocs / query
pgparse ~14,200 ~9,700 ~94
pg_query_go v6 ~376,000 ~17,700 ~392
pgparse advantage ~26× faster ~2× less ~4× fewer

pg_query_go does strictly more — it produces the full-fidelity PostgreSQL node tree for every statement kind. Its per-call cost includes the cgo boundary and protobuf (de)serialization of that tree, which is the real price a Go program pays to use it. pgparse trades exhaustive fidelity for a lean Go AST, no cgo, and the throughput above. Pick accordingly.

The comparison lives in a separate module (comparison/) so the root module never pulls in cgo. Reproduce:

make compare        # see Makefile; sets the macOS CGO workaround

Coverage of the corpus is asserted by TestTPCHCoverage (22/22).

Stability

pgparse follows Semantic Versioning. The public API — the top-level functions (Parse, ParseOne, Tokenize, Deparse, Classify, Mutates) and the exported AST types in ast.go / DDL nodes — is the compatibility surface.

As of v1.0.0 the public API is covered by SemVer:

  • No breaking changes to the public API within a major version. New AST fields and new statement/expression kinds may be added (so always include a default case when type-switching over Stmt/Expr); fields and types are not removed or renamed except in a new major version.
  • Parse never panics and bounds recursion and input size.

Check the CHANGELOG before upgrading.

Safety on untrusted input and the limits of Mutates/Classify are described in SECURITY.md.

Security

Run on untrusted input safely: Parse never panics, recursion depth is bounded, and MaxInputBytes caps input size. See SECURITY.md for the threat model and how to report a vulnerability. Note that Mutates/Classify is a syntactic hint, not a security control.

License

MIT — see LICENSE.

About

Pure-Go PostgreSQL SQL parser — no cgo, no WebAssembly, zero startup memory. Parses a pragmatic Postgres DML subset into an idiomatic Go AST.

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors