@uniweb/semantic-parser

A semantic parser for ProseMirror/TipTap content structures that helps bridge the gap between natural content writing and component-based web development.

What it Does

The parser transforms rich text editor content (ProseMirror/TipTap) into structured, semantic groups that web components can easily consume. It provides two complementary views of your content:

Sequence: An ordered list of all content elements (for rendering in document order)
Groups: Content organized into semantic sections (main content + items)

Installation

npm install @uniweb/semantic-parser

Quick Start

import { parseContent } from "@uniweb/semantic-parser";

// Your ProseMirror/TipTap document
const doc = {
  type: "doc",
  content: [
    {
      type: "heading",
      attrs: { level: 1 },
      content: [{ type: "text", text: "Welcome" }],
    },
    {
      type: "paragraph",
      content: [{ type: "text", text: "Get started today." }],
    },
  ],
};

// Parse the content
const result = parseContent(doc);

// Access different views
console.log(result.sequence);  // Ordered array of elements
console.log(result.title);     // Main content fields at top level
console.log(result.items);     // Additional content groups

Output Structure

Sequence View

An ordered array of semantic elements preserving document order:

result.sequence = [
  { type: "heading", level: 1, content: "Welcome" },
  { type: "paragraph", content: "Get started today." }
]

Content Structure

Main content fields are at the top level. The items array contains additional content groups (created when headings appear after content), each with the same field structure:

result = {
  // Header fields (from headings)
  pretitle: "",             // Heading before main title
  title: "Welcome",         // Main heading
  subtitle: "",             // Heading after main title

  // Body fields
  paragraphs: ["Get started today."],
  links: [],                // All links (including buttons, documents)
  images: [],
  videos: [],
  icons: [],
  lists: [],
  quotes: [],
  data: {},                 // Structured data (tagged data blocks, forms, cards)
  headings: [],             // Headings after subtitle, in document order

  // Additional content groups (from headings after content)
  items: [
    { title: "Feature 1", paragraphs: [...], links: [...] },
    { title: "Feature 2", paragraphs: [...], links: [...] }
  ],

  // Ordered sequence for document-order rendering
  sequence: [...],

  // Original document
  raw: { type: "doc", content: [...] }
}
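
Because the main content and each entry in items share the same field structure, one helper can serve both. The sketch below assumes only the result shape documented above. `toCardProps` and the sample object are hypothetical: the sample is written by hand, not produced by the parser.

```javascript
// Hypothetical helper: maps any content group (main content or an item)
// to a flat props object a component could consume.
function toCardProps(group) {
  return {
    heading: group.title || "",
    body: (group.paragraphs || []).join(" "),
    linkCount: (group.links || []).length,
  };
}

// Hand-written sample in the documented shape (not real parser output).
const sample = {
  title: "Welcome",
  paragraphs: ["Get started today."],
  links: [],
  items: [
    { title: "Feature 1", paragraphs: ["Fast.", "Small."], links: [] },
    { title: "Feature 2", paragraphs: [], links: [] },
  ],
};

const hero = toCardProps(sample);
const cards = sample.items.map((item) => toCardProps(item));
console.log(hero, cards);
```

Missing fields fall back to empty values, so the helper tolerates groups that carry only a title.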

Common Use Cases

Extracting Main Content

const content = parseContent(doc);

const title = content.title;
const description = content.paragraphs.join(" ");
const image = content.images[0]?.url; // first image, if any (assumes image entries expose a url field)

Processing Content Sections

const content = parseContent(doc);

// Main content
console.log("Title:", content.title);
console.log("Description:", content.paragraphs);

// Additional content groups
content.items.forEach(item => {
  console.log("Section:", item.title);
  console.log("Content:", item.paragraphs);
});

Sequential Processing

const { sequence } = parseContent(doc);

sequence.forEach(element => {
  switch(element.type) {
    case 'heading':
      renderHeading(element);
      break;
    case 'paragraph':
      renderParagraph(element);
      break;
    case 'image':
      renderImage(element);
      break;
  }
});
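
The switch above calls render functions that are left undefined. A minimal sketch of what they might look like follows, assuming the element shapes shown in Sequence View. The `renderers` map and `renderSequence` are hypothetical names, and element content is inserted without escaping on the assumption that the parser has already turned inline formatting into HTML.

```javascript
// Hypothetical renderers keyed by element type. Content is inserted as-is
// because inline formatting is assumed to arrive as HTML already.
const renderers = new Map([
  ["heading", (el) => `<h${el.level}>${el.content}</h${el.level}>`],
  ["paragraph", (el) => `<p>${el.content}</p>`],
]);

// Renders elements in document order; types without a renderer are skipped.
function renderSequence(sequence) {
  return sequence
    .filter((el) => renderers.has(el.type))
    .map((el) => renderers.get(el.type)(el))
    .join("\n");
}

const html = renderSequence([
  { type: "heading", level: 1, content: "Welcome" },
  { type: "paragraph", content: "Get started today." },
  { type: "unknown-widget" }, // skipped: no renderer registered
]);
console.log(html);
```

A lookup table keeps the dispatch open for extension: supporting another element type means adding one entry rather than another case.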

Content Grouping

The parser supports two grouping modes:

Heading-Based Grouping (Default)

Groups are created based on heading patterns. A new group starts when:

A heading follows content
Multiple H1s appear (in which case no main content is created)
The heading level indicates a new section
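
To make the first rule concrete, here is a simplified sketch that implements only "a heading follows content". It is not the library's algorithm and ignores the other two rules. `groupByHeadings` is a hypothetical name, and the input uses the element shape from Sequence View.

```javascript
// Simplified sketch of the first rule only: a heading that appears after
// body content starts a new group. Consecutive headings stay together.
function groupByHeadings(sequence) {
  const groups = [];
  let current = null;
  let hasBody = false; // true once the current group holds non-heading content

  for (const el of sequence) {
    const isHeading = el.type === "heading";
    if (current === null || (isHeading && hasBody)) {
      current = [];
      groups.push(current);
      hasBody = false;
    }
    current.push(el);
    if (!isHeading) hasBody = true;
  }
  return groups;
}

const grouped = groupByHeadings([
  { type: "heading", level: 1, content: "Welcome" },
  { type: "paragraph", content: "Intro." },
  { type: "heading", level: 2, content: "Feature 1" },
  { type: "paragraph", content: "Details." },
]);
console.log(grouped.length); // 2
```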

Pretitle Detection: Any heading followed by a more important heading is automatically detected as a pretitle:

H3 before H1 → pretitle ✅
H2 before H1 → pretitle ✅
H6 before H5 → pretitle ✅
H4 before H2 → pretitle ✅

No configuration is needed.
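
The pretitle rule reduces to a comparison of heading levels, where a lower level number means a more important heading. The sketch below is an illustration of that rule, not the parser's code. `isPretitle` is a hypothetical name.

```javascript
// A heading is a pretitle when the heading that follows it is more
// important, i.e. has a smaller level number (H1 outranks H2).
function isPretitle(headingLevel, nextHeadingLevel) {
  return nextHeadingLevel < headingLevel;
}

// The four cases listed above, as [heading, following heading] pairs.
const pretitleCases = [
  [3, 1],
  [2, 1],
  [6, 5],
  [4, 2],
];
const pretitleResults = pretitleCases.map(([a, b]) => isPretitle(a, b));
console.log(pretitleResults);
```

Under this rule an H1 followed by an H2 is not a pretitle, and neither are two headings of the same level.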

Divider-Based Grouping

When any horizontal rule (---) is present, the entire document uses divider-based grouping. Groups are split explicitly by dividers.
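
As an illustration, divider-based splitting could look roughly like the sketch below. The node type names are an assumption ("horizontalRule" is TipTap's default name, "horizontal_rule" is the one used by prosemirror-schema-basic), `splitByDividers` is a hypothetical helper, and dropping empty groups is a choice made for this sketch that may differ from the parser's behavior.

```javascript
// Assumed divider node type names; the parser may recognize others.
const DIVIDER_TYPES = new Set(["horizontalRule", "horizontal_rule"]);

// Splits a document's top-level nodes into groups at each divider.
function splitByDividers(doc) {
  const groups = [[]];
  for (const node of doc.content || []) {
    if (DIVIDER_TYPES.has(node.type)) {
      groups.push([]); // a divider closes the current group
    } else {
      groups[groups.length - 1].push(node);
    }
  }
  // Drop empty groups left by leading, trailing, or repeated dividers.
  return groups.filter((group) => group.length > 0);
}

const para = (text) => ({ type: "paragraph", content: [{ type: "text", text }] });
const dividerGroups = splitByDividers({
  type: "doc",
  content: [
    para("A"),
    { type: "horizontalRule" },
    para("B"),
    { type: "horizontalRule" },
    { type: "horizontalRule" },
    para("C"),
  ],
});
console.log(dividerGroups.length); // 3
```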

Text Formatting

Inline formatting is preserved as HTML tags:

// Input: Text with bold mark
// Output: "Text with <strong>bold</strong>"

// Input: Text with italic mark
// Output: "Text with <em>emphasis</em>"

// Input: Link mark
// Output: "Click <a href=\"/docs\">here</a>"

// Input: Span mark (bracketed spans)
// Output: "This is <span class=\"highlight\">highlighted</span> text"

Span Marks

Bracketed spans ([text]{.class}) are converted to <span> elements with their attributes:

// Input mark
{ type: "span", attrs: { class: "highlight", id: "note-1" } }

// Output HTML
'<span class="highlight" id="note-1">text</span>'

Spans can have classes, IDs, and custom attributes. They combine with other marks: a span that is also bold yields text wrapped in both <span> and <strong> tags.
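
The mark-to-HTML conversion described in this section can be sketched as follows. This is not the parser's implementation. The mark type names ("bold", "italic", "link", "span") follow TipTap defaults and are assumptions, as are the nesting order and the escaping behavior. `renderTextNode`, `attrsToString`, and `escapeHtml` are hypothetical helpers.

```javascript
// Escapes characters that are unsafe in HTML text and attribute values.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Serializes an attrs object as HTML attributes, skipping empty values.
function attrsToString(attrs) {
  return Object.entries(attrs || {})
    .filter(([, value]) => value !== null && value !== undefined)
    .map(([key, value]) => ` ${key}="${escapeHtml(String(value))}"`)
    .join("");
}

// Wraps the text in one tag per mark. Marks are applied in array order,
// so the first mark ends up innermost (an assumption of this sketch).
function renderTextNode(node) {
  return (node.marks || []).reduce((html, mark) => {
    switch (mark.type) {
      case "bold":
        return `<strong>${html}</strong>`;
      case "italic":
        return `<em>${html}</em>`;
      case "link":
        return `<a${attrsToString({ href: mark.attrs?.href })}>${html}</a>`;
      case "span":
        return `<span${attrsToString(mark.attrs)}>${html}</span>`;
      default:
        return html; // unknown marks leave the text unchanged
    }
  }, escapeHtml(node.text));
}

const spanHtml = renderTextNode({
  type: "text",
  text: "text",
  marks: [{ type: "span", attrs: { class: "highlight", id: "note-1" } }],
});
console.log(spanHtml);
```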

Use Cases

Component-based websites: Extract structured data for React/Vue components
Content management: Parse editor content into database-friendly structures
Static site generation: Transform rich content into template-ready data
Content analysis: Analyze document structure and content types

License

GPL-3.0-or-later
