Commit e00c52e

Merge branch 'main' into loss_trailing_zeros_formatted_num
Signed-off-by: Luca Selvaggio <[email protected]>
2 parents 3d9cc21 + aa5c668 commit e00c52e

166 files changed: +15110 -1671 lines changed

CHANGELOG.md

Lines changed: 25 additions & 0 deletions
@@ -1,3 +1,28 @@
+## [v2.52.0](https://github.com/docling-project/docling-core/releases/tag/v2.52.0) - 2025-11-20
+
+### Feature
+
+* **experimental:** Add new DocTags serializer ([#412](https://github.com/docling-project/docling-core/issues/412)) ([`c9e5fb4`](https://github.com/docling-project/docling-core/commit/c9e5fb4a1ceb1ec0cae8ebae5f3eb844c0a2198a))
+* Convert regions into TableData ([#430](https://github.com/docling-project/docling-core/issues/430)) ([`c80b583`](https://github.com/docling-project/docling-core/commit/c80b58369c5bbb1be779a241fee146aa1b3a3685))
+
+## [v2.51.1](https://github.com/docling-project/docling-core/releases/tag/v2.51.1) - 2025-11-14
+
+### Fix
+
+* Improve meta migration ([#422](https://github.com/docling-project/docling-core/issues/422)) ([`bc0e96b`](https://github.com/docling-project/docling-core/commit/bc0e96b9dc298d2e96ab2b4ce9faa4165d661b94))
+* DoclingDocument model validator should deal with any raw input ([#419](https://github.com/docling-project/docling-core/issues/419)) ([`56b3c42`](https://github.com/docling-project/docling-core/commit/56b3c42c61dbca7e9aa4a44fae18ecaadb482f81))
+
+## [v2.51.0](https://github.com/docling-project/docling-core/releases/tag/v2.51.0) - 2025-11-12
+
+### Feature
+
+* Add code chunking functionality ([#398](https://github.com/docling-project/docling-core/issues/398)) ([`3097645`](https://github.com/docling-project/docling-core/commit/3097645198915a1258cfe6e1d5df3b5f1c79395a))
+
+### Fix
+
+* Improve meta migration and warning handling ([#417](https://github.com/docling-project/docling-core/issues/417)) ([`3d13b02`](https://github.com/docling-project/docling-core/commit/3d13b02756f1c0d1f1ccab5cfbd76f1f888a0dd9))
+* Fix import handling of extra dependencies for chunking ([#418](https://github.com/docling-project/docling-core/issues/418)) ([`567d3ad`](https://github.com/docling-project/docling-core/commit/567d3ada57e19b2a738991ae6e49d55dd3301b17))
+
 ## [v2.50.1](https://github.com/docling-project/docling-core/releases/tag/v2.50.1) - 2025-11-04

 ### Fix

README.md

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 # Docling Core

 [![PyPI version](https://img.shields.io/pypi/v/docling-core)](https://pypi.org/project/docling-core/)
-![Python](https://img.shields.io/badge/python-3.9%20%7C%203.10%20%7C%20%203.11%20%7C%203.12%20%7C%203.13-blue)
+![Python](https://img.shields.io/badge/python-3.9%20%7C%203.10%20%7C%20%203.11%20%7C%203.12%20%7C%203.13%20%7C%203.14-blue)
 [![uv](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json)](https://github.com/astral-sh/uv)
 [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
@@ -21,7 +21,7 @@ pip install docling-core

 ### Development setup

-To develop for Docling Core, you need Python 3.9 / 3.10 / 3.11 / 3.12 / 3.13 and uv. You can then install from your local clone's root dir:
+To develop for Docling Core, you need Python 3.9 through 3.14 and the `uv` package. You can then install it from your local clone's root directory:
 ```bash
 uv sync --all-extras
 ```

docling_core/__init__.py

Lines changed: 0 additions & 5 deletions
@@ -1,6 +1 @@
-#
-# Copyright IBM Corp. 2024 - 2024
-# SPDX-License-Identifier: MIT
-#
-
 """Main package."""

docling_core/cli/view.py

Lines changed: 1 addition & 5 deletions
@@ -1,9 +1,5 @@
-#
-# Copyright IBM Corp. 2024 - 2024
-# SPDX-License-Identifier: MIT
-#
-
 """CLI for docling viewer."""
+
 import importlib
 import tempfile
 import webbrowser
Lines changed: 0 additions & 5 deletions
@@ -1,6 +1 @@
-#
-# Copyright IBM Corp. 2024 - 2025
-# SPDX-License-Identifier: MIT
-#
-
 """Experimental features."""
Lines changed: 255 additions & 0 deletions
@@ -0,0 +1,255 @@
+"""Define classes for DocTags serialization."""
+
+from enum import Enum
+from typing import Any, Final, Optional
+from xml.dom.minidom import parseString
+
+from pydantic import BaseModel
+from typing_extensions import override
+
+from docling_core.transforms.serializer.base import (
+    BaseDocSerializer,
+    BaseMetaSerializer,
+    BasePictureSerializer,
+    BaseTableSerializer,
+    SerializationResult,
+)
+from docling_core.transforms.serializer.common import create_ser_result
+from docling_core.transforms.serializer.doctags import (
+    DocTagsDocSerializer,
+    DocTagsParams,
+    DocTagsPictureSerializer,
+    DocTagsTableSerializer,
+    _get_delim,
+    _wrap,
+)
+from docling_core.types.doc import (
+    BaseMeta,
+    DescriptionMetaField,
+    DocItem,
+    DoclingDocument,
+    MetaFieldName,
+    MoleculeMetaField,
+    NodeItem,
+    PictureClassificationMetaField,
+    PictureItem,
+    SummaryMetaField,
+    TableData,
+    TabularChartMetaField,
+)
+from docling_core.types.doc.labels import DocItemLabel
+from docling_core.types.doc.tokens import DocumentToken
+
+DOCTAGS_VERSION: Final = "1.0.0"
+
+
+class IDocTagsTableToken(str, Enum):
+    """Class to represent an LLM friendly representation of a Table."""
+
+    CELL_LABEL_COLUMN_HEADER = "<column_header/>"
+    CELL_LABEL_ROW_HEADER = "<row_header/>"
+    CELL_LABEL_SECTION_HEADER = "<shed/>"
+    CELL_LABEL_DATA = "<data/>"
+
+    OTSL_ECEL = "<ecel/>"  # empty cell
+    OTSL_FCEL = "<fcel/>"  # cell with content
+    OTSL_LCEL = "<lcel/>"  # left looking cell,
+    OTSL_UCEL = "<ucel/>"  # up looking cell,
+    OTSL_XCEL = "<xcel/>"  # 2d extension cell (cross cell),
+    OTSL_NL = "<nl/>"  # new line,
+    OTSL_CHED = "<ched/>"  # - column header cell,
+    OTSL_RHED = "<rhed/>"  # - row header cell,
+    OTSL_SROW = "<srow/>"  # - section row cell
+
+
+class IDocTagsParams(DocTagsParams):
+    """DocTags-specific serialization parameters."""
+
+    do_self_closing: bool = True
+    pretty_indentation: Optional[str] = 2 * " "
+
+
+class IDocTagsMetaSerializer(BaseModel, BaseMetaSerializer):
+    """DocTags-specific meta serializer."""
+
+    @override
+    def serialize(
+        self,
+        *,
+        item: NodeItem,
+        **kwargs: Any,
+    ) -> SerializationResult:
+        """DocTags-specific meta serializer."""
+        params = IDocTagsParams(**kwargs)
+
+        elem_delim = ""
+        texts = (
+            [
+                tmp
+                for key in (
+                    list(item.meta.__class__.model_fields)
+                    + list(item.meta.get_custom_part())
+                )
+                if (
+                    (
+                        params.allowed_meta_names is None
+                        or key in params.allowed_meta_names
+                    )
+                    and (key not in params.blocked_meta_names)
+                    and (tmp := self._serialize_meta_field(item.meta, key))
+                )
+            ]
+            if item.meta
+            else []
+        )
+        if texts:
+            texts.insert(0, "<meta>")
+            texts.append("</meta>")
+        return create_ser_result(
+            text=elem_delim.join(texts),
+            span_source=item if isinstance(item, DocItem) else [],
+        )
+
+    def _serialize_meta_field(self, meta: BaseMeta, name: str) -> Optional[str]:
+        if (field_val := getattr(meta, name)) is not None:
+            if name == MetaFieldName.SUMMARY and isinstance(
+                field_val, SummaryMetaField
+            ):
+                txt = f"<summary>{field_val.text}</summary>"
+            elif name == MetaFieldName.DESCRIPTION and isinstance(
+                field_val, DescriptionMetaField
+            ):
+                txt = f"<description>{field_val.text}</description>"
+            elif name == MetaFieldName.CLASSIFICATION and isinstance(
+                field_val, PictureClassificationMetaField
+            ):
+                class_name = self._humanize_text(
+                    field_val.get_main_prediction().class_name
+                )
+                txt = f"<classification>{class_name}</classification>"
+            elif name == MetaFieldName.MOLECULE and isinstance(
+                field_val, MoleculeMetaField
+            ):
+                txt = f"<molecule>{field_val.smi}</molecule>"
+            elif name == MetaFieldName.TABULAR_CHART and isinstance(
+                field_val, TabularChartMetaField
+            ):
+                # suppressing tabular chart serialization
+                return None
+            # elif tmp := str(field_val or ""):
+            #     txt = tmp
+            elif name not in {v.value for v in MetaFieldName}:
+                txt = _wrap(text=str(field_val or ""), wrap_tag=name)
+            return txt
+        return None
+
+
+class IDocTagsPictureSerializer(DocTagsPictureSerializer):
+    """DocTags-specific picture item serializer."""
+
+    @override
+    def serialize(
+        self,
+        *,
+        item: PictureItem,
+        doc_serializer: BaseDocSerializer,
+        doc: DoclingDocument,
+        **kwargs: Any,
+    ) -> SerializationResult:
+        """Serializes the passed item."""
+        params = DocTagsParams(**kwargs)
+        res_parts: list[SerializationResult] = []
+        is_chart = False
+
+        if item.self_ref not in doc_serializer.get_excluded_refs(**kwargs):
+
+            if item.meta:
+                meta_res = doc_serializer.serialize_meta(item=item, **kwargs)
+                if meta_res.text:
+                    res_parts.append(meta_res)
+
+            body = ""
+            if params.add_location:
+                body += item.get_location_tokens(
+                    doc=doc,
+                    xsize=params.xsize,
+                    ysize=params.ysize,
+                    self_closing=params.do_self_closing,
+                )
+
+            # handle tabular chart data
+            chart_data: Optional[TableData] = None
+            if item.meta and item.meta.tabular_chart:
+                chart_data = item.meta.tabular_chart.chart_data
+            if chart_data and chart_data.table_cells:
+                temp_doc = DoclingDocument(name="temp")
+                temp_table = temp_doc.add_table(data=chart_data)
+                otsl_content = temp_table.export_to_otsl(
+                    temp_doc,
+                    add_cell_location=False,
+                    self_closing=params.do_self_closing,
+                    table_token=IDocTagsTableToken,
+                )
+                body += otsl_content
+            res_parts.append(create_ser_result(text=body, span_source=item))
+
+        if params.add_caption:
+            cap_res = doc_serializer.serialize_captions(item=item, **kwargs)
+            if cap_res.text:
+                res_parts.append(cap_res)
+
+        text_res = "".join([r.text for r in res_parts])
+        if text_res:
+            token = DocumentToken.create_token_name_from_doc_item_label(
+                label=DocItemLabel.CHART if is_chart else DocItemLabel.PICTURE,
+            )
+            text_res = _wrap(text=text_res, wrap_tag=token)
+        return create_ser_result(text=text_res, span_source=res_parts)
+
+
+class IDocTagsTableSerializer(DocTagsTableSerializer):
+    """DocTags-specific table item serializer."""
+
+    def _get_table_token(self) -> Any:
+        return IDocTagsTableToken
+
+
+class IDocTagsDocSerializer(DocTagsDocSerializer):
+    """DocTags document serializer."""
+
+    picture_serializer: BasePictureSerializer = IDocTagsPictureSerializer()
+    meta_serializer: BaseMetaSerializer = IDocTagsMetaSerializer()
+    table_serializer: BaseTableSerializer = IDocTagsTableSerializer()
+    params: IDocTagsParams = IDocTagsParams()
+
+    @override
+    def _meta_is_wrapped(self) -> bool:
+        return True
+
+    @override
+    def serialize_doc(
+        self,
+        *,
+        parts: list[SerializationResult],
+        **kwargs: Any,
+    ) -> SerializationResult:
+        """DocTags-specific document serializer."""
+        delim = _get_delim(params=self.params)
+        text_res = delim.join([p.text for p in parts if p.text])
+
+        if self.params.add_page_break:
+            page_sep = f"<{DocumentToken.PAGE_BREAK.value}{'/' if self.params.do_self_closing else ''}>"
+            for full_match, _, _ in self._get_page_breaks(text=text_res):
+                text_res = text_res.replace(full_match, page_sep)
+
+        wrap_tag = DocumentToken.DOCUMENT.value
+        text_res = f"<{wrap_tag}><version>{DOCTAGS_VERSION}</version>{text_res}{delim}</{wrap_tag}>"
+
+        if self.params.pretty_indentation and (
+            my_root := parseString(text_res).documentElement
+        ):
+            text_res = my_root.toprettyxml(indent=self.params.pretty_indentation)
+            text_res = "\n".join(
+                [line for line in text_res.split("\n") if line.strip()]
+            )
+        return create_ser_result(text=text_res, span_source=parts)
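
The last step of `serialize_doc` in the file above pretty-prints the wrapped DocTags string with `xml.dom.minidom` and then drops whitespace-only lines. A minimal standalone sketch of that step, using only the standard library, might look like the following. The helper name `pretty_print_xml` and the sample input string are illustrative assumptions and are not part of this commit.

```python
from xml.dom.minidom import parseString


def pretty_print_xml(xml_text: str, indent: str = "  ") -> str:
    """Indent an XML string and drop any whitespace-only lines.

    Hypothetical helper that mirrors the pretty-printing branch of
    serialize_doc; it is not a docling-core API.
    """
    root = parseString(xml_text).documentElement
    if root is None:
        # Nothing to indent; return the input unchanged.
        return xml_text
    pretty = root.toprettyxml(indent=indent)
    # toprettyxml can leave blank lines behind; keep only non-empty ones.
    return "\n".join(line for line in pretty.split("\n") if line.strip())


# Illustrative input only: a wrapped document with a version element,
# one text element, and a self-closing page break.
raw = "<doctag><version>1.0.0</version><text>Hello</text><page_break/></doctag>"
result = pretty_print_xml(raw)
print(result)
```

Self-closing tags matter here because `parseString` requires well-formed XML, so a bare `<page_break>` with no closing tag would raise a parsing error before any indentation happens.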

docling_core/search/__init__.py

Lines changed: 0 additions & 5 deletions
@@ -1,6 +1 @@
-#
-# Copyright IBM Corp. 2024 - 2024
-# SPDX-License-Identifier: MIT
-#
-
 """Package for models and utility functions for search database mappings."""

docling_core/search/json_schema_to_search_mapper.py

Lines changed: 1 addition & 5 deletions
@@ -1,9 +1,5 @@
-#
-# Copyright IBM Corp. 2024 - 2024
-# SPDX-License-Identifier: MIT
-#
-
 """Methods to convert a JSON Schema into a search database schema."""
+
 import re
 from copy import deepcopy
 from typing import Any, Optional, Pattern, Tuple, TypedDict

docling_core/search/mapping.py

Lines changed: 1 addition & 5 deletions
@@ -1,9 +1,5 @@
-#
-# Copyright IBM Corp. 2024 - 2024
-# SPDX-License-Identifier: MIT
-#
-
 """Methods to define fields in an index mapping of a search database."""
+
 from typing import Any, Optional


docling_core/search/meta.py

Lines changed: 1 addition & 5 deletions
@@ -1,9 +1,5 @@
-#
-# Copyright IBM Corp. 2024 - 2024
-# SPDX-License-Identifier: MIT
-#
-
 """Models and methods to define the metadata fields in database index mappings."""
+
 from pathlib import Path
 from typing import Generic, Optional, TypeVar
