Skip to content

Commit 1b915be

#70 simplify and document support for reading streams with mixed JSON and binary data
1 parent dfb4b70 commit 1b915be

4 files changed

Lines changed: 83 additions & 10 deletions


README.md

Lines changed: 71 additions & 2 deletions
@@ -11,7 +11,8 @@ Simple streaming JSON parser and encoder.
 When [reading](#reading) JSON data, `json-stream` can decode JSON data in
 a streaming manner, providing a pythonic dict/list-like interface, or a
 [visitor-based interface](#visitor). It can stream from files, [URLs](#urls)
-or [iterators](#iterators). It can process [multiple JSON documents](#multiple) in a single stream.
+or [iterators](#iterators). It can process [multiple JSON documents](#multiple)
+in a single stream, and can read JSON [mixed with other non-JSON data](#reading-mixed-data).
 
 When [writing](#writing) JSON data, `json-stream` can stream JSON objects
 as you generate them.
@@ -495,13 +496,81 @@ significant parsing speedup compared to pure python implementation.
 `json-stream` will fallback to its pure python tokenizer implementation
 if `json-stream-rs-tokenizer` is not available.
 
+#### <a id="reading-mixed-data"></a> Reading mixed data
+
+When using the Rust tokenizer, you can also use `json-stream` to parse mixed
+data, for example a file containing a JSON document followed by binary data.
+
+To do this, you should pass `correct_cursor=True` to `load()`. This ensures the
+Rust tokenizer keeps track of the exact stream position it has read up to, which
+comes with a **significant performance cost** for un-seekable streams.
+
+After reading the JSON data, call `read_all()` on the top-level object returned
+by `load()` to ensure you are at the end of the JSON data, and then call
+`.tokenizer.park_cursor()` to "park" the underlying file cursor at the correct
+position.
+
+```python
+import json_stream
+
+with open('test.bin', 'rb') as f:
+    # read JSON header
+    header = json_stream.load(f, correct_cursor=True)
+    # ... process JSON header ...
+    header.read_all()
+
+    # ensure the tokenizer has "parked" the file
+    # cursor at the end of the JSON data
+    header.tokenizer.park_cursor()
+
+    # now we can read binary data from the same file
+    binary_start = f.tell()
+    data = f.read()
+```
+
+#### <a id="mixed-scenarios"></a> Other mixed data scenarios
+
+`json-stream` can also handle streams that start with binary data, or have binary
+data between multiple JSON documents.
+
+##### Binary then JSON
+
+You can simply read the binary data from the file before calling `load()`.
+
+```python
+with open('test.bin', 'rb') as f:
+    binary_data = f.read(1024)
+    data = json_stream.load(f)
+    # ... process JSON ...
+```
+
+##### JSON then binary then JSON
+
+You must use `correct_cursor=True` for any JSON document that is followed by
+binary data.
+
+```python
+with open('test.bin', 'rb') as f:
+    # 1. Read first JSON
+    data1 = json_stream.load(f, correct_cursor=True)
+    # ... process data1 ...
+    data1.read_all()
+    data1.tokenizer.park_cursor()
+
+    # 2. Read binary data
+    binary_data = f.read(1024)
+
+    # 3. Read second JSON
+    data2 = json_stream.load(f)
+    # ... process data2 ...
+```
+
 ### Custom tokenizer
 
 You can supply an alternative JSON tokenizer implementation. Simply pass
 a tokenizer to the `load()` or `visit()` methods.
 
 ```python
-json_stream.load(f, tokenizer=some_tokenizer)
+json_stream.load(f, tokenizer=some_tokenizer, tokenizer_kwargs=...)
 ```
 
 The requests methods also accept a custom tokenizer parameter.
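For readers writing their own tokenizer: judging from how `load_many()` consumes the token stream in the `src/json_stream/loader.py` diff of this commit, a tokenizer is a callable that takes the file object (plus any forwarded `tokenizer_kwargs`) and yields `(token_type, token)` pairs. A minimal standalone sketch, using hypothetical token-type constants and a made-up `lowercase_keys` keyword (neither is json-stream's real API):

```python
import io

# Hypothetical token-type constants; json-stream defines its own TokenType.
OPERATOR, STRING, NUMBER = range(3)

def toy_tokenizer(fp, lowercase_keys=False):
    # `lowercase_keys` stands in for any keyword forwarded via tokenizer_kwargs.
    # A real tokenizer also lexes strings and numbers; this toy only emits
    # structural operator tokens, one character at a time.
    for ch in iter(lambda: fp.read(1), ''):
        if ch in '{}[]:,':
            yield OPERATOR, ch

tokens = list(toy_tokenizer(io.StringIO('{}')))
```

A callable of this shape could then be supplied as `json_stream.load(f, tokenizer=toy_tokenizer)`.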

src/json_stream/base.py

Lines changed: 4 additions & 0 deletions
@@ -36,6 +36,10 @@ def __init__(self, token_stream):
         self._stream = token_stream
         self._child: Optional[StreamingJSONBase] = None
 
+    @property
+    def tokenizer(self):
+        return self._stream
+
     def _clear_child(self):
         if self._child is not None:
             self._child.read_all()
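The four added lines can be exercised in isolation. A simplified stand-in class (not the real `StreamingJSONBase`) shows the contract the README relies on when it calls `header.tokenizer.park_cursor()`: the returned object exposes its underlying token stream through a read-only property.

```python
class StreamingJSONBaseSketch:
    """Simplified stand-in for StreamingJSONBase, for illustration only."""

    def __init__(self, token_stream):
        self._stream = token_stream

    @property
    def tokenizer(self):
        # Read-only access to the underlying tokenizer / token stream.
        return self._stream

stream = object()
obj = StreamingJSONBaseSketch(stream)
result = obj.tokenizer
```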

src/json_stream/loader.py

Lines changed: 4 additions & 4 deletions
@@ -3,13 +3,13 @@
 from json_stream.select_tokenizer import default_tokenizer
 
 
-def load(fp_or_iterable, persistent=False, tokenizer=default_tokenizer):
-    return next(load_many(fp_or_iterable, persistent, tokenizer))
+def load(fp_or_iterable, persistent=False, tokenizer=default_tokenizer, **tokenizer_kwargs):
+    return next(load_many(fp_or_iterable, persistent, tokenizer, **tokenizer_kwargs))
 
 
-def load_many(fp_or_iterable, persistent=False, tokenizer=default_tokenizer):
+def load_many(fp_or_iterable, persistent=False, tokenizer=default_tokenizer, **tokenizer_kwargs):
     fp = ensure_file(fp_or_iterable)
-    token_stream = tokenizer(fp)
+    token_stream = tokenizer(fp, **tokenizer_kwargs)
     for token_type, token in token_stream:
         if token_type == TokenType.OPERATOR:
             data = StreamingJSONBase.factory(token, token_stream, persistent)
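The kwargs plumbing above can be sketched with a fake tokenizer (hypothetical names throughout): keyword arguments given to `load()` reach the tokenizer factory unchanged, which is how `correct_cursor=True` in the README examples gets through to the Rust tokenizer.

```python
received = {}

def fake_tokenizer(fp, **kwargs):
    # Record whatever keyword arguments the loader forwards to us.
    received.update(kwargs)
    return iter(())  # no tokens; a real tokenizer yields (token_type, token)

def load_many_sketch(fp, tokenizer, **tokenizer_kwargs):
    # Simplified stand-in for load_many(): forward **tokenizer_kwargs verbatim.
    token_stream = tokenizer(fp, **tokenizer_kwargs)
    for token_type, token in token_stream:
        yield token

# Driving the generator calls the tokenizer with the forwarded kwargs.
list(load_many_sketch(None, fake_tokenizer, correct_cursor=True))
```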

src/json_stream/visitor.py

Lines changed: 4 additions & 4 deletions
@@ -20,9 +20,9 @@ def _visit(obj, visitor, path):
         visitor(obj, path)
 
 
-def visit_many(fp_or_iterator, visitor, tokenizer=default_tokenizer):
+def visit_many(fp_or_iterator, visitor, tokenizer=default_tokenizer, **tokenizer_kwargs):
     fp = ensure_file(fp_or_iterator)
-    token_stream = tokenizer(fp)
+    token_stream = tokenizer(fp, **tokenizer_kwargs)
     for token_type, token in token_stream:
         if token_type == TokenType.OPERATOR:
             obj = StreamingJSONBase.factory(token, token_stream, persistent=False)
@@ -33,5 +33,5 @@ def visit_many(fp_or_iterator, visitor, tokenizer=default_tokenizer):
         yield
 
 
-def visit(fp_or_iterator, visitor, tokenizer=default_tokenizer):
-    next(visit_many(fp_or_iterator, visitor, tokenizer))
+def visit(fp_or_iterator, visitor, tokenizer=default_tokenizer, **tokenizer_kwargs):
+    next(visit_many(fp_or_iterator, visitor, tokenizer, **tokenizer_kwargs))
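As a toy illustration of the visitor contract, a standalone sketch over plain Python objects (not json-stream's streaming implementation): each terminal value is reported together with a tuple path, mirroring the `visitor(obj, path)` call in the hunk above.

```python
def toy_visit(obj, visitor, path=()):
    # Recurse into containers, extending the path; call the visitor only
    # on terminal (non-container) values.
    if isinstance(obj, dict):
        for key, value in obj.items():
            toy_visit(value, visitor, path + (key,))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            toy_visit(value, visitor, path + (index,))
    else:
        visitor(obj, path)

seen = []
toy_visit({"a": [1, 2]}, lambda value, path: seen.append((value, path)))
```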
