Simple streaming JSON parser and encoder.
When [reading](#reading) JSON data, `json-stream` can decode JSON data in
a streaming manner, providing a pythonic dict/list-like interface, or a
[visitor-based interface](#visitor). It can stream from files, [URLs](#urls)
or [iterators](#iterators). It can process [multiple JSON documents](#multiple)
in a single stream, and can read JSON [mixed with other non-JSON data](#reading-mixed-data).

When [writing](#writing) JSON data, `json-stream` can stream JSON objects
as you generate them.
…significant parsing speedup compared to the pure python implementation.
`json-stream` will fall back to its pure python tokenizer implementation
if `json-stream-rs-tokenizer` is not available.

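Because the fallback is silent, it can be useful to check whether the Rust tokenizer is actually installed. A minimal sketch, assuming the optional package's import name follows its package name (`json_stream_rs_tokenizer`):

```python
# Probe for the optional Rust tokenizer package.
# The module name below is an assumption based on the package name.
try:
    import json_stream_rs_tokenizer  # noqa: F401
    rust_tokenizer_available = True
except ImportError:
    rust_tokenizer_available = False

print("Rust tokenizer available:", rust_tokenizer_available)
```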
#### <a id="reading-mixed-data"></a> Reading mixed data

When using the Rust tokenizer, you can also use `json-stream` to parse mixed
data, for example a file containing a JSON document followed by binary data.

To do this, pass `correct_cursor=True` to `load()`. This ensures the
Rust tokenizer keeps track of the exact stream position it has read up to,
which comes with a **significant performance cost** for un-seekable streams.

After reading the JSON data, call `read_all()` on the top-level object returned
by `load()` to ensure you are at the end of the JSON data, and then call
`.tokenizer.park_cursor()` to "park" the underlying file cursor at the correct
position.

```python
import json_stream

with open('test.bin', 'rb') as f:
    # read JSON header
    header = json_stream.load(f, correct_cursor=True)
    # ... process JSON header ...
    header.read_all()

    # ensure the tokenizer has "parked" the file
    # cursor at the end of the JSON data
    header.tokenizer.park_cursor()

    # now we can read binary data from the same file
    binary_start = f.tell()
    data = f.read()
```

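For contrast, when the whole stream is small enough to buffer in memory, the standard library can locate the end of a JSON prefix without `json-stream` at all: `json.JSONDecoder.raw_decode()` returns the index at which the document ends. A stdlib-only sketch (the payload below is made up for illustration):

```python
import json

# A made-up payload: a JSON header followed by raw binary data.
payload = b'{"version": 1, "length": 4}\x00\x01\x02\x03'

# latin-1 maps each byte to one character, so string offsets
# remain valid byte offsets into the original payload.
text = payload.decode('latin-1')

# raw_decode() returns (object, end_index) and tolerates trailing data.
header, end = json.JSONDecoder().raw_decode(text)

binary_data = payload[end:]
print(header)       # {'version': 1, 'length': 4}
print(binary_data)  # b'\x00\x01\x02\x03'
```

Unlike `json-stream`, this requires buffering the entire stream, so it only suits small inputs.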
#### <a id="mixed-scenarios"></a> Other mixed data scenarios

`json-stream` can also handle streams that start with binary data, or have binary
data between multiple JSON documents.

##### Binary then JSON

You can simply read the binary data from the file before calling `load()`.

```python
import json_stream

with open('test.bin', 'rb') as f:
    binary_data = f.read(1024)
    data = json_stream.load(f)
    # ... process JSON ...
```

##### JSON then binary then JSON

You must use `correct_cursor=True` for any JSON document that is followed by
binary data.

```python
import json_stream

with open('test.bin', 'rb') as f:
    # 1. Read first JSON
    data1 = json_stream.load(f, correct_cursor=True)
    # ... process data1 ...
    data1.read_all()
    data1.tokenizer.park_cursor()

    # 2. Read binary data
    binary_data = f.read(1024)

    # 3. Read second JSON
    data2 = json_stream.load(f)
    # ... process data2 ...
```

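The same fully-buffered, stdlib-only approach shown earlier extends to this layout, provided the length of the binary segment is known in advance (4 bytes in this made-up example):

```python
import json

# Made-up payload: JSON, then a 4-byte binary segment, then more JSON.
payload = b'{"first": true}' + b'\xde\xad\xbe\xef' + b'{"second": true}'
text = payload.decode('latin-1')  # 1:1 byte-to-char mapping preserves offsets

decoder = json.JSONDecoder()
first, end = decoder.raw_decode(text)          # first document
binary_data = payload[end:end + 4]             # known-length binary segment
second, _ = decoder.raw_decode(text, end + 4)  # second document

print(first, binary_data, second)
```

Again, this only works when the whole stream fits in memory; for large or un-seekable streams, use `correct_cursor=True` as shown above.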
### Custom tokenizer

You can supply an alternative JSON tokenizer implementation. Simply pass
a tokenizer to the `load()` or `visit()` methods.

```python
json_stream.load(f, tokenizer=some_tokenizer, tokenizer_kwargs=...)
```

The requests methods also accept a custom tokenizer parameter.