Release v0.5 #276
Merged
160 commits
4bab5e5
Add preliminary reference corpus classes to consolidate existing logic
rlskoeser 8295b4d
Add methods to compile and save metadata; add other corpus
rlskoeser 0b8a211
Document new fields for reference poetry corpora paths
rlskoeser a15cfae
A minimal compile dataset script; reference corpus metadata only for now
rlskoeser 1adeb26
Add shared base class for local text reference corpora
rlskoeser d9e22de
Refactor ref_corpus metadata to always return a polars dataframe
rlskoeser 0b991fd
Add unit tests for internet poems reference corpus code
rlskoeser f8cc7b3
Add unit tests for OtherPoems and all/fulltext convenience methods
rlskoeser 62a11bb
Add tests for chadwyck-healey ref corpus class
rlskoeser 7985a3d
Add test for compile metadata method
rlskoeser b05d5ef
Update so tests don't pass due to local config + data
rlskoeser 8adc69e
Use shadow-dataset paths by default, with allowed override
rlskoeser 056c342
Fix metadata path configuration
rlskoeser e9f5430
Add test for save_poem_metadata method
rlskoeser 83656fe
Update example config to match changed name for compiled dataset
rlskoeser 1ca2aa4
Update sample_config.yml
rlskoeser 9f0cd1d
Update src/corppa/poetry_detection/ref_corpora.py
rlskoeser 95b9249
Add PPA work-level methods & unit tests (#218)
laurejt 1f66cc1
Add tests for config error handling
rlskoeser 76157e2
Add ref_corpora module to sphinx docs
rlskoeser c8315f7
Simplify config file for ref corpora
rlskoeser e170c1e
Simplify polars metadata schema handling per @laurejt review
rlskoeser ff47e0c
Handle default config values & overrides
rlskoeser e0d6d92
Update tests work with revised configuration behavior
rlskoeser d52d730
Update InternetPoems meta to support reading from tar file
rlskoeser 02c3c70
Raise NotImplementedError if you try to generate text from a tar file
rlskoeser f2d45ad
Update example configuration
rlskoeser 2e0b2bd
Remove conditional path logic in text fixture that's no longer needed
rlskoeser cda9eaf
Test relative path config and missing ingredient dir error
rlskoeser a2a6f38
Refactor merge_excerpts so it can be used in compile dataset
rlskoeser d140fd0
Build out more of the compile-dataset script
rlskoeser dc1ee7d
Add ppa works to compilation; move validation & path to compilation
rlskoeser 482ab3f
Add note about compressing excerpts
rlskoeser 92c1394
Attempt to add compression step (not working 😕)
rlskoeser 6e990e9
Update save_poem_metadata test to supply an output file parameter
rlskoeser 08a9fb5
Fix excerpt compression
rlskoeser 99befe4
Merge branch 'release/0.4' into develop
rlskoeser d3ad10b
Update develop version to 0.5-dev0 for next release
rlskoeser 68d7e8d
Merge branch 'main' into develop
rlskoeser 03a6882
Added marimo support
laurejt 739d8ba
Fixed polar utils to work w/ current excerpt data
laurejt ae6c477
Added marimo notebook for EoP
laurejt 7f7886e
Merge pull request #229 from Princeton-CDH/feature/compile-dataset
rlskoeser 3e106c3
Drop python 3.11 from pyproject and unit text matrix; add 3.13
rlskoeser 22ed5fe
Remove conditional dependency for python 3.11 support
rlskoeser 1470343
Update BioPython args per deprecation warnings
rlskoeser 579435a
Switch unit tests workflow to uv for python management
rlskoeser fa0b9ec
Add pre-commit to check github actions are valid before committing
rlskoeser cf897a0
Fix formatting in github action workflow file
rlskoeser 7fb28cb
Minor cleanup from pre-commit hook run
rlskoeser b6c1dab
Make test assertion order-independent
rlskoeser eab628e
Switch notebook check workflow to uv as well
rlskoeser b995f6e
Document changes in this pr
rlskoeser 79944b6
Merge pull request #245 from Princeton-CDH/feature/drop-3.11
rlskoeser cfd9be5
Rename unit test folder to the more standard tests
rlskoeser 3b0e28e
Merge pull request #244 from Princeton-CDH/feature/rename-tests
rlskoeser e852509
Make compile dataset and sample config agree on field names
rlskoeser 668c5f1
Configure compile dataset as a package script
rlskoeser cb7520f
Merge branch 'develop' into feature/ref-corpus-data
rlskoeser 60917e4
Exclude untested portions of gvision ocr script from coverage report
rlskoeser 2961594
Exclude more untested code from coverage reports
rlskoeser 360f456
Merge pull request #228 from Princeton-CDH/feature/ref-corpus-data
rlskoeser 65599e4
Merge branch 'develop' into feature/eop-play
rlskoeser 5622e70
Update & reconcile metadata field names in polars utils and tests
rlskoeser 82fa8dc
Specify biopython version for change in parameter names
rlskoeser d088a22
Clean up polars utils, reduce redundancy, update tests
rlskoeser 38d5df3
Update notebook to use revised polars utils & config
rlskoeser 055c225
Don't rename ppa metadata fields when compiling dataset
rlskoeser be020a6
Merge all spans when ppa span start/end matches exactly
rlskoeser c790b94
Merge equivalent spans regardless of method or poem id
rlskoeser fe0c29b
Fix some typos
rlskoeser 95be347
Rename/simply config variable for compiled dataset data dir
rlskoeser 36f505c
Add a test case for subsetting/selecting fields when loading metadata
rlskoeser 4013af9
Merge pull request #254 from Princeton-CDH/feature/eop-play
rlskoeser 400a08e
Add work-level totals for excerpts, poem ids when compiling dataset
rlskoeser d5716b2
Convert ppa collections to list when loading as dataframe
rlskoeser 6ed19f2
Script to subset excerpts for exploration/analysis (wip)
rlskoeser 5854a4c
Add & populate alt_poem_ids field for merged excerpts
rlskoeser 0423fd4
Merge branch 'develop' into feature/merge-exact-spans
rlskoeser 2d5fedc
Preliminary notebook for reviewing merged excerpts
rlskoeser 7628f4e
Prioritize longer passim matches when merging
rlskoeser 31b73e2
Update poem meta method to work with alt poem ids
rlskoeser b7167e6
Expand notebook to look at poem ids that are collapsed
rlskoeser 8158807
Update merge logic documentation
rlskoeser c18d175
Reconcile local and molab versions of excerpt viewer notebook
rlskoeser 5740ab6
Expand note on passim match sorting; add note for poem meta suffix arg
rlskoeser e9b90a2
Merge pull request #256 from Princeton-CDH/feature/merge-exact-spans
rlskoeser ae8dbc9
Merge pull request #259 from Princeton-CDH/feature/annotation-viewer
rlskoeser d85e568
Merge branch 'develop' into feature/aggregate-counts
rlskoeser 9558236
Add aggregate excerpt/work counts to compiled poem/ppa metadata
rlskoeser 9b95a6b
Document subset excerpt script logic & use
rlskoeser 6afe22d
Add experimental grist import script for documentation purposes
rlskoeser fe96ff0
Merge pull request #262 from Princeton-CDH/feature/foundpoems-to-grist
rlskoeser 693afe2
Update test for change to ppa works collection field as list
rlskoeser 0b1f231
Merge pull request #261 from Princeton-CDH/feature/subset-excerpts
rlskoeser 3f1fc9e
Add notebook & session data analyzing percent of PPA detected as poetry
rlskoeser 6d24609
See if marimo.ui.chart makes chart render in static version
rlskoeser 3004d52
Use mo.ui.chart so all charts render in molab static preview
rlskoeser b50910b
Add brief explanatory text to the sections
rlskoeser 29d7f1a
Calculate & include poem length when compiling poem metadata
rlskoeser b670cf9
Update notebooks/ppa-percent-poetry.py
rlskoeser 57e4c98
Add support for .tar.gz ref corpus for loading & generating counts
rlskoeser 9dac448
Update tests & default config for change in ref corpora text path option
rlskoeser 2888ddb
Update unit tests for change to poem metadata aggregate info
rlskoeser d60648a
Add a note explaining why we don't have a ppa author aggregate count
rlskoeser c89a78a
Update percent poetry analysis notebook to address spans, add stats
rlskoeser e6dac5d
Remove unused normalize toggle
rlskoeser 04b101d
Adjust logic for running all steps in sequence
rlskoeser 5dab476
Merge pull request #265 from Princeton-CDH/feature/ppa-percent-poetry
rlskoeser fd5700c
Test poetry excerpt aggregation logic
rlskoeser 4b107ac
Test poem length calculation
rlskoeser e8088ac
Test unsupported path error for get_text_corpus
rlskoeser ff192e7
Refactor duplicate poem length calculation and test explicitly
rlskoeser 7cb05c0
Use test config everywhere to avoid overwriting real data with tests
rlskoeser f38869c
Unit tests for compile dataset methods
rlskoeser 800601b
Refactor main method to simplify testing and add unit test
rlskoeser 2291972
Add test for compress file method
rlskoeser 651612d
Add docstrings to methods
rlskoeser da0d863
Fix arg handling for running as a script
rlskoeser 08735c9
Clean up based on PR review
rlskoeser 2c96404
Merge pull request #260 from Princeton-CDH/feature/aggregate-counts
rlskoeser a13332a
Include poem length measurements when loading poem metadata
rlskoeser 68b7d23
Exploration of poem and excerpt length #264
rlskoeser 559e2c2
See if alt.layer works for compound charts in marimo static preview
rlskoeser 1a605c2
Adjust layered charts to render in preview; display most quoted poems
rlskoeser 5cb38e1
Refactor poem excerpt length charts, add graph for percent of poem
rlskoeser 46cab33
Fix logic for first ppa appearance, add longest poem for each decade
rlskoeser d48e796
Add a comment noting where custom box plot was adapted from
rlskoeser 27e325e
Check and filter excerpts by poet birth year & PPA publication year
rlskoeser 6efc35b
Merge pull request #266 from Princeton-CDH/feature/poem-excerpt-length
rlskoeser 074661f
new notebook for exploring wikidata-reconciled poetry metadata (needs…
WHaverals 5833bc4
Method to find overlapping excerpts; reusable logic for merging groups
rlskoeser 317ca92
Add unit tests for method for identify overlapping excerpts
rlskoeser a617b0b
Apply suggestions from code review
rlskoeser 4ff6ca8
Fix typo in comment
rlskoeser 4b83c42
Simplify logic for filtering overlapping excerpt pairs
rlskoeser 4b519b1
Clean up & clarify comments for merge, remove unneeded code & sort
rlskoeser 8592b93
Split code docs out into one file per top-level module
rlskoeser bbf5e38
Add core objects to docs & document fields
rlskoeser d23bedc
Improve docs for merge excerpts code
rlskoeser 6339d5b
Update years in docs copyright statement
rlskoeser 99ca067
Fix formatting in docstring
rlskoeser 757935b
Apply suggestions from code review
rlskoeser 33be42e
Add license document to sphinx docs
rlskoeser 5f9e59d
Clean up & additional test cases per @laurejt review
rlskoeser a0d336c
Merge pull request #272 from Princeton-CDH/feature/find-overlapping-e…
rlskoeser 7ceffc2
Revise found poems compilation in preparation for publishing v0.5 (#274)
rlskoeser 8b68381
Set version to 0.5 and update changelog
rlskoeser ead871c
Expand changelog and remove outdated todo
rlskoeser 5f75b75
Add author information to pyproject.
rlskoeser 1babbac
Update CHANGELOG.md
rlskoeser 783af00
Update CHANGELOG.md
rlskoeser e0d2bf7
Update .pre-commit-config.yaml
rlskoeser feeb8b8
Update CHANGELOG.md
rlskoeser 5d9342d
Add more details to 0.5 changelog; add dates & links for all versions
rlskoeser 323a1bc
Revise version & date format for readability
rlskoeser 728f773
Add tests for revised run_passim default arguments
rlskoeser bd7be8b
Add unit test for text corpus from tarfile to testing directly
rlskoeser cafa1a6
Remove python 3.11 specific code for walking directories
rlskoeser 27f5f45
Remove use of os.path from path utils
laurejt
| Diff line change |
|---|
| @@ -0,0 +1,7 @@ |
| --- |
| orphan: true |
| --- |
| ``` |
| {include} ../../LICENSE.md |
| ``` |
| |
This file was deleted.
| Diff line change |
|---|
| @@ -0,0 +1,30 @@ |
| Annotation |
| ########## |
| |
| Data Preparation |
| ================ |
| |
| Preliminary Page Set Creation |
| ------------------------------ |
| .. automodule:: corppa.poetry_detection.annotation.create_pageset |
| .. Note: not including members for method docs, only top-level script usage |
| |
| Add Metadata |
| ------------ |
| .. automodule:: corppa.poetry_detection.annotation.add_metadata |
| .. Note: not including members for method docs, only top-level script usage |
| |
| Annotation Recipes |
| ================== |
| .. automodule:: corppa.poetry_detection.annotation.annotation_recipes |
| .. Note: not including members for method docs, only top-level script usage |
| |
| Command Recipes |
| =============== |
| .. automodule:: corppa.poetry_detection.annotation.command_recipes |
| .. Note: not including members for method docs, only top-level script usage |
| |
| Process Adjudication Data |
| ========================= |
| .. automodule:: corppa.poetry_detection.annotation.process_adjudication_data |
| .. Note: not including members for method docs, only top-level script usage |
| Diff line change |
|---|
| @@ -0,0 +1,10 @@ |
| Code Documentation |
| ################## |
| |
| .. toctree:: |
| :maxdepth: 2 |
| |
| ocr |
| utils |
| annotation |
| poetry-detection |
| Diff line change |
|---|
| @@ -0,0 +1,12 @@ |
| OCR |
| ### |
| |
| .. automodule:: corppa.ocr.gvision_ocr |
| :members: |
| |
| |
| Collate Texts |
| ============= |
| .. automodule:: corppa.ocr.collate_txt |
| .. Note: not including the members for the method docs, *but* we should we |
| .. make the top-level comment better. |
| Diff line change |
|---|
| @@ -0,0 +1,29 @@ |
| Poetry Detection |
| ################ |
| |
| Core objects |
| ============ |
| |
| .. automodule:: corppa.poetry_detection.core |
| :members: |
| |
| Reference Corpora |
| ================= |
| .. automodule:: corppa.poetry_detection.ref_corpora |
| :members: |
| |
| |
| |
| Scripts |
| ======= |
| |
| refmatcha |
| --------- |
| |
| .. automodule:: corppa.poetry_detection.refmatcha |
| |
| Merge excerpts |
| -------------- |
| |
| .. automodule:: corppa.poetry_detection.merge_excerpts |
| :members: |
| Diff line change |
|---|
| @@ -0,0 +1,27 @@ |
| Utils |
| ##### |
| |
| Filter Utility |
| ============== |
| .. automodule:: corppa.utils.filter |
| .. Note: not including members for method docs, only top-level script usage |
| |
| Path Utilities |
| ============== |
| .. automodule:: corppa.utils.path_utils |
| :members: |
| |
| Generate PPA Page Set |
| ===================== |
| .. automodule:: corppa.utils.generate_page_set |
| .. Note: not including members for method docs, only top-level script usage |
| |
| Add Image (Relative) Paths |
| ========================== |
| .. automodule:: corppa.utils.add_image_relpaths |
| .. Note: not including members for method docs, only top-level script usage |
| |
| Build Text Corpus |
| ================= |
| .. automodule:: corppa.utils.build_text_corpus |
| .. Note: not including members for method docs, only top-level script usage |