Add an openai based parser for the saami pdf files #13
|
Thanks for creating this, good time to have the discussion. I'm personally a fan of leveraging AI models to create simple scripts and programs, and it's possible that this might be a good fit. The trouble of course is that if SAAMI decides to change their PDF format, then this would no longer work. Also, we humans are still responsible for reviewing the results of the script each time for correctness. So naturally I wonder if it's easier to simply spend the human time/energy on manually adding the details from SAAMI when they make changes (roughly once per year). Is that more or less time spent than running the script, and reviewing all of the results for correctness, and possibly having to make changes to the script? |
|
Yeah, there are definitely some tradeoffs here. I think this code should be fairly agnostic to the exact PDF layout. Some of the cartridge pages in the document already differ from each other and the AI seems to be pretty good at "reading" them. That said, there are a couple of hard-coded bits in the PR, like where the pages for cartridges can be found in the PDF. That could probably be improved, but I didn't want to spend too much time on it up front. It would also not be too much work for a human to provide these as arguments if the PDF is only updated once a year. My general attitude is that since there are very few details in the JSON files in the repo right now, providing more would be an improvement that can be built on. I can set aside some time to manually review the JSON output in this PR against the PDF to make sure the values are reasonable. I would assume that changes to existing published cartridge specs are rare (if they happen at all?), so then we have a baseline to add to when the PDF is updated with new cartridges. |
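As a rough illustration of passing the cartridge page locations as arguments instead of hard coding them, here is a minimal Python sketch. The `--pages` flag, the `parse_page_range` helper and the `START-END` format are all hypothetical; nothing here is taken from the PR's actual code.

```python
import argparse


def parse_page_range(text):
    """Parse 'START-END' (1-based, inclusive) into a range of page numbers."""
    start_text, sep, end_text = text.partition("-")
    if not sep:
        raise argparse.ArgumentTypeError("expected START-END, e.g. 10-20")
    try:
        start, end = int(start_text), int(end_text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"non-numeric page range: {text!r}")
    if start < 1 or end < start:
        raise argparse.ArgumentTypeError(f"invalid page range: {text!r}")
    return range(start, end + 1)


def build_parser():
    # Hypothetical command line: the PDF path plus the pages holding cartridge drawings.
    parser = argparse.ArgumentParser(
        description="Extract cartridge data from a SAAMI PDF"
    )
    parser.add_argument("pdf", help="path to the SAAMI PDF")
    parser.add_argument(
        "--pages",
        type=parse_page_range,
        required=True,
        help="cartridge pages as START-END (1-based, inclusive)",
    )
    return parser
```

With something like this, whoever runs the script after a yearly PDF update would only need to look up the page range once and pass it in.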
|
I finally had a moment to review the output JSON. I've updated the file to match the PDF. The numbers should now be correct, and I adjusted the names to be a bit less screamy as well. The OpenAI output was probably less than 50% correct. It has a tendency to round the numbers (and it seems to pick up the diameter from the cartridge name instead of from the drawing). |
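Part of this kind of review could be mechanized once a hand-verified baseline exists. The following is a minimal Python sketch, assuming each record is a dict with the hypothetical keys `name`, `caliber` and `coal`; the function name and record layout are illustrative only. It reports every numeric field where the extracted value differs from the verified one, which is where rounded numbers would show up.

```python
import math


def find_mismatches(extracted, verified, fields=("caliber", "coal"), tol=1e-9):
    """Return (name, field, extracted_value, verified_value) for each difference."""
    # Match records case-insensitively, since extracted names may be upper case.
    verified_by_name = {rec["name"].casefold(): rec for rec in verified}
    mismatches = []
    for rec in extracted:
        ref = verified_by_name.get(rec["name"].casefold())
        if ref is None:
            # No verified record with this name at all.
            mismatches.append((rec["name"], "name", rec["name"], None))
            continue
        for field in fields:
            got, want = rec.get(field), ref.get(field)
            if got is None or want is None or not math.isclose(got, want, abs_tol=tol):
                mismatches.append((rec["name"], field, got, want))
    return mismatches
```

This only helps after a human has verified the baseline against the PDF; it does not replace that first review.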
I made an attempt at parsing this report:
https://saami.org/wp-content/uploads/2023/11/ANSI-SAAMI-Z299.4-CFR-Approved-2015-12-14-Posting-Copy.pdf
It contains center fire rifle cartridges. Since the PDFs have drawings in them and differ a bit, I figured this was a good opportunity to learn some OpenAI.
The current version will do the following:
See the final output in saami.json. It's not too bad considering the input data. However, I have not done an in-depth review to verify it against the source material. I'm also only extracting the main data: name, caliber and coal. I figure that's a good start.
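For illustration, the three extracted fields could be modeled as a small typed record. This is a Python sketch under assumptions: the layout of saami.json is not shown here, so the top-level list of objects, the key names and the `Cartridge` class are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Cartridge:
    name: str
    caliber: float  # unit depends on what the parser emits
    coal: float     # same caveat as caliber

    @classmethod
    def from_dict(cls, data):
        # Raises KeyError on a missing field and ValueError on a non-numeric value.
        return cls(
            name=str(data["name"]).strip(),
            caliber=float(data["caliber"]),
            coal=float(data["coal"]),
        )


def load_cartridges(text):
    """Parse JSON text, assumed to be a list of objects, into Cartridge records."""
    return [Cartridge.from_dict(item) for item in json.loads(text)]
```

Loading the output this way would make malformed model output (a missing field, a number returned as free text) fail loudly rather than slip into the file.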