22 fetch metadata from apis by ClaireHzl · Pull Request #24 · dataforgoodfr/14_EUFactForce

ClaireHzl · 2026-03-16T13:09:33Z

Script that retrieves metadata for a specific article using various APIs based on its DOI and downloads its PDF if it is open access.


cgoudet

Thanks for this work!

Everything should be moved to "prod" rather than left in exploration.

cgoudet · 2026-03-23T08:10:33Z

eu_fact_force/ingestion/data_collection/parsers/arxiv.py

Since we are going to use them in prod, these parsers should live in the prod section of the project, not the exploration section.

cgoudet · 2026-03-23T08:12:05Z

eu_fact_force/ingestion/data_collection/parsers/arxiv.py

+        return [article.pdf_url] if article else []
+
+
+if __name__ == "__main__":


This would be better as an integration test, but add a skip so that it is never run in CI.
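One way to follow this suggestion is sketched below with the standard-library `unittest` module (with pytest, `pytest.mark.skipif` plays the same role). The environment variable `RUN_INTEGRATION_TESTS` and the test name are assumptions made for illustration, and the test body is a placeholder, since the parser's methods are not shown in this PR.

```python
import os
import unittest

# Assumption: integration tests are opted into with an environment variable
# that CI never sets, so they only run when launched by hand.
RUN_INTEGRATION = os.environ.get("RUN_INTEGRATION_TESTS") == "1"


@unittest.skipUnless(
    RUN_INTEGRATION,
    "hits the live arXiv API; set RUN_INTEGRATION_TESTS=1 to run",
)
class ArxivParserIntegrationTest(unittest.TestCase):
    def test_parser_is_importable(self):
        # Imported inside the test so that merely collecting this module
        # never requires the project to be installed.
        from eu_fact_force.ingestion.data_collection.parsers.arxiv import (
            ArxivMetadataParser,
        )

        # Placeholder assertion: a real test would query a known arXiv DOI
        # and check the returned PDF URLs.
        self.assertTrue(callable(ArxivMetadataParser))
```

Running `RUN_INTEGRATION_TESTS=1 python -m unittest` locally would execute the test; without the variable it is reported as skipped.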

cgoudet · 2026-03-23T08:13:17Z

eu_fact_force/ingestion/data_collection/parsers/arxiv.py

+ARXIV_DOI_PREFIX = "10.48550/arXiv."
+
+
+class ArxivMetadataParser(MetadataParser):


I no longer remember the discussions. Is it not possible to download PDFs from arXiv?

Downloading is possible from arXiv, but not from PubMed at first glance.
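For context, arXiv DOIs embed the arXiv identifier after the `10.48550/arXiv.` prefix shown in the diff. The sketch below shows how such a prefix could be used to recover the identifier; `arxiv_id_from_doi` is a hypothetical helper and not necessarily how `ArxivMetadataParser` does it.

```python
from typing import Optional

ARXIV_DOI_PREFIX = "10.48550/arXiv."


def arxiv_id_from_doi(doi: str) -> Optional[str]:
    """Return the arXiv identifier embedded in an arXiv DOI, or None.

    Hypothetical helper for illustration only.
    """
    doi = doi.strip()
    # DOIs are case-insensitive, so compare the prefix in lower case.
    if doi.lower().startswith(ARXIV_DOI_PREFIX.lower()):
        return doi[len(ARXIV_DOI_PREFIX):] or None
    return None
```

A non-arXiv DOI such as `10.1128/mbio.01735-25` yields `None`, which a caller could use to skip the arXiv lookup.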

cgoudet · 2026-03-23T08:14:59Z

eu_fact_force/exploration/data_collection/parsers/base.py

+            return False
+        try:
+            for pdf_url in pdf_urls:
+                response = requests.get(pdf_url, timeout=30)


For a bit more clarity, maybe create a dedicated function that downloads a single file.

When you say a dedicated function, do you mean a helper called from this function that only handles the download itself (so that this function is less complex), or a specific function for each subclass?
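A minimal sketch of the first reading, a single-file helper called from the loop, could look like the following. The names `download_pdf`, `download_first_pdf`, and the injectable `fetch` parameter are assumptions for illustration; only the `requests.get(..., timeout=30)` call and the `%PDF` check come from the diff.

```python
from pathlib import Path
from typing import Callable, Iterable

PDF_MAGIC = b"%PDF"


def _fetch_with_requests(url: str) -> bytes:
    # Imported lazily so the helpers below can be tested without the network.
    import requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.content


def download_pdf(
    url: str,
    output_path: Path,
    fetch: Callable[[str], bytes] = _fetch_with_requests,
) -> bool:
    """Download one URL; write it only if the payload looks like a PDF."""
    content = fetch(url)
    if not content.startswith(PDF_MAGIC):
        # Typically an HTML paywall or landing page rather than the article.
        return False
    output_path.write_bytes(content)
    return True


def download_first_pdf(
    pdf_urls: Iterable[str],
    output_path: Path,
    fetch: Callable[[str], bytes] = _fetch_with_requests,
) -> bool:
    """Try each URL in order and stop at the first one that yields a PDF."""
    for url in pdf_urls:
        if download_pdf(url, output_path, fetch):
            return True
    return False
```

Keeping the network call behind `fetch` means a unit test can pass a fake fetcher, while the base-class method would only need to call `download_first_pdf`.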

cgoudet · 2026-03-23T08:15:38Z

eu_fact_force/exploration/data_collection/parsers/base.py

+                if not response.content.startswith(b"%PDF"):
+                    print(f"Content at {pdf_url} is not a valid PDF (possibly a paywall page).")
+                    continue
+                with open(output_path, "wb") as f:


If you have several files, they will all overwrite one another and only the last one will be available.

Within the function, we only download the PDF from the first URL that is not a paywall page. Files can still overwrite each other across the different APIs, but the idea is that we only want one PDF per DOI, right?
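If the goal is indeed one PDF per DOI, deriving the output path from the DOI makes the overwrite across APIs intentional and prevents collisions between different DOIs. The helper below is a hypothetical sketch; the name `pdf_path_for_doi` and the sanitising rule are assumptions, not part of the PR.

```python
import re
from pathlib import Path


def pdf_path_for_doi(doi: str, directory: Path) -> Path:
    """Map a DOI to a single, filesystem-safe PDF path inside `directory`."""
    # Replace every run of characters outside [A-Za-z0-9._-] (notably "/")
    # with an underscore so the DOI can be used as a file name.
    safe_name = re.sub(r"[^A-Za-z0-9._-]+", "_", doi.strip())
    return directory / f"{safe_name}.pdf"
```

With this mapping, two parsers that find a PDF for the same DOI write to the same file, and different DOIs never share a path unless their sanitised names coincide.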

cgoudet · 2026-03-23T08:15:53Z

eu_fact_force/exploration/data_collection/parsers/base.py

+                return True
+            return False
+        except Exception as e:
+            print(f"Download failed: {e}")


Use logging instead of print.
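A sketch of what that could look like with the standard `logging` module is shown below. The wrapper `safe_download` and its `download` argument are hypothetical names; the broad `except Exception` mirrors the diff and could be narrowed to the HTTP and I/O errors actually expected.

```python
import logging
from typing import Callable

# Module-level logger; handlers and levels are configured by the application.
logger = logging.getLogger(__name__)


def safe_download(download: Callable[[str], bool], pdf_url: str) -> bool:
    """Run a download callable and log failures instead of printing them."""
    try:
        return download(pdf_url)
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("Download failed for %s", pdf_url)
        return False
```

Passing the URL as a logging argument rather than through an f-string defers formatting until a handler actually emits the record.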

cgoudet · 2026-03-23T08:17:28Z

eu_fact_force/ingestion/data_collection/parsers/crossref.py

+            return []
+
+
+if __name__ == "__main__":


Same here, move this into a unit test.

cgoudet · 2026-03-23T08:20:31Z

eu_fact_force/ingestion/data_collection/main.py

+    return {"found": bool(sources), "sources": sources} | merged
+
+
+if __name__ == "__main__":


Here it would be better to integrate this directly into fetch_file_and_metadata in services.py.
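One detail worth keeping in mind when moving the return line from the diff into services.py: with the dict union operator (Python 3.9+), keys from the right-hand operand win. The sketch below uses hypothetical sample data and a hypothetical `build_result` wrapper to show that a `found` or `sources` key inside `merged` would override the computed values.

```python
def build_result(sources: list, merged: dict) -> dict:
    # Same expression as in the diff: on duplicate keys, the right-hand
    # operand (`merged`) takes precedence over the left-hand dict.
    return {"found": bool(sources), "sources": sources} | merged


# Hypothetical data: `merged` happens to carry its own "found" key.
result = build_result(["crossref"], {"title": "Example", "found": False})
```

Here `result["found"]` ends up `False` even though `sources` is non-empty, so the merged metadata should not reuse those two key names if the computed values are meant to survive.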

cgoudet · 2026-03-23T08:21:12Z

eu_fact_force/ingestion/data_collection/README.md

+## Usage
+
+```bash
+python3 main.py --doi 10.1128/mbio.01735-25


This is an exploration-style way of running things, not one for prod, which must go through the API.

pyproject.toml

ClaireHzl and others added 5 commits March 10, 2026 19:29

List the metadata for each API.

1307b04

Add first version of fetching from api to metadata dictionnary.

34777f8

Add of metadata.

995a30a

Fix new metadata error.

3eff7d3

Add the pdf downloading.

20e9f66

ClaireHzl linked an issue Mar 16, 2026 that may be closed by this pull request

Fetch metadata from apis. #22

Open

ClaireHzl marked this pull request as draft March 16, 2026 13:10

ClaireHzl and others added 10 commits March 17, 2026 19:38

Add HAL class.

d674a33

Add pubmed api calls.

3af0303

Simplify parser.

0df0900

Merge branch '22-fetch-metadata-from-apis' of github.com:dataforgoodf…

f8bf7f4

…r/14_EUFactForce into 22-fetch-metadata-from-apis

Add pubmed and openalex parsers.

89d749c

Add document type and doi arg for pubmed parser.

2ac41f6

Add download of pdf.

6fd7fe6

Add main and group parsers.

8173e8c

Fix typo issues.

4545893

Merge branch 'main' into 22-fetch-metadata-from-apis

c17e38a

cgoudet requested changes Mar 23, 2026


ClaireHzl and others added 6 commits March 23, 2026 17:50

Move into prod.

cd2b0c4

Add api name attribute.

5f0a3c9

Add arxiv lib in dependencies.

7a00bae

Update testing doi for openalex, crossref and pubmed.

b94beed

Integration in services.

d30cb56

Update Readme.

e5ec68f

