ReproDB Pipeline

collect_artifact_stats

`src.utils.collection.collect_artifact_stats`

Thin wrappers around cached repository-stats functions.

Provides a unified API for fetching GitHub/Zenodo/Figshare stats and attaching them to artifact dictionaries in-place. All HTTP work is delegated to the cached helpers in `scrapers.repo_utils`.

`get_all_artifact_stats(results, url_keys)`

Fetch repository stats and attach them to each artifact entry in-place.

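For each artifact whose URL under one of `url_keys` is non-empty, is flagged as existing (`<url_key>_exists`), and points to GitHub, Zenodo, or Figshare, fetches metadata and merges it into `artifact['stats']`. Keys already present in `artifact['stats']` are kept. Returns the same `results` object.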
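The input shape is not spelled out here, but it can be inferred from the source listing: `results` maps a name to a list of artifact dictionaries, and each artifact stores a URL under every key in `url_keys` plus a matching `<url_key>_exists` flag. The sketch below walks that shape with a stub fetcher in place of the real cached helpers. The key name `artifact_url`, the name `example-conf-2024`, the helper `fake_stats`, and all values are hypothetical.

```python
# Hypothetical sample data: key names, venue name, and values are made up.
results = {
    "example-conf-2024": [
        {
            "title": "Paper A",
            "artifact_url": "https://zenodo.org/records/0000000",
            "artifact_url_exists": True,
        },
        {
            "title": "Paper B",
            "artifact_url": "",
            "artifact_url_exists": False,
        },
    ]
}
url_keys = ["artifact_url"]


def fake_stats(url):
    """Stand-in for the cached helpers; returns fixed values, no network."""
    return {"views": 10, "downloads": 3}


def eligible(artifact, url_key):
    """Mirror the guard in the listing: non-empty URL and truthy '<key>_exists'."""
    return bool(artifact.get(url_key, "")) and bool(artifact.get(url_key + "_exists"))


for name, artifacts in results.items():
    for url_key in url_keys:
        for artifact in artifacts:
            if eligible(artifact, url_key):
                stats = fake_stats(artifact[url_key])
                # Same merge as the listing: existing stats win over fetched ones.
                artifact["stats"] = {**stats, **artifact.get("stats", {})}

print(results["example-conf-2024"][0]["stats"])
print("stats" in results["example-conf-2024"][1])
```

Only the first artifact gains a `stats` entry. The second is skipped because its URL is empty and its `artifact_url_exists` flag is false.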

Source code in `src/utils/collection/collect_artifact_stats.py`

def get_all_artifact_stats(results, url_keys):
    """Fetch repository stats and attach them to each artifact entry in-place.

    For each artifact with a valid URL (GitHub/Zenodo/Figshare), fetches
    metadata and merges it into ``artifact['stats']``.
    """
    for name, artifacts in results.items():
        for url_key in url_keys:
            logger.info(f"Getting {url_key} stats for {len(artifacts)} artifacts at {name}")
            for artifact in artifacts:
                url = artifact.get(url_key, "")
                # Skip entries with no URL or a falsy "<url_key>_exists" flag.
                if not url or not artifact.get(url_key + "_exists"):
                    logger.info(f"{url_key} does not exist for {artifact.get('title', '?')} at {name}")
                    continue

                # Pick the fetcher by substring match on the URL, checked in this order.
                if "zenodo" in url:
                    stats = zenodo_stats(url)
                elif "figshare" in url:
                    stats = figshare_stats(url)
                elif "github" in url:
                    stats = github_stats(url)
                else:
                    logger.info(f"No stats for {url} at {name} titled {artifact.get('title', '?')}")
                    continue

                if stats:
                    # Keys already in artifact["stats"] take precedence over fetched ones.
                    artifact["stats"] = {**stats, **artifact.get("stats", {})}

    return results
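The listing picks a provider by testing whether `zenodo`, `figshare`, or `github` appears anywhere in the URL, in that order. The sketch below reproduces that check and compares it with a hypothetical stricter variant that looks only at the hostname. Neither `provider_by_substring` nor `provider_by_host` exists in the module, and the URLs are invented for illustration.

```python
from urllib.parse import urlparse

# Order mirrors the listing: Zenodo, then Figshare, then GitHub.
PROVIDERS = ("zenodo", "figshare", "github")


def provider_by_substring(url):
    """Reproduce the listing's check: first provider name found anywhere in the URL."""
    for provider in PROVIDERS:
        if provider in url:
            return provider
    return None


def provider_by_host(url):
    """Hypothetical stricter variant: match provider names against the hostname only."""
    host = (urlparse(url).hostname or "").lower()
    for provider in PROVIDERS:
        if provider in host:
            return provider
    return None


url = "https://github.com/example/zenodo-tools"
print(provider_by_substring(url))  # zenodo
print(provider_by_host(url))       # github
```

The substring check sends a GitHub repository whose path contains `zenodo` to the Zenodo fetcher. The hostname variant avoids that, but it would no longer recognize DOI links of the form `https://doi.org/10.5281/zenodo.…`, which the substring check does match. This may be why the module uses the looser test.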