generate_paper_citations

src.generators.generate_paper_citations

Generate paper citation counts from Google Scholar.

Uses a single source (Google Scholar, via the scholarly library) so that all citation counts are comparable. Results are cached to disk so the enricher can be run in batches: if Scholar blocks requests, the run stops, and the next run automatically skips papers that are already cached.
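The batch behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the `enrich` function, the `ScholarBlocked` exception, and the flat title-to-result cache layout are all hypothetical names chosen for the example.

```python
import json
from pathlib import Path


class ScholarBlocked(Exception):
    """Hypothetical signal that the lookup source refused the request."""


def enrich(titles, lookup, cache_path):
    """Look up each title once; stop at the first block and keep progress."""
    path = Path(cache_path)
    # Load results from earlier runs, if any (assumed layout: {title: result}).
    cache = json.loads(path.read_text()) if path.exists() else {}
    for title in titles:
        if title in cache:
            continue  # already fetched on an earlier run
        try:
            cache[title] = lookup(title)
        except ScholarBlocked:
            break  # stop gracefully; the next run resumes from here
        # Persist after every paper so an interrupted run loses nothing.
        path.write_text(json.dumps(cache, indent=2))
    return cache
```

Writing the cache after every successful lookup is one possible design choice; it trades a little I/O for the guarantee that a blocked or interrupted run never has to repeat completed lookups.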

Reads

assets/data/artifacts.json — paper titles, conferences, badges

Outputs

assets/data/paper_citations.json — per-paper citation data

assets/data/paper_citations_summary.json — aggregate summary

Usage

Full run (will stop gracefully if blocked):

python3 -m src.generators.generate_paper_citations \
    --data_dir ../reprodb.github.io

Report what's cached without making any API calls:

python3 -m src.generators.generate_paper_citations \
    --data_dir ../reprodb.github.io --cache_only

Custom cache TTL (default: 90 days):

python3 -m src.generators.generate_paper_citations \
    --data_dir ../reprodb.github.io --cache_ttl_days 90

scholar_lookup(title: str) -> dict | None

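Query Google Scholar for a paper's citation count. Returns a result dict, or None if the search yields no result or the top result's title overlaps too little with the query (Jaccard similarity below 0.5 on word sets).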

Source code in src/generators/generate_paper_citations.py
def scholar_lookup(title: str) -> dict | None:
    """Query Google Scholar for citation count. Returns result dict or None."""
    from scholarly import scholarly

    time.sleep(SCHOLAR_DELAY)
    results = scholarly.search_pubs(title)
    pub = next(results, None)
    if not pub:
        return None

    # Verify match quality (Jaccard ≥ 0.5 on word sets)
    bib = pub.get("bib", {})
    pub_title = bib.get("title", "")
    norm_q = set(normalize_title(title).split())
    norm_r = set(normalize_title(pub_title).split())
    if norm_q and norm_r:
        jaccard = len(norm_q & norm_r) / len(norm_q | norm_r)
        if jaccard < 0.5:
            return None

    return {
        "cited_by_count": pub.get("num_citations", 0),
        "scholar_title": pub_title,
        "scholar_year": bib.get("pub_year"),
    }
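To make the match-quality check concrete, here is a small standalone sketch of the word-set Jaccard comparison. The `normalize_title` below is a stand-in written for this example (lowercase, punctuation replaced by spaces); the module's real `normalize_title` is not shown on this page and may differ. `title_jaccard` is likewise a hypothetical helper name.

```python
import re


def normalize_title(title: str) -> str:
    """Stand-in normalizer (assumption): lowercase, keep letters, digits, spaces."""
    return re.sub(r"[^a-z0-9 ]+", " ", title.lower())


def title_jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two titles' word sets (0.0 if either is empty)."""
    wa = set(normalize_title(a).split())
    wb = set(normalize_title(b).split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


# Same title with different case and punctuation: all 5 words shared.
print(title_jaccard("Attention Is All You Need", "Attention is all you need."))
# 2 shared words out of 5 distinct words: 0.4, below the 0.5 threshold.
print(title_jaccard("Deep Learning", "Deep Reinforcement Learning for Robotics"))
```

One difference from `scholar_lookup` is worth noting: the function above returns 0.0 when either word set is empty, whereas `scholar_lookup` skips the check in that case and accepts the result. Word-set Jaccard also ignores word order, so short titles that share most of their words can pass the 0.5 threshold even when they name different papers.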