acsac_scrape¶

`src.scrapers.acsac_scrape` ¶

Scrape ACSAC conference artifact evaluation pages for paper titles, badges, and artifact URLs.

ACSAC publishes artifact evaluation results at

https://www.acsac.org//program/artifacts/

The page groups papers under badge images (code_available, code_reviewed, code_reproducible) in bullet lists. Each bullet contains the paper title and one or more links to the artifact.

Usage as a library

from src.scrapers.acsac_scrape import scrape_acsac_artifacts artifacts = scrape_acsac_artifacts(2025)

Usage standalone

python -m src.scrapers.acsac_scrape --years 2024,2025

`scrape_acsac_artifacts(year, session=None)` ¶

Scrape the ACSAC artifact evaluation page for a given year.

Parameters:

Name	Type	Description	Default
`year`		Conference year (int).	required
`session`		Optional requests.Session to reuse.	`None`

Returns:

Type	Description
	List of dicts with keys: - title (str): paper title - badges (list[str]): e.g. ['available'] or ['available', 'reviewed'] - artifact_urls (list[str]): artifact link(s) - paper_url (str): always empty (ACSAC pages don't link to papers)

Note: ACSAC awards each paper exactly one badge level (the highest), so each artifact appears once under its highest badge section.

Source code in src/scrapers/acsac_scrape.py

def scrape_acsac_artifacts(year, session=None):
    """
    Scrape the ACSAC artifact evaluation page for a given year.

    Args:
        year: Conference year (int).
        session: Optional requests.Session to reuse.

    Returns:
        List of dicts with keys:
          - title (str): paper title
          - badges (list[str]): e.g. ['available'] or ['available', 'reviewed']
          - artifact_urls (list[str]): artifact link(s)
          - paper_url (str): always empty (ACSAC pages don't link to papers)

    Note: ACSAC awards each paper exactly one badge level (the highest),
    so each artifact appears once under its highest badge section.
    """
    if session is None:
        session = create_session()

    url = ACSAC_ARTIFACTS_URL.format(year=year)
    logger.info("Fetching %s", url)
    resp = session.get(url, timeout=30)
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")
    content = soup.find("main") or soup.find("article") or soup.body
    if content is None:
        logger.warning("Could not find main content element for ACSAC %d", year)
        return []

    # First pass: collect raw entries with their single badge
    raw_entries = []
    current_badge = None

    for element in content.descendants:
        # Detect badge section from badge images.
        # NOTE: some ACSAC pages have incorrect alt text (e.g. "Code Available"
        # for the reviewed badge), so we match on src first, falling back to alt.
        if element.name == "img":
            src = element.get("src", "").lower()
            alt = element.get("alt", "").lower().replace(" ", "_")
            matched = False
            for fragment, badge_name in _BADGE_MAP.items():
                if fragment in src:
                    current_badge = badge_name
                    matched = True
                    break
            if not matched:
                for fragment, badge_name in _BADGE_MAP.items():
                    if fragment in alt:
                        current_badge = badge_name
                        break

        # Parse paper entries from list items.
        # NOTE: ACSAC pages use malformed HTML where <li> elements are nested
        # (no closing </li> tags), so element.get_text() would include all
        # subsequent entries.  We only collect direct children up to the first
        # child <li>.
        if element.name == "li" and current_badge:
            title_parts = []
            artifact_urls = []
            for child in element.children:
                if getattr(child, "name", None) == "li":
                    break
                if getattr(child, "name", None) == "a":
                    href = child.get("href", "").strip()
                    if href and not href.startswith("#"):
                        artifact_urls.append(_strip_tokens(href))
                elif isinstance(child, str):
                    title_parts.append(child.strip())

            title = " ".join(p for p in title_parts if p).strip()
            title = re.sub(r"^Title:\s*", "", title)

            if title:
                raw_entries.append(
                    {
                        "title": title,
                        "badge": current_badge,
                        "artifact_urls": artifact_urls,
                    }
                )

    # Convert to the common artifact format (list of badges per paper)
    # ACSAC gives each paper its highest badge, so badges is a single-element list
    artifacts = []
    for entry in raw_entries:
        artifacts.append(
            {
                "title": entry["title"],
                "badges": [entry["badge"]],
                "artifact_urls": entry["artifact_urls"],
                "paper_url": "",
            }
        )

    return artifacts

acsac_scrape¶

src.scrapers.acsac_scrape ¶

scrape_acsac_artifacts(year, session=None) ¶

`src.scrapers.acsac_scrape` ¶

`scrape_acsac_artifacts(year, session=None)` ¶