acm_scrape
src.scrapers.acm_scrape
Scrape ACM Digital Library conference proceedings for paper artifact badges.
ACM conferences (e.g., CCS, SOSP) display artifact evaluation badges on individual paper pages in the ACM DL. This module:
- Uses the DBLP API to discover all papers for a given proceedings volume.
- Attempts to scrape badge information directly from ACM DL paper pages.
- Gracefully degrades when ACM DL access is blocked (Cloudflare 403), returning the DBLP paper list without badge data.
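The DBLP discovery step can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: it assumes the public DBLP publication search endpoint with a `toc:` query, assumes the volume key follows the `db/conf/<conf>/<conf><year>.bht` pattern (true for many venues, not all), and uses hypothetical helper names (`build_dblp_toc_url`, `parse_dblp_hits`) and a hypothetical paper dict shape.

```python
from urllib.parse import urlencode

DBLP_SEARCH_URL = "https://dblp.org/search/publ/api"


def build_dblp_toc_url(conference: str, year: int, max_hits: int = 1000) -> str:
    """Build a DBLP search URL listing one proceedings volume.

    Assumption: the volume key is db/conf/<conf>/<conf><year>.bht.
    """
    toc_key = f"toc:db/conf/{conference}/{conference}{year}.bht:"
    query = urlencode({"q": toc_key, "h": max_hits, "format": "json"})
    return f"{DBLP_SEARCH_URL}?{query}"


def parse_dblp_hits(payload: dict) -> list[dict]:
    """Turn a decoded DBLP JSON response into a list of paper dicts.

    Each paper starts with an empty badge list, which is what remains
    if ACM DL turns out to be unreachable later on.
    """
    hits = payload.get("result", {}).get("hits", {}).get("hit", [])
    papers = []
    for hit in hits:
        info = hit.get("info", {})
        papers.append(
            {
                "title": info.get("title", ""),
                "doi": info.get("doi", ""),
                "badges": [],  # hypothetical field name
            }
        )
    return papers
```

Keeping URL construction and response parsing separate means the parsing logic can be exercised on a saved JSON payload without touching the network.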
Usage examples
Scrape CCS 2024 (attempts ACM DL, falls back to YAML)
python acm_scrape.py --conference ccs --years 2024
Scrape CCS 2023 and 2024
python acm_scrape.py --conference ccs --years 2023,2024
Output as YAML suitable for the pipeline
python acm_scrape.py --conference ccs --years 2024 --format yaml
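For readers wondering how a comma-separated `--years` value such as `2023,2024` might be handled, here is a small sketch using `argparse`. The flag names come from the examples above; everything else (the helper names `parse_years` and `parse_args`, the `output_format` destination, the `None` default) is an assumption and may differ from the real script.

```python
import argparse


def parse_years(value: str) -> list[int]:
    """Split a comma-separated string like '2023,2024' into integers."""
    return [int(part) for part in value.split(",") if part.strip()]


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the three flags shown in the usage examples (illustrative only)."""
    parser = argparse.ArgumentParser(description="Scrape ACM DL artifact badges")
    parser.add_argument("--conference", required=True)
    parser.add_argument("--years", required=True, type=parse_years)
    # Assumption: no default output format is known, so None stands in for it.
    parser.add_argument("--format", dest="output_format", default=None)
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(args.conference, args.years, args.output_format)
```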
scrape_acm_proceedings(conference, year, session=None, max_workers=4, delay=0.5)
Scrape an ACM DL proceedings volume for paper titles and artifact badges.
- Gets papers from DBLP (this step does not depend on ACM DL access).
- For each paper, tries to scrape badge info from ACM DL.
- If ACM DL is blocked (403), the function stops attempting further papers and returns papers with empty badge lists (partial data).
Returns:
| Type | Description |
|---|---|
| tuple | (papers_list, acm_dl_accessible), where acm_dl_accessible is a bool indicating whether ACM DL scraping succeeded. |
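The stop-on-403 behaviour and the shape of the return value can be illustrated with a simplified, sequential sketch. It is not the actual implementation: the real function also takes `session`, `max_workers`, and `delay`, which are ignored here, and the `AccessBlocked` exception, the `collect_badges` name, and the injected `fetch_badges` callable are all hypothetical stand-ins.

```python
class AccessBlocked(Exception):
    """Hypothetical signal that ACM DL answered with HTTP 403."""


def collect_badges(papers, fetch_badges):
    """Attach badges to each paper until ACM DL blocks access.

    fetch_badges is any callable that takes a paper dict and returns a
    list of badge names, raising AccessBlocked on a 403 response.
    Returns (papers, acm_dl_accessible).
    """
    acm_dl_accessible = True
    for paper in papers:
        paper.setdefault("badges", [])
        if not acm_dl_accessible:
            continue  # blocked earlier: leave the empty badge list in place
        try:
            paper["badges"] = fetch_badges(paper)
        except AccessBlocked:
            acm_dl_accessible = False  # stop trying for the remaining papers
    return papers, acm_dl_accessible
```

A caller would unpack the result as `papers, accessible = collect_badges(...)` and treat the badge data as partial whenever `accessible` is `False`.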
Source code in src/scrapers/acm_scrape.py
scrape_conference_year(conference, year, session=None, max_workers=4, delay=0.5)
Get artifact data for an ACM conference/year via DBLP + ACM DL scraping.
Returns a list of dicts ready for to_pipeline_format().
Source code in src/scrapers/acm_scrape.py
to_pipeline_format(artifacts)
Convert scraped/merged artifacts to the format used by generate_statistics.py. Only includes papers that have at least one badge.
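The "at least one badge" filter can be sketched as below. The `badges` and `title` keys and the `keep_badged` name are assumptions made for illustration; the actual output fields expected by generate_statistics.py are not shown in this page and are not reproduced here.

```python
def keep_badged(artifacts):
    """Return copies of the artifacts that carry at least one badge.

    Assumes each artifact is a dict with an optional 'badges' list.
    """
    return [dict(artifact) for artifact in artifacts if artifact.get("badges")]


scraped = [
    {"title": "Paper A", "badges": ["Artifacts Available"]},
    {"title": "Paper B", "badges": []},
    {"title": "Paper C"},  # no badge data at all, e.g. ACM DL was blocked
]
badged = keep_badged(scraped)
```

One consequence of this filter is that a year scraped while ACM DL was blocked yields an empty result, since every paper then has an empty badge list.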
Source code in src/scrapers/acm_scrape.py
get_acm_conferences()
Return the ACM_CONFERENCES dict for use by the pipeline.
Source code in src/scrapers/acm_scrape.py