
From the Wikipedia entry for "data scraping":

> the key element that distinguishes data scraping from regular parsing is that the output being scraped is intended for display to an end-user, rather than as an input to another program

Snapshotting JSON files can be incredibly useful, but I don't think you should call it "scraping".



I think that article actually further supports my position here: https://en.wikipedia.org/wiki/Data_scraping

It has sections covering things called "data scraping" and "web scraping" and "screen scraping" and "report mining", and links to articles about "data mining" and "search engine scraping" as well.

To me, that indicates that the terminology around this stuff is already extremely vague and poorly defined... and the suffix "scraping" is up for grabs for anyone who wants to further define it!

(If you don't like me calling this technique "Git scraping" you're going to /really/ hate the name I picked for my shot-scraper tool https://shot-scraper.datasette.io )


Scraping also has, in some contexts, negative associations. I'm involved with a project for a non-profit that, coincidentally, started out as a remix of some of Simon's code for one of these "Git scraping" projects + Datasette. I recently made the decision to refer to it strictly as what it is: a crawler.

I'm less warm at this point to the general idea behind the hack of dumping the resulting JSON crawl data to GitHub. It's a very roundabout way of approaching basically what something like TerminusDB was made for. It definitely feels like the main motivation was GitHub-specific stuff rather than Git itself (namely, free jobs with GitHub Actions), and everything else flowed from that. GitHub Actions turned out to be too unreliable for executing on schedule anyway, so we ported the crawler to JS with an eye towards using Cloudflare Workers and their cron triggers (which also come in a free flavor).


My first implementation of this pattern predated GitHub Actions and used CircleCI - though GitHub Actions made this massively more convenient to build.
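
For anyone unfamiliar with the pattern under discussion, here is a minimal sketch of the core loop: fetch some JSON, write it to a file in a stable form, and commit only when the content changed. The function names (`fetch_json`, `write_snapshot`, `commit_snapshot`) and the default commit message are made up for illustration and are not taken from any of the projects mentioned above. The scheduler (GitHub Actions, CircleCI, a cron trigger) is assumed to live outside this script.

```python
import json
import subprocess
import urllib.request
from pathlib import Path


def fetch_json(url: str):
    """Download and parse a JSON document (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def write_snapshot(data, path: Path) -> bool:
    """Write data as stable JSON; return True only if the file content changed."""
    # sort_keys and a fixed indent keep the output deterministic, so Git
    # diffs show real data changes rather than key-order noise.
    new_text = json.dumps(data, indent=2, sort_keys=True) + "\n"
    old_text = path.read_text() if path.exists() else None
    if new_text == old_text:
        return False
    path.write_text(new_text)
    return True


def commit_snapshot(path: Path, message: str = "Latest data") -> None:
    """Commit the snapshot; assumes the current directory is inside a Git repo."""
    subprocess.run(["git", "add", str(path)], check=True)
    subprocess.run(["git", "commit", "-m", message], check=True)
```

A scheduled job would then call `write_snapshot(fetch_json(url), path)` and run `commit_snapshot(path)` only when it returns `True`, so the Git history ends up as a log of actual changes to the data.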



