I've written a lot of data tools as part of Urbanspoon and subsequent startups. I like to collect publicly available data, clean it up, normalize it, and then release it in a more useful way.
Hot tips for crawling data:
- Cache pages locally while you work on the indexing (see the sketch after this list)
- Nokogiri is awesome
- Don't be afraid to use regular expressions
- Initially, put data into a spreadsheet (not the db). That way it can be checked in and diffed.
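To make the first two tips concrete, here is a minimal sketch of the cache-then-parse loop in Ruby; the cache directory, URL, and CSS selector are made-up examples, not the actual Urbanspoon code:

    require "open-uri"
    require "nokogiri"
    require "digest"
    require "fileutils"

    CACHE_DIR = "cache"  # hypothetical local cache directory

    # Fetch a URL, reusing a cached copy on disk if we already have one,
    # so re-running the indexer doesn't re-hit the site.
    def fetch(url)
      FileUtils.mkdir_p(CACHE_DIR)
      path = File.join(CACHE_DIR, Digest::SHA1.hexdigest(url) + ".html")
      File.write(path, URI.open(url).read) unless File.exist?(path)
      File.read(path)
    end

    # Parse the cached page with Nokogiri; "a.listing" is a stand-in selector.
    doc = Nokogiri::HTML(fetch("http://example.com/restaurants"))
    doc.css("a.listing").each { |link| puts link.text.strip }

Because pages are keyed by a hash of the URL, you can tweak the parsing code and re-run it all day without making a single new request.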
I also have a lot of subtle tricks for cleaning up messy data. For example, to see if two similar authors refer to the same person, I have a method that converts an author name to an author key. The key is just like the name, only it's been uppercased, apostrophes removed, etc. Plus weird stuff like this:
# replace all vowels with the letter E
s = s.gsub(/[AEIOUY]+/, "E")
It's little hacks like this that make a big difference in data quality.
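For illustration, here's roughly what such an author-key method might look like end to end; beyond the steps described above, the exact rules here are made up:

    # Illustrative author_key: combines the steps mentioned above
    # (uppercase, strip apostrophes, collapse vowels) plus dropping
    # other punctuation.
    def author_key(name)
      s = name.upcase
      s = s.gsub(/['’]/, "")         # O'BRIEN -> OBRIEN
      s = s.gsub(/[^A-Z ]/, " ")     # drop anything that isn't a letter or space
      s = s.gsub(/[AEIOUY]+/, "E")   # replace vowel runs with E
      s.squeeze(" ").strip
    end

    author_key("Smith")  # => "SMETH"
    author_key("Smyth")  # => "SMETH" -- same key, so the two spellings match

The vowel-collapsing step is what lets common spelling variants (Smith/Smyth, Gray/Grey) land on the same key.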
Thanks! I've been doing some scraping projects lately and really enjoy it. There's a pretty steep learning curve, but it gets easier and easier as you go along, I think.
1. Caching pages is definitely a great idea while debugging. Especially if the data source has a request limit :)
2. I'd never heard of Nokogiri, but it looks like BeautifulSoup for Ruby. I've found that Python has worked for everything I need so far, but thanks for the reference.
3. I suck so bad at regex, but using it more will help me climb that mountain.
4. One tip I've used is writing out the "INSERT INTO TABLE..." statements along with the scraped results (see the sketch after this list). I definitely use CSV (and Google Refine) for general clean-up and spot checking.
5. You should write a 'Data Scraping One-liners Explained' ebook :)
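For what it's worth, a sketch of point 4 (in Ruby to match the parent comment; the table name and columns are made up):

    require "csv"

    # Pretend these came out of the scraper.
    rows = [
      { name: "The Fat Duck", city: "Bray" },
      { name: "Noma",         city: "Copenhagen" },
    ]

    # CSV copy for spot-checking in a spreadsheet or Google Refine...
    CSV.open("restaurants.csv", "w") do |csv|
      csv << rows.first.keys
      rows.each { |r| csv << r.values }
    end

    # ...and INSERT statements written alongside, ready to load later.
    # (Naive quoting -- fine for a one-off load you eyeball first.)
    File.open("restaurants.sql", "w") do |sql|
      rows.each do |r|
        sql.puts "INSERT INTO restaurants (name, city) " \
                 "VALUES ('#{r[:name].gsub("'", "''")}', '#{r[:city]}');"
      end
    end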
I spent the first half of this year writing scrapers for every newspaper in the UK. My Top Regex Tip is http://rubular.com/ - this thing saved me HOURS of my life.
Then you are the opposite of where I used to be -- I thought I understood and could use regexps. Hell, I do Perl for fun. :-)
It was something like when I first sat down with Photoshop -- "Hey, I know how to program Macs [this was before Mac OS X]. This is just using a Macintosh program, so I should have no problems"... :-)
Read "Mastering Regular Expressions". It made me feel embarrassed about my previous stupidity [Edit: Embarrassment, your name is Dunning–Kruger :-) ]. Just the first few chapters are enough to change your world.
May I add: usually, scraping is a "one time only" job, so feel free to use all the hacks you want to get the job done faster. For instance, wget/grep into a file, use vim to clean it up a bit, mix in some awk, perl or bash. The goal is just to get the data, not to write production-quality code that will be maintained for years.
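For example, a throwaway Ruby equivalent of the grep-into-a-file step (sticking with Ruby since that's the thread's example language; the glob and the phone-number pattern are made up):

    # Throwaway, in the same spirit as wget + grep: scan already-downloaded
    # pages for phone numbers and dump the matches to a file to clean up by hand.
    File.open("phones.txt", "w") do |out|
      Dir.glob("pages/*.html").each do |path|
        File.read(path).scan(/\(\d{3}\)\s*\d{3}-\d{4}/) { |m| out.puts m }
      end
    end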
For tricky pages, though, it can be a good idea to write tests. Be sure to cache all downloaded pages so the tests run uber-fast. It's a good way to perfect that regex. (Yes, there are programs for that, but sometimes the HTML has weird newlines or characters that screw everything up.)
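As a rough sketch of that kind of test (Minitest against a cached fixture; the path, regex, and expected value are invented):

    require "minitest/autorun"

    # Runs the extraction regex against a locally cached page so the suite
    # stays fast and offline. "fixtures/listing.html" is a made-up path.
    class PriceRegexTest < Minitest::Test
      PRICE_RE = /Price:\s*\$([\d,]+\.\d{2})/

      def test_extracts_price_from_cached_page
        html = File.read("fixtures/listing.html")
        assert_equal "1,299.00", html[PRICE_RE, 1]
      end
    end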
Also, I think it's important to emphasize not coding everything. Say you've got 10 links to get on the first page, and inside each of those links you've got hundreds more. Obviously, you won't go to each page and manually copy everything; but for the first 10 links, it's useless to write a script to crawl them. Just copy them from the source and clean them up. Gee, use a macro in emacs or your favorite editor if you really can't stand repeating a task 10 times.