I've written a lot of data tools as part of Urbanspoon and subsequent startups. I like to collect publicly available data, clean it up, normalize it, and then release it in a more useful way.
Hot tips for crawling data:
- Cache pages locally while you work on the indexing (see the sketch after this list)
- Nokogiri is awesome
- Don't be afraid to use regular expressions
- Initially, put data into a spreadsheet (not the db). That way it can be checked in and diffed.
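To make the first two tips concrete, here is a minimal sketch of the cache-then-parse loop in Ruby; the cache directory, URL, and CSS selector are made-up examples, not the actual Urbanspoon code:

    require "open-uri"
    require "nokogiri"
    require "digest"
    require "fileutils"

    CACHE_DIR = "cache"  # hypothetical local cache directory

    # Fetch a URL, reusing a cached copy on disk if we already have one,
    # so re-running the indexer doesn't re-hit the site.
    def fetch(url)
      FileUtils.mkdir_p(CACHE_DIR)
      path = File.join(CACHE_DIR, Digest::SHA1.hexdigest(url) + ".html")
      File.write(path, URI.open(url).read) unless File.exist?(path)
      File.read(path)
    end

    # Parse the cached page with Nokogiri; "a.listing" is a stand-in selector.
    doc = Nokogiri::HTML(fetch("http://example.com/restaurants"))
    doc.css("a.listing").each { |link| puts link.text.strip }

Because pages are keyed by a hash of the URL, you can tweak the parsing code and re-run it all day without making a single new request.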
I also have a lot of subtle tricks for cleaning up messy data. For example, to see if two similar authors refer to the same person, I have a method that converts an author name to an author key. The key is just like the name, only it's been uppercased, apostrophes removed, etc. Plus weird stuff like this:
# replace all vowels with the letter E
s = s.gsub(/[AEIOUY]+/, "E")
It's little hacks like this that make a big difference in data quality.
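For illustration, here's roughly what such an author-key method might look like end to end; beyond the steps described above, the exact rules here are made up:

    # Illustrative author_key: combines the steps mentioned above
    # (uppercase, strip apostrophes, collapse vowels) plus dropping
    # other punctuation.
    def author_key(name)
      s = name.upcase
      s = s.gsub(/['’]/, "")         # O'BRIEN -> OBRIEN
      s = s.gsub(/[^A-Z ]/, " ")     # drop anything that isn't a letter or space
      s = s.gsub(/[AEIOUY]+/, "E")   # replace vowel runs with E
      s.squeeze(" ").strip
    end

    author_key("Smith")  # => "SMETH"
    author_key("Smyth")  # => "SMETH" -- same key, so the two spellings match

The vowel-collapsing step is what lets common spelling variants (Smith/Smyth, Gray/Grey) land on the same key.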
Thanks! I've been doing some scraping projects lately and really enjoy it. There's a pretty steep learning curve, but it gets easier and easier as you go along, I think.
1. Caching pages is definitely a great idea while debugging. Especially if the data source has a request limit :)
2. I'd never heard of Nokogiri, but it looks like BeautifulSoup for Ruby. I've found that Python has worked for everything I need so far, but thanks for the reference.
3. I suck so bad at regex, but using it more will help me climb that mountain.
4. One tip I've used is writing out the "INSERT INTO TABLE..." statements along with the scraped results (see the sketch after this list). I definitely use CSV (and Google Refine) for general clean-up and spot checking.
5. You should write a 'Data Scraping One-liners Explained' ebook :)
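For what it's worth, a sketch of point 4 (in Ruby to match the parent comment; the table name and columns are made up):

    require "csv"

    # Pretend these came out of the scraper.
    rows = [
      { name: "The Fat Duck", city: "Bray" },
      { name: "Noma",         city: "Copenhagen" },
    ]

    # CSV copy for spot-checking in a spreadsheet or Google Refine...
    CSV.open("restaurants.csv", "w") do |csv|
      csv << rows.first.keys
      rows.each { |r| csv << r.values }
    end

    # ...and INSERT statements written alongside, ready to load later.
    # (Naive quoting -- fine for a one-off load you eyeball first.)
    File.open("restaurants.sql", "w") do |sql|
      rows.each do |r|
        sql.puts "INSERT INTO restaurants (name, city) " \
                 "VALUES ('#{r[:name].gsub("'", "''")}', '#{r[:city]}');"
      end
    end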
I spent the first half of this year writing scrapers for every newspaper in the UK. My Top Regex Tip is http://rubular.com/ - this thing saved me HOURS of my life.
Then you are the opposite of where I used to be -- I thought I understood and could use regexps. Hell, I do Perl for fun. :-)
It was something like when I first sat down with Photoshop -- "Hey, I know how to program Macs [this was before Mac OS X]. This is just using a Macintosh program, so I should have no problems"... :-)
Read "Mastering Regular Expressions". It made me feel embarrassed about my previous stupidity [Edit: Embarrassment, your name is Dunning–Kruger :-) ]. Just the first few chapters are enough to change your world.
May I add: usually, scraping is a "one time only" job, so feel free to use all the hacks you want to get the job done faster. For instance, wget/grep into a file, use vim to clean it up a bit, mix in some awk, perl or bash. The goal is just to get the data, not to write production-quality code that will be maintained for years.
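For example, a throwaway Ruby equivalent of the grep-into-a-file step (sticking with Ruby since that's the thread's example language; the glob and the phone-number pattern are made up):

    # Throwaway, in the same spirit as wget + grep: scan already-downloaded
    # pages for phone numbers and dump the matches to a file to clean up by hand.
    File.open("phones.txt", "w") do |out|
      Dir.glob("pages/*.html").each do |path|
        File.read(path).scan(/\(\d{3}\)\s*\d{3}-\d{4}/) { |m| out.puts m }
      end
    end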
For tricky pages, though, it can be a good idea to write tests. Be sure to cache all downloaded pages so the tests run uber-fast. It's a good way to perfect that regex. (Yes, there are programs for that, but sometimes the HTML has weird newlines or characters that screw everything up.)
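As a rough sketch of that kind of test (Minitest against a cached fixture; the path, regex, and expected value are invented):

    require "minitest/autorun"

    # Runs the extraction regex against a locally cached page so the suite
    # stays fast and offline. "fixtures/listing.html" is a made-up path.
    class PriceRegexTest < Minitest::Test
      PRICE_RE = /Price:\s*\$([\d,]+\.\d{2})/

      def test_extracts_price_from_cached_page
        html = File.read("fixtures/listing.html")
        assert_equal "1,299.00", html[PRICE_RE, 1]
      end
    end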
Also, I think it's important to emphasize not coding everything. Say you've got 10 links to get on the first page, and inside each of those links you've got hundreds more. Obviously, you won't go to each page and manually copy everything; but for the first 10 links, it's useless to write a script to crawl them. Just copy them from the source and clean them up. Gee, use a macro in emacs or your favorite editor if you really can't stand repeating a task 10 times.