
Maybe consider scraping Bulbapedia into an SQL database? :)

    -- i'm dreaming
    SELECT pokemon.name
    FROM
      pokemon
      JOIN pokemon_learn_moves ON pokemon.id = pokemon_learn_moves.pokemon_id
      JOIN move ON pokemon_learn_moves.move_id = move.id
    WHERE
      move.name_en = 'Hydro Pump'
    ORDER BY pokemon.sp_atk DESC LIMIT 1;
Actually, from this perspective there's a lot of boilerplate; no wonder people like key-value stores... Then use python-nltk to make an English wrapper (and say goodbye to your free time for the next month!)
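To make that last idea concrete, here's a very rough sketch of the "English wrapper", assuming a SQLite file built from the schema above and a hard-coded move list; it's just keyword matching over nltk tokens, not a real parser:

    # nltk.download("punkt") may be needed once for word_tokenize
    # the database file, table/column names and KNOWN_MOVES are all assumptions
    import sqlite3
    import nltk

    QUERY = """
        SELECT pokemon.name
        FROM pokemon
        JOIN pokemon_learn_moves ON pokemon.id = pokemon_learn_moves.pokemon_id
        JOIN move ON pokemon_learn_moves.move_id = move.id
        WHERE move.name_en = ?
        ORDER BY pokemon.sp_atk DESC LIMIT 1"""

    KNOWN_MOVES = {"Hydro Pump", "Thunderbolt"}  # would really come from the move table

    def answer(question, conn):
        tokens = nltk.word_tokenize(question)
        # naive: look for any known two-word move name mentioned in the question
        for first, second in zip(tokens, tokens[1:]):
            candidate = ("%s %s" % (first, second)).title()
            if candidate in KNOWN_MOVES:
                return conn.execute(QUERY, (candidate,)).fetchone()

    conn = sqlite3.connect("pokedex.sqlite")
    print(answer("Which pokemon that learns hydro pump has the best special attack?", conn))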


The veekun pokedex[1] is on GitHub[2] and they have a crapload of CSV[3] files available.

[1]: http://veekun.com/dex

[2]: https://github.com/veekun

[3]: https://github.com/veekun/pokedex/tree/master/pokedex/data/c...
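If anyone wants to play with them, here's a minimal loader sketch that dumps a few of those CSVs into SQLite (the file names and the assumption that the first row is a header come from a quick look at the repo; every column is stored as TEXT):

    # rough loader: each CSV becomes a table named after the file
    import csv
    import sqlite3

    conn = sqlite3.connect("pokedex.sqlite")

    def load_csv(path, table):
        with open(path, newline="") as f:
            rows = list(csv.reader(f))
        header, data = rows[0], rows[1:]
        conn.execute("CREATE TABLE %s (%s)" % (table, ", ".join(header)))
        marks = ", ".join("?" * len(header))
        conn.executemany("INSERT INTO %s VALUES (%s)" % (table, marks), data)

    # file names assumed from pokedex/data/csv in the veekun repo
    for name in ("pokemon", "moves", "pokemon_moves"):
        load_csv("pokedex/data/csv/%s.csv" % name, name)
    conn.commit()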


That's excellent, thank you for pointing those out! Renders the whole discussion somewhat moot.


> no wonder people like key-value stores

You're doing a relational query: select, join, project. Translate it to a key-value store and you're going to end up with way more boilerplate. Using a key-value store for a relational query is like trying to use a screwdriver to drive a nail.


It definitely suits a relational query.

But assuming you have indexes `pokemon_learn_moves_by_move_id` and `moves_by_name`, you could write

    pokemon_learn_moves_by_move_id[ moves_by_name["Hydro Pump"].id ]
        .map( |id| [pokemon[id].sp_atk, pokemon[id].name] )
        .greatest()[1];
(using a hypothetical system) which I think is a bit less boilerplate. Although, now I've explicitly written an execution plan rather than letting the SQL engine decide (I had `sort()[0]` there instead of `greatest()` for a while, which I think sums up why the SQL approach is generally better).
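In plain Python, with dicts standing in for the key-value store and the same hypothetical index names, that's roughly:

    # toy data standing in for the store (ids and stats are illustrative)
    pokemon = {1: {"name": "Blastoise", "sp_atk": 85},
               2: {"name": "Vaporeon", "sp_atk": 110}}
    moves_by_name = {"Hydro Pump": {"id": 56}}
    pokemon_learn_moves_by_move_id = {56: [1, 2]}

    move_id = moves_by_name["Hydro Pump"]["id"]
    learners = pokemon_learn_moves_by_move_id[move_id]
    best = max(learners, key=lambda pid: pokemon[pid]["sp_atk"])
    print(pokemon[best]["name"])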


I've thought about doing this before, actually; I don't know how they'd feel about someone scraping their content, though.


It's not like Bulbapedia owns the fundamental content - do they place a public license on the wiki pages?

Ideally Bulbapedia would provide MediaWiki dumps for this, but they don't, and they've gone on record saying they don't intend to. They did leave the MediaWiki API open, though, if you want to crawl a clean rip of each page's wikitext. The default MediaWiki API guidelines are also intact: single-threaded crawls should be acceptable in almost all instances, but you should warn the site owner before initiating a multi-threaded scrape.
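A single-threaded wikitext fetch through that API could look roughly like this (the api.php location and page title are guesses; check the wiki and warn the owners before crawling anything at scale):

    # polite, single-threaded fetch of raw wikitext via the MediaWiki API
    import time
    import requests

    API = "http://bulbapedia.bulbagarden.net/w/api.php"  # endpoint URL is an assumption

    def wikitext(title):
        r = requests.get(API, params={
            "action": "query", "prop": "revisions", "rvprop": "content",
            "format": "json", "titles": title,
        }, headers={"User-Agent": "example-crawler/0.1 (contact email here)"})
        page = next(iter(r.json()["query"]["pages"].values()))
        return page["revisions"][0]["*"]

    for title in ["Bulbasaur (Pokémon)"]:
        print(wikitext(title)[:200])
        time.sleep(1)  # one request at a time, with a pause between pages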


It looks like all of Bulbapedia's content is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike license. http://bulbapedia.bulbagarden.net/wiki/Bulbapedia:Copyrights


Actually, it is not always necessary to write a lot of boilerplate code when using a SQL database. For example, with Pony ORM the same SQL query could be written as follows:

    select(p.name for p in pokemon
           if "Hydro Pump" in p.learn_moves.move.name_en
           ).order_by(lambda: desc(p.sp_atk)).first()
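For completeness, the entity definitions that query assumes might look something like this (entity and attribute names are guesses chosen to match the SQL example above, capitalized as Pony classes usually are):

    # hypothetical Pony ORM entities mirroring the tables in the SQL example
    from pony.orm import Database, Required, Set, db_session, select, desc

    db = Database()

    class Pokemon(db.Entity):
        name = Required(str)
        sp_atk = Required(int)
        learn_moves = Set("LearnMove")

    class Move(db.Entity):
        name_en = Required(str)
        learned_by = Set("LearnMove")

    class LearnMove(db.Entity):
        pokemon = Required(Pokemon)
        move = Required(Move)

    db.bind("sqlite", ":memory:")
    db.generate_mapping(create_tables=True)

    with db_session:
        best = select(p.name for p in Pokemon
                      if "Hydro Pump" in p.learn_moves.move.name_en
                      ).order_by(lambda: desc(p.sp_atk)).first()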



