
Maybe consider scraping Bulbapedia into an SQL database? :)

    -- i'm dreaming
    SELECT pokemon.name
    FROM
      pokemon
      JOIN pokemon_learn_moves ON pokemon.id = pokemon_learn_moves.pokemon_id
      JOIN move ON pokemon_learn_moves.move_id = move.id
    WHERE
      move.name_en = 'Hydro Pump'
    ORDER BY pokemon.sp_atk DESC LIMIT 1;
Actually, from this perspective there's a lot of boilerplate; no wonder people like key-value stores... Then use python-nltk to make an English wrapper (and say goodbye to your free time for the next month!)
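To make that last idea concrete, here's a very rough sketch of the "English wrapper", assuming a SQLite file built from the schema above and a hard-coded move list; it's just keyword matching over nltk tokens, not a real parser:

    # nltk.download("punkt") may be needed once for word_tokenize
    # the database file, table/column names and KNOWN_MOVES are all assumptions
    import sqlite3
    import nltk

    QUERY = """
        SELECT pokemon.name
        FROM pokemon
        JOIN pokemon_learn_moves ON pokemon.id = pokemon_learn_moves.pokemon_id
        JOIN move ON pokemon_learn_moves.move_id = move.id
        WHERE move.name_en = ?
        ORDER BY pokemon.sp_atk DESC LIMIT 1"""

    KNOWN_MOVES = {"Hydro Pump", "Thunderbolt"}  # would really come from the move table

    def answer(question, conn):
        tokens = nltk.word_tokenize(question)
        # naive: look for any known two-word move name mentioned in the question
        for first, second in zip(tokens, tokens[1:]):
            candidate = ("%s %s" % (first, second)).title()
            if candidate in KNOWN_MOVES:
                return conn.execute(QUERY, (candidate,)).fetchone()

    conn = sqlite3.connect("pokedex.sqlite")
    print(answer("Which pokemon that learns hydro pump has the best special attack?", conn))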


The veekun pokedex[1] is on GitHub[2] and they have a crapload of CSV[3] files available.

[1]: http://veekun.com/dex

[2]: https://github.com/veekun

[3]: https://github.com/veekun/pokedex/tree/master/pokedex/data/c...
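If anyone wants to play with them, here's a minimal loader sketch that dumps a few of those CSVs into SQLite (the file names and the assumption that the first row is a header come from a quick look at the repo; every column is stored as TEXT):

    # rough loader: each CSV becomes a table named after the file
    import csv
    import sqlite3

    conn = sqlite3.connect("pokedex.sqlite")

    def load_csv(path, table):
        with open(path, newline="") as f:
            rows = list(csv.reader(f))
        header, data = rows[0], rows[1:]
        conn.execute("CREATE TABLE %s (%s)" % (table, ", ".join(header)))
        marks = ", ".join("?" * len(header))
        conn.executemany("INSERT INTO %s VALUES (%s)" % (table, marks), data)

    # file names assumed from pokedex/data/csv in the veekun repo
    for name in ("pokemon", "moves", "pokemon_moves"):
        load_csv("pokedex/data/csv/%s.csv" % name, name)
    conn.commit()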


That's excellent, thank you for pointing those out! Renders the whole discussion somewhat moot.


> no wonder people like key-value stores

You're doing a relational query: select, join, project. Translate it to a key-value store and you're going to end up with way more boilerplate. Using a key-value store for a relational query is like trying to use a screwdriver to drive a nail.


It definitely suits a relational query.

But assuming you have indexes `pokemon_learn_moves_by_move_id` and `moves_by_name`, you could write

    pokemon_learn_moves_by_move_id[ moves_by_name["Hydro Pump"].id ]
        .map( |id| [pokemon[id].sp_atk, pokemon[id].name] )
        .greatest()[1];
(using a hypothetical system) which I think is a bit less boilerplate. Although, now I've explicitly written an execution plan rather than letting the SQL engine decide (I had `sort()[0]` there instead of `greatest()` for a while, which I think sums up why the SQL approach is generally better).
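In plain Python, with dicts standing in for the key-value store and the same hypothetical index names, that's roughly:

    # toy data standing in for the store (ids and stats are illustrative)
    pokemon = {1: {"name": "Blastoise", "sp_atk": 85},
               2: {"name": "Vaporeon", "sp_atk": 110}}
    moves_by_name = {"Hydro Pump": {"id": 56}}
    pokemon_learn_moves_by_move_id = {56: [1, 2]}

    move_id = moves_by_name["Hydro Pump"]["id"]
    learners = pokemon_learn_moves_by_move_id[move_id]
    best = max(learners, key=lambda pid: pokemon[pid]["sp_atk"])
    print(pokemon[best]["name"])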


I've thought about doing this before, actually; I don't know how they'd feel about someone scraping their content, though.


It's not like Bulbapedia owns the fundamental content - do they place a public license on the wiki pages?

Ideally Bulbapedia would provide MediaWiki dumps for this, but they don't, and they've gone on record saying they don't intend to. They did leave the MediaWiki API open, though, if you want to crawl a clean rip of each page's wikitext. The default MediaWiki API guidelines are also intact: single-threaded crawls should be acceptable in almost all instances, but you should warn the site owner before initiating a multi-threaded scrape.
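A single-threaded wikitext fetch through that API could look roughly like this (the api.php location and page title are guesses; check the wiki and warn the owners before crawling anything at scale):

    # polite, single-threaded fetch of raw wikitext via the MediaWiki API
    import time
    import requests

    API = "http://bulbapedia.bulbagarden.net/w/api.php"  # endpoint URL is an assumption

    def wikitext(title):
        r = requests.get(API, params={
            "action": "query", "prop": "revisions", "rvprop": "content",
            "format": "json", "titles": title,
        }, headers={"User-Agent": "example-crawler/0.1 (contact email here)"})
        page = next(iter(r.json()["query"]["pages"].values()))
        return page["revisions"][0]["*"]

    for title in ["Bulbasaur (Pokémon)"]:
        print(wikitext(title)[:200])
        time.sleep(1)  # one request at a time, with a pause between pages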


It looks like all of Bulbapedia's content is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike license. http://bulbapedia.bulbagarden.net/wiki/Bulbapedia:Copyrights


Actually, it is not always necessary to write a lot of boilerplate code when using a SQL database. For example, with Pony ORM the same SQL query could be written as follows:

    select(p.name for p in pokemon
           if "Hydro Pump" in p.learn_moves.move.name_en
           ).order_by(lambda: desc(p.sp_atk)).first()
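For completeness, the entity definitions that query assumes might look something like this (entity and attribute names are guesses chosen to match the SQL example above, capitalized as Pony classes usually are):

    # hypothetical Pony ORM entities mirroring the tables in the SQL example
    from pony.orm import Database, Required, Set, db_session, select, desc

    db = Database()

    class Pokemon(db.Entity):
        name = Required(str)
        sp_atk = Required(int)
        learn_moves = Set("LearnMove")

    class Move(db.Entity):
        name_en = Required(str)
        learned_by = Set("LearnMove")

    class LearnMove(db.Entity):
        pokemon = Required(Pokemon)
        move = Required(Move)

    db.bind("sqlite", ":memory:")
    db.generate_mapping(create_tables=True)

    with db_session:
        best = select(p.name for p in Pokemon
                      if "Hydro Pump" in p.learn_moves.move.name_en
                      ).order_by(lambda: desc(p.sp_atk)).first()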



