Where is all the book data? (publicbooks.org)
61 points by hkhn on Nov 11, 2022 | 11 comments


Oh, hey. I was just looking at this article the other day.

I think this article makes some good points, but it's a little too absolutist about this data. As far as I can tell, if you are an author or industry professional, you can get access to the following data:

* If you want data on your own books: sign up for Amazon Author Central and they'll give you the BookScan data on your books. This is free. https://press.aboutamazon.com/2010/12/weekly-nielsen-booksca...

* If you want data on comps (i.e., comparable books, or books you are competing against): sign up for Publishers Marketplace and pay for the monthly package ($25/month on top of the PM subscription). This gives you the ability to track 5 ISBNs (and, I assume, you can pick new ISBNs every month). https://www.publishersmarketplace.com/bookscan/about.cgi#dat...

* If you want public library checkout data: as linked in the article, go to the Seattle Public Library. Free. https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-y...

The situation only really gets ugly if you want access to broad market data (i.e., across all ISBNs for a given time window, and covering a majority of retail outlets). The best I'm able to find is this comment by Kristen McLean from NPD: https://countercraft.substack.com/p/no-most-books-dont-sell-...

But this data (a) is limited and (b), I think, has some pretty serious issues [1]. I sent an email to Kristen to try to address this, but so far no response. (If anyone has any connections that might help, please contact me!)

And if you want to get access to the data yourself, you're talking about something to the tune of $2,500 USD. And the terms are pretty restrictive. https://www.publishersmarketplace.com/bookscan/about.cgi#pri... https://www.publishersmarketplace.com/bookscan/terms.shtml

I am actively working on improving this situation, and I've got some ideas for what we could do while still abiding by NPD's terms. If that's something that interests you, please contact me (see my profile).

[1]: My issue with Kristen's analysis is that it follows (in her words) a "conveyor belt" pattern. That is, the time window is fixed. Within that time window, some books have been on the market 364 days. Some may have been on the market 1 day. So it's not surprising that some books have very few sales: they may simply not have been on the market long enough. And you can't just say, "well, multiply the data by 2 to account for the average case," because I'm pretty sure that doesn't work: doubling might fix the average, but a book that has been out for 1 day is still badly undercounted, and one that has been out for 364 days ends up overcounted. But without real data I can't fix this.
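To make the footnote concrete, here is a toy sketch of the fixed-window effect. Everything in it is an assumption made up for illustration, not BookScan data: a 365-day window, one book released on each day of the window, and every book selling exactly one copy per day (real sales are front-loaded, so this only shows the shape of the problem).

```python
WINDOW_DAYS = 365  # assumed length of the fixed reporting window


def observed_sales(daily_rate, release_day):
    """Copies counted inside the window for a book released on release_day (0-based)."""
    days_on_sale = WINDOW_DAYS - release_day
    return daily_rate * days_on_sale


# Hypothetical catalogue: identical books, one released on each day of the window.
books = [observed_sales(1, day) for day in range(WINDOW_DAYS)]

# What every one of these books would sell if it had a full year on the market.
full_year_sales = 1 * WINDOW_DAYS

# Books that look like poor sellers only because they were released late.
under_100 = sum(1 for sales in books if sales < 100)

# The naive "multiply by 2" correction.
doubled = [2 * sales for sales in books]
mean_doubled = sum(doubled) / len(doubled)
under_100_after_doubling = sum(1 for sales in doubled if sales < 100)

print("full-year sales per book:", full_year_sales)
print("books under 100 copies in the window:", under_100)
print("mean after doubling:", mean_doubled)
print("range after doubling:", min(doubled), "to", max(doubled))
print("books still under 100 after doubling:", under_100_after_doubling)
```

In this toy setup, doubling brings the average close to the full-year figure, but the individual books still range from almost nothing to about twice their true annual sales, so any "X% of books sell fewer than N copies" statistic stays distorted.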


Not about sales data, per this article, but bibliographic metadata. Check out POSI, the Principles for Open Scholarly Infrastructure:

https://openscholarlyinfrastructure.org/

There are people dedicated to open metadata and open systems to work with it.

https://openscholarlyinfrastructure.org/posse/

(I work at Crossref)


The book data that I wish was more accessible was bibliographic data. I wish there was a cheap ISBN API (cheap enough for an individual to afford) that I could use to look up all of the data for my book from just a barcode scan. I know there are some API providers for that, but the plans are clearly meant for big users and not for someone who just wants to use it a couple hundred times.

This would be something the Library of Congress should run, or maybe one of the university library consortia, like the formerly-named Committee on Institutional Cooperation (which has renamed itself the 'Big Ten Academic Alliance', because football - https://btaa.org/library/Libraries )


The Open Library project https://openlibrary.org/developers has a free ISBN API. I used it along with tesseract OCR and a webcam to build a database of my physical books (218 total). I never had any rate limiting issues.
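For anyone who wants to try this, here is a minimal sketch of the lookup step only (no OCR or webcam, and not my original script). It assumes Open Library's `https://openlibrary.org/isbn/<isbn>.json` endpoint; the helper names `normalize_isbn`, `is_valid_isbn13`, `isbn_url`, and `fetch_book` are made up for this example. The check-digit test catches most misreads from a barcode scan before you hit the network.

```python
import json
import urllib.request

# Assumed endpoint pattern for Open Library's per-ISBN JSON record.
OPEN_LIBRARY_ISBN_URL = "https://openlibrary.org/isbn/{isbn}.json"


def normalize_isbn(raw):
    """Strip hyphens, spaces, and other separators from a scanned ISBN."""
    return "".join(ch for ch in raw if ch.isdigit())


def is_valid_isbn13(isbn):
    """Check the ISBN-13 check digit (weights alternate 1, 3, 1, 3, ...)."""
    if len(isbn) != 13 or not isbn.isdigit():
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(isbn))
    return total % 10 == 0


def isbn_url(raw):
    """Build the lookup URL, rejecting ISBNs that fail the check digit."""
    isbn = normalize_isbn(raw)
    if not is_valid_isbn13(isbn):
        raise ValueError(f"not a valid ISBN-13: {raw!r}")
    return OPEN_LIBRARY_ISBN_URL.format(isbn=isbn)


def fetch_book(raw, timeout=10):
    """Fetch the edition record as a dict (makes a network request)."""
    with urllib.request.urlopen(isbn_url(raw), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Builds the URL only; call fetch_book(...) to actually hit the API.
    print(isbn_url("978-0-14-032872-1"))
```

Calling `fetch_book("978-0-14-032872-1")` should return the edition record as a dict; which fields are present varies by book, so treat every key as optional.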


The dominant force in that department has been the OCLC.

I submitted an article yesterday about its own power-grab regarding bibliographic metadata, "Let the Metadata Wars Begin":

<https://scholarlykitchen.sspnet.org/2022/06/22/oclc-sues-cla...>

<https://news.ycombinator.com/item?id=33556442>

That lists a number of resources:

National Libraries <https://www.dnb.de/EN/Ueber-uns/Presse/ArchivPM2015/metadate...> (linked data: <https://www.loc.gov/item/lcwaN0018834/>)

Harvard <https://hangingtogether.org/harvard-bibliographic-data-relea...>

MetaDoor <https://meli.org.il/wp-content/uploads/2021/12/Chani-Yehuda_...> (from Clarivate, subject of the lawsuit headlining this article)

I'm also aware of the Internet Archive's Open Library (initiated by Aaron Swartz, see below), and some Wikidata efforts based largely on ISBN. And of course the almost wholly useless HathiTrust.

On Open Library data: <https://openlibrary.org/help/faq/using>

Wikipedia Book Sources: <https://en.wikipedia.org/wiki/Special:BookSources/>

There's also the lawsuit by OCLC against Clarivate:

<https://www.infodocket.com/2022/06/15/oclc-files-lawsuit-aga...>

And ... searching "OCLC" in the HN archives turns up numerous other references, including Aaron Swartz (miss you, guy), "Stealing your Library: The OCLC Powergrab":

<https://web.archive.org/web/20081218092812/www.aaronsw.com/w...>

<https://news.ycombinator.com/item?id=362769> (2008)

Algolia search for "OCLC" on HN: <https://hn.algolia.com/?q=oclc>


yeah. why i left the "library industry" and now work 'for the man'.

the library industry should have embraced open source in the 90s but they just never "got it". they seem to think they just need to be involved in some hyper expensive vendor projects and somehow that will bring them validation.

i worked in this tiny library with a few tens of thousands of books and they were paying for a system that used oracle as the database. so basically they were paying for oracle. this was 20 years ago.

then there is JSTOR and the whole Aaron Swartz thing. JSTOR acted really inappropriately

like i feel really bad about what has happened to libraries over the past 20 years, with funding cut to the bone. but they kind of did it to themselves by thinking that "serving the public" means shoveling the public's money to proprietary vendors for no apparent reason. like OCLC has no right to take information generated by public institutions that are almost entirely funded by local taxpayers and somehow claim ownership of that information, and act like a monopolistic for-profit corporation.

there are a lot of very innovative library people doing stuff like maker spaces and kids education despite all the hardships but.... this revolution has not made it to the 'library industry'.


NB: JSTOR, to its credit (Larry Lessig says "great credit": <https://lessig.tumblr.com/post/40347463044/prosecutor-as-bul...>), didn't pursue prosecution of Swartz. M.I.T., however, certainly did (also noted by Lessig). From Abelson's report, commissioned by MIT:

If the Review Panel is forced to highlight just one issue for reflection, we would choose to look to the MIT administration’s maintenance of a “neutral” hands-off attitude that regarded the prosecution as a legal dispute to which it was not a party. This attitude was complemented by the MIT community’s apparent lack of attention to the ruinous collision of hacker ethics, open-source ideals, questionable laws, and aggressive prosecutions that was playing out in its midst. As a case study, this is a textbook example of the very controversies where the world seeks MIT’s insight and leadership. A friend of Aaron Swartz stressed in one of our interviews that MIT will continue to be at the cutting edge in information technology and, in today’s world, challenges like those presented in Aaron Swartz’s case will arise again and again. With that realization, “Neutrality on these cases is an incoherent stance. It’s not the right choice for a tough leader or a moral leader.”

<http://swartz-report.mit.edu/docs/report-to-the-president.pd...>

pp. 100-101

And the US DoJ and courts have bloody hands.

The careers of Ortiz and Heymann (DoJ) have suffered somewhat: Heymann left the DoJ; Ortiz's ambitions for higher office (reputedly she'd had interest in the Mass. governorship) were thwarted. The judge remains on the US District Court of Massachusetts.


JSTOR went apeshit internally and reported Swartz to the authorities. like they are the ones who got the ball rolling. and they didn't have to.

maybe they decided to do the right thing later but it was too little too late.

this is why SciHub exists.... there has never been a netflix or spotify for academic research and probably never will.


While I agree with the general argument in the article that sales data and similar metrics should be public, I think there's a lot more that can be done to unlock all of the knowledge stored in books. There are vast amounts of knowledge that humanity has built up over centuries that are either hard to find or hard to access unless you know where to look. How does someone like me discover that knowledge for a topic I'm interested in?

I wrote a book that was recently published to help junior and mid-level programmers build up their soft skills to advance their careers[0]. The book was published by Holloway[1]. They have an interesting platform to solve this problem, which is why I chose to publish with them. They publish works primarily through their online reader, which is indexable by search engines. So someone searching for "How to get up to speed on a new codebase" in their preferred search engine could stumble across the chapter titled "How to read unfamiliar code"[2] and read a free preview of the book. Over time, people can discover and access the knowledge stored in any book that is published on Holloway's platform.

Another nice side effect of the platform is that it can be updated over time, so outdated knowledge or content can be revised, updated, and re-indexed by the search engines as knowledge about topics evolves.

If you're considering writing a book, or have a manuscript and are looking for a publisher, I'd recommend giving Holloway a look to see if it would be a good fit.

[0]: https://www.holloway.com/b/junior-to-senior

[1]: https://www.holloway.com/

[2]: https://www.holloway.com/g/junior-to-senior/sections/how-to-...


One of the bigger complaints about AI art generation is that "oh, it'll become a closed loop system, and then we'll all be sitting at our chairs watching a neural network spew art all day while human artists starve to death". This is kind of funny because, if Public Books' article here is even remotely true, the existing publishing system is already a closed loop. Publishers only commission or purchase works that match the particular taste profiles that are already trained into the sales data. If you want to make something new, the publishing companies have already boycotted and cancelled you.


There is an out, which is self-publishing. That's now very cheap (read: you can do it yourself for free).

You don't get the publisher money, but it's an option available to you, and you have roughly the same chance (read: zero) of striking it rich and becoming popular.



