Correctly reading the data from XLSX is a lot more complex than described or implemented here, mostly because Excel is so robust in reading files and there are many sloppy writers. If you're interested, there's a many-thousand page ECMA-376 specification: https://www.ecma-international.org/publications/standards/Ec...
- to correctly get the first worksheet, you actually need to parse the workbook.xml file and look into the sheets array to find the corresponding relationship IDs. This is explained in section 18.2.20 (page 1579 in the part 1 PDF). iOS Numbers used to write worksheets in the opposite order, which messes up the naive attempt to read the relationships file in order.
- the attribute "s" in a cell is an index into the styles table, while the cell type "s" corresponds to the shared string table. If you're curious, it's in section 18.1.3.4 (page 1604 in the part 1 PDF)
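The first point above can be sketched in a few lines. This is an illustrative Python sketch (not SheetJS's actual implementation) of resolving the first worksheet's path by going through workbook.xml and the relationships file, assuming a well-formed .xlsx; real-world writers are far sloppier than this handles.

```python
# Resolve the first worksheet's part path from an .xlsx package.
# Step 1: workbook.xml lists sheets in display order, each carrying an r:id.
# Step 2: xl/_rels/workbook.xml.rels maps that r:id to the worksheet part.
import zipfile
import xml.etree.ElementTree as ET

MAIN_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"
REL_ATTR = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}id"
PKG_REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"

def first_worksheet_path(xlsx):
    """Return the zip entry path of the first sheet in display order."""
    with zipfile.ZipFile(xlsx) as z:
        wb = ET.fromstring(z.read("xl/workbook.xml"))
        first_sheet = wb.find(f"{MAIN_NS}sheets/{MAIN_NS}sheet")
        rid = first_sheet.get(REL_ATTR)
        rels = ET.fromstring(z.read("xl/_rels/workbook.xml.rels"))
        for rel in rels.findall(f"{PKG_REL_NS}Relationship"):
            if rel.get("Id") == rid:
                return "xl/" + rel.get("Target")
```

Note how the Numbers case falls out naturally: even if the rels file lists the worksheets in reverse order, the lookup is keyed by r:id rather than position.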
PS: We build and maintain parsers and writers for spreadsheets in JavaScript (https://github.com/SheetJS/js-xlsx/ is our most popular project), including a CLI script to convert files to CSV. Some of our users use JSC in the context of Swift applications, ingesting data in JS and returning a CSV for further processing in Swift.
Honestly, I'd love to see if there are organizational tips on managing (and using!) a document that large.
I feel like technical writing is the closest to coding we get in plain language, but there are still critical differences: technical writing gives instructions to a human, while coding gives instructions to a machine while also providing context and description to a human. Specs cross this line a bit more: they give descriptions to a human with the intention of those descriptions becoming instructions.
Unfortunately, while I can find good code and bad code, I tend to find bad specs and WORSE specs. I can see progress (the various HTML5 and related specs are vastly better than previous versions, for example), but anytime I go in with a question (which is admittedly rare) I spend a lot of time separating the salient part from the related-but-missing-the-vital-piece parts. That's exactly the most common problem in maintainable code, too: making it easy to know not only HOW, but WHERE.
Are there lessons from specs we can learn? Do the good ones have some sort of "concept" section that makes the reading of it easier? Does each subsection do that?
When I worked at Boeing, technical specs were built in a giant object/hierarchical database. There would be objects for Requirements, and in the design you could link Systems to the Requirements that they implemented. Then you could trace back from the design to see what requirements weren't implemented, or what systems had no apparent purpose, or who edited a particular requirement last.
When they needed a hard copy of The Spec, they could export the hierarchy of requirements (automatically numbered, of course) as HTML or MSWord or whatever.
It was, of course, a giant nearly-unreadable mess, like "Req 1.2.3.4.5: The Foo system shall have a Bar module.", and then 25 more sub-requirements that all start out with "The Foo module's Bar subsystem shall ___" -- and made no sense unless you had just read the previous 30 pages, anyway.
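The traceability idea described above can be illustrated with a toy sketch. All names here are hypothetical, purely to show the two queries the comment mentions (uncovered requirements, systems with no apparent purpose); the actual tooling was a proprietary database.

```python
# Toy requirements-traceability model: requirements are keyed objects,
# systems link to the requirement IDs they implement. Either direction
# can then be queried mechanically.
requirements = {"R1.2.3": "The Foo system shall have a Bar module."}
systems = {
    "Foo/Bar": {"implements": ["R1.2.3"]},
    "Foo/Baz": {"implements": []},  # no linked requirement
}

def unimplemented_requirements():
    """Requirements no system claims to implement."""
    covered = {r for s in systems.values() for r in s["implements"]}
    return sorted(set(requirements) - covered)

def orphan_systems():
    """Systems that trace back to no requirement at all."""
    return sorted(name for name, s in systems.items() if not s["implements"])
```

The value of the scheme is exactly these cross-checks, independent of how unreadable the exported hard copy ends up being.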
It was not a good system, but it had hints of a good system in it. I think it was a good concept but the UI was terrible and people didn't seem to take much care when working on it.
The advice is that it should only exist in machine-readable form and the reason why Microsoft does it this way is to ensure lock-in. See also MSSQL, for which there exists no complete machine-readable spec ANYWHERE (as far as I could tell when I spent 2 days looking a few years ago).
A company that spends millions of dollars employing technical writers to publicly document a format probably isn't conspiring to keep that format secret. Maybe the macaronis aren't the shape you wanted but you got the macaronis.
IIRC it was in response to many government sources requiring open formats (a good instinct) so just because they did it doesn’t mean they wanted to. They may have been forced to, and done the minimum as a result.
I believe it was the antitrust lawsuits that forced them to, and it shows; if you look at all the MS protocol/format documents and compare them to something like RFCs and ITU/IEEE/ANSI standards, the MS docs are noticeably harder to read, with their verbosity, weird syntax notations, and conventions; they almost look as if they were deliberately obfuscated.
I wrote an XLSX (spreadsheet) writer in golang a few years ago and still maintain it. I also deal with a bunch of other several-hundred to thousand-page docs semi-regularly, and the best advice I have is to make use of the index, bookmark pages, and do a ton of cmd+f searching for various keywords.
I was talking with a lawyer friend the other day who confirmed a good index is really a must for long documents.
In my experience, a good index doesn't just show where any usage is, it also covers where it is significant (bolding, sometimes subcategorizing).
In the big texts (granted, for me these are almost always RPG books, but I've read/used a lot of those) that difference is essential, and the difference between the good ones (e.g. GURPS) and the bad ones (most WW books... the ones that HAVE indexes) is very noticeable and definitely impacts not just ease-of-use but effectiveness-of-use. Being able to really get/refresh uses of different bits has a direct impact on whether I apply those concepts consistently and correctly or whether I do something that works well enough.
This definitely describes what happens when changing code too, but we don't have that same option for "significant usage". We can get every usage, usage numbers counted by file, and definitions, but not when a use is significant. We rely on tools to get those numbers, because any kind of manually supported index (of code) is doomed to failure, much as most texts fail to have indexes, or at least fail to have very useful ones.
I wonder if that "significant usage" is something we can do something about. What does that even mean? I've been looking at code linguistically a lot and I definitely can see how we could use syntax to better indicate the focus of code (vs. trivial but necessary side bits) -- is there a way to mechanically note that? It would still require coders to write that way, but frankly I feel we need better (or at least clearer) best practices on that front anyway. If we expand our lexicon of constructs, and mechanically make use of it, it becomes testable, reliable, and still more communicative than the current "this feels good to me, therefore I declare the code 'more readable'" approach.
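As one possible shape of "mechanically noting" significance: suppose a codebase adopted a lightweight comment marker for load-bearing call sites. This is a hypothetical convention invented for illustration, not an existing tool, but it shows how such an index would then be buildable with trivial machinery.

```python
# Hypothetical convention: load-bearing call sites carry a "# !sig"
# comment; an indexer can then separate significant usages from noise.
def significant_usages(lines, name):
    """Return (line_no, text) pairs where `name` appears on a line
    carrying the significance marker."""
    return [(i, ln.strip()) for i, ln in enumerate(lines, 1)
            if name in ln and "# !sig" in ln]

src = [
    "total = accumulate(xs)  # !sig core rollup",
    "log.debug(accumulate(xs))",
]
```

Plain grep finds both lines; the marker lets the index distinguish the rollup that matters from the debug logging that doesn't.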
> Are there lessons from specs we can learn? Do the good ones have some sort of "concept" section that makes the reading of it easier? Does each subsection do that?
Look at the OpenDocument Format. It does the same as Office Open XML (the Microsoft one) in only a few hundred pages.
It is quite a bit smaller than OOXML. OOXML is about 5-6 times the size of ODF. There are three reasons for this.
1. ODF has cleaner and leaner markup. OOXML is uglier and more verbose, leading to more things that the spec has to document.
2. The OOXML spec goes into more detail for a lot of things that the two have in common.
3. The OOXML spec seems to have a lot more introductory or primer material.
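To illustrate point 1, here is roughly how a single string cell looks in each format (simplified fragments, namespace declarations omitted):

```xml
<!-- OOXML (SpreadsheetML): cell type "s" indexes into a shared string table -->
<c r="A1" t="s"><v>0</v></c>

<!-- ODF: the value lives inline in the cell -->
<table:table-cell office:value-type="string"><text:p>Hello</text:p></table:table-cell>
```

OOXML's cell markup is terser here, but its indirection through shared strings, styles, and relationships is what multiplies the surface area the spec has to document.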
The impression I got from looking at both, but not actually trying to use either of them, is that if I had to implement a full featured office suite solely from the spec, with no reference implementations or example documents, it would be easier if I picked ODF, but would probably have better interoperability if I picked OOXML. The strongest impression I got, from both of them, though, is that there is no freaking way I want to implement an office suite!
I appreciate such detailed feedback, thank you! Parsing of workbook.xml is implemented in the library itself, in the `parseWorksheetPaths` function. I mentioned it briefly in the article, but decided to omit the details as the main focus was on Codable protocols.
I will definitely update the "s" attribute parsing to have a more sensible name. Will also link to the standard from the README file, although not sure that will help with a document of this size.
I laughed when I started reading this comment and then looked at your username. I've used your "js-xlsx" library and stepped through quite a bit of the code. I still can't really understand how you began to write that library, and I'm curious how you approach reading the Open XML documentation. Do you have a large team of engineers maintaining that codebase?
Jokes aside, I can't fathom having to consult a thousand page manual to deal with this stuff, I can barely read the README.md for a framework that I want to include in my project.