
Article's first sentence: "With the advent of large language model-based artificial intelligence, semantic HTML is more important now than ever."

I think the sentence "With the advent of large language model-based artificial intelligence, semantic HTML is less important now than ever." is far more defensible. The semantic web has failed and what replaced it was Google spending a crap ton of money writing a variety of heuristics equipped with best-of-breed-at-the-time AI behind it. As AI improves, it improves its ability to extract information from any ol' slop, and if "any ol' slop" is enough, it's all the effort people are going to put out. Eventually in principle both the semantic web and that pile of heuristics are entirely replaced by AI.

(Note also my replacement of LLM with the general term AI after my first usage of LLM. LLMs are not the whole of "AI". They are merely a hot branch right now, but they are not the last hot branch. It is not a good idea to project out the next several decades on the assumption that LLMs will be the last word in AI.)



Are you suggesting that AI will solve web accessibility, which is based on semantic HTML and ARIA? Because if not, humans will still be required to ensure that web content is accessible, and in that case semantic HTML remains important.


Actually, that sounds like one of the better startup ideas I've heard around AI. Automated accessibility compliance (or something close to it) would be very useful and definitely something people would pay money for.

I fear LLMs are only about 80% up to the task, though, which is actually a very unpleasant place to be in that curve; sort of the moral equivalent of the uncanny valley. Whatever comes after LLMs though, I bet they could do it, or get very close.


80% sounds way too optimistic to me. The problem is that screen readers (and other assistive technologies) have bugs and different behaviors, and some people use older versions of those tools with even more bugs and quirks. The only way to make sure that a website has a high level of accessibility is to perform manual testing in different environments. I don’t see how AI can solve this problem. And the people who perform the manual testing need to be experts in semantic HTML and ARIA to be able to identify problems and create reports. That means that semantic HTML remains important.


>80% sounds way too optimistic to me. The problem is that screen readers (and other assistive technology) have bugs and different behaviors, and some people use older versions of those tools with even more bugs and quirks. The only way to make sure that a website has a high level of accessibility is to perform manual testing in different environments.

That's if you want actual accessibility support on a wide range of old and new devices.

But the business idea the parent proposes is automated accessibility for compliance, which is the real thing that could be sold, and has a much lower bar.


My estimate of 80% assumed 6-12 months of serious development first, and a certain amount of budget for manual intervention on the first several dozen jobs. Certainly just flinging HTML at ChatGPT as it stands today would do nothing useful at all. Providing manual testing could easily be done as part of a higher service plan. Not only is there no rule that a startup using AI has to take the form of "throw it at the AI and then steadfastly refuse to do anything else"; that's probably a pretty good way of filtering the ones that will make it from the ones that won't.

Do assistive technologies have more "bugs" and "quirks" and "different behaviors" than natural text? I don't really think so. In fact I'd expect they have qualitatively fewer such things.

Semantic HTML would be important in this case... but it would be important as the output, not the input.

This hypothetical startup could also pivot into developing a better screen reader fairly easily once they had built this, but there would be a good few years where an AI chewing on the HTML and HTML templates in use by a server would be practical while a local model still wouldn't: you can't expect every assistive-technology user to have a 64GB GPU to run it. Certainly that would factor into my pitch deck, though.

I'd give more credence to the "it has to be perfect to be useful at all" argument you're going with here if it weren't that I'm pretty sure every user of such technology is already encountering a whole bunch of suboptimal behavior on almost every site today.


An LLM could transform "bad HTML" into good HTML: add ARIA tags, add image captions, etc.
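A rough sketch of what that transform could look like: an HTML pass that fills in missing alt text, with the captioning step stubbed out. All the names here are illustrative; a real pipeline would call a vision/LLM model at the stub.

```python
from html.parser import HTMLParser

def caption_image(src):
    # Stub for a captioning model: just derive text from the filename.
    # A real pipeline would send the actual image to a vision/LLM model.
    return src.rsplit("/", 1)[-1].rsplit(".", 1)[0].replace("-", " ")

class AltTextInjector(HTMLParser):
    """Re-emits the HTML, adding alt text to <img> tags that lack it."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "img" and "alt" not in dict(attrs):
            attrs = attrs + [("alt", caption_image(dict(attrs).get("src", "")))]
        rendered = "".join(f' {k}="{v or ""}"' for k, v in attrs)
        self.out.append(f"<{tag}{rendered}>")

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)  # treat <img .../> like <img ...>

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

def add_alt_text(html):
    parser = AltTextInjector()
    parser.feed(html)
    return "".join(parser.out)
```

Note that a parser round-trip like this is lossy (comments, entities, and doctype handling are glossed over); it's meant to show the shape of the pipeline, not a production rewriter.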


Unless it’s 100% reliable or near 100% reliable, you’d still need manual testing. Right now, automatic accessibility testing can’t even detect most accessibility issues. So we haven’t even reached the stage where all issues are detected by tools, and probably never will. Fixing all issues automatically is significantly harder than detecting them.
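For a sense of why tools catch so little: the mechanically checkable rules are a small subset of WCAG. A toy linter (the rules and names are chosen purely for illustration, not taken from any real checker) might look like:

```python
from html.parser import HTMLParser

class A11yLint(HTMLParser):
    """Flags a few mechanically detectable issues. Most WCAG criteria
    (meaningful alt text, focus order, contrast in context, ...) still
    need human judgment, which is where manual testing comes in."""

    def __init__(self):
        super().__init__()
        self.issues = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and not attrs.get("alt"):
            self.issues.append("img missing alt text")
        if tag == "html" and not attrs.get("lang"):
            self.issues.append("html element missing lang attribute")
        if tag == "a" and not attrs.get("href"):
            self.issues.append("anchor without href (not keyboard-reachable)")

def lint(html):
    parser = A11yLint()
    parser.feed(html)
    return parser.issues
```

Checks like these only establish the presence of attributes, not whether the alt text actually describes the image, which is exactly the gap being discussed.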


Given how bad accessibility is, it seems like even something imperfect could be a big leap forward for a lot of sites.


>Unless it’s 100% reliable or near 100% reliable, you’d still need manual testing.

Not unless:

(a) it's X% reliable now, and it would be Y% < X% if done via LLMs.

(b) businesses actually care for increased reliability, and not just for passing the accessibility requirements.

Most businesses couldn't care less, and don't do "manual testing" today either; they just add the token required tags. That's true even when they do business with the government (which mandates accessibility even more strictly).

LLM-driven accessibility info would be an improvement.


The idea generalizes. Imagine an archiver which applies a transform to a site: adding semantic markup, or censoring parts that someone finds offensive. If the original author agrees, they might offer an API so the transformation is linked to from the original. Or perhaps the transformer could make an agreement with (or fool) Google into linking to their version rather than the original, perhaps because it's "safer".

Oh yes, a great startup idea.


Someone's on it already (but maybe there's room for competition, if https://adrianroselli.com/2020/06/accessibe-will-get-you-sue... is any indication): https://accessibe.com/accesswidget/artificial-intelligence


An LLM-based accessible browser could render 80% of the Web accessible at once, if the tech works.


If semantic HTML is important both for accessibility and for letting software parse information out of pages, and AI solves the latter, then semantic HTML is now less important, because some of the use cases that previously needed it no longer do. If you take "less important" as a moral/value statement instead of in terms of total utility provided, and assume that AI will have zero accessibility benefits, it will merely be as important as today, which is still at odds with the original article's assertion that it would become more important. N.B. that zero-benefit assumption seems doubtful, given how e.g. you can now paste a bunch of code into an LLM and ask it questions quite naturally -- something I can easily see adapted to e.g. better navigating apps using only voice and screen readers.


This has been tried and doesn't work, which doesn't mean it will never work in the future! There are a few companies offering solutions in this space, but they don't work, are often worse than the problems they're trying to solve, and are a privacy disaster. The companies peddling them often engage in shady business practices, like falsely claiming that their overlays can protect you from ADA lawsuits[1], while actually suing the people who expose their lies[2]. Most accessibility practitioners and disabled users themselves are warning the public to avoid those tools[3].

[1] https://adrianroselli.com/2020/06/accessibe-will-get-you-sue... [2] https://adrianroselli.com/2023/05/audioeye-is-suing-me.html [3] https://overlayfactsheet.com


AI will solve web accessibility via screen readers that summarize visual content, ignoring ARIA and making it irrelevant. Multimodal GPT-4 can take a screenshot JPEG and answer questions about what's in it (buttons, links, ads, headers, etc.). The future of accessibility is rendering the DOM to a JPEG and asking GPT to be your eyes; we'll look back on semantic markup as a failed idea that was never going to work.


I am curious about what post-LLM SEO is going to look like.

> The semantic web has failed and what replaced it was Google spending a crap ton of money writing a variety of heuristics equipped with best-of-breed-at-the-time AI behind it.

Arguably, there were insufficient incentives to fully adopt semantic HTML, if your goal was just to have the most relevant parts of your content indexed well enough to get ranked.

> As AI improves, it improves its ability to extract information from any ol' slop, and if "any ol' slop" is enough, it's all the effort people are going to put out.

If the goalpost shifts from “getting ranked” to “enabling LLMs to maximally extract the nuance and texture of your content”, perhaps there will be greater incentive to use elements like <details> or <progress>. Websites which do so will have more influence over the outputs of LLMs.

Feels like the difference between being loud enough to be heard vs. being clear enough to be understood.


> The semantic web has failed and what replaced it was Google spending a crap ton of money

Aren't schema.org and Wikidata/Wikipedia still powering most of Google's rich search results?

I heard them announce the new results page with Bard, but I probably haven't seen it because of ad-blindness, or it's not yet released in my location; I'll have to look this up...


>Aren't schema.org and Wikidata/Wikipedia still powering most of Google's rich search results?

Were they ever?


Well, schema.org here wasn't referring to the organization or entity, but to its published schemas. I'd argue these were and are driving a lot of rich results, especially for local businesses.
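For concreteness, those schemas are typically embedded in pages as JSON-LD; a minimal, entirely made-up LocalBusiness example of the kind that feeds rich results:

```python
import json

# All values here are invented; the vocabulary (@type, PostalAddress,
# openingHours, ...) comes from schema.org.
local_business = {
    "@context": "https://schema.org",
    "@type": "LocalBusiness",
    "name": "Example Coffee Roasters",
    "address": {
        "@type": "PostalAddress",
        "streetAddress": "123 Main St",
        "addressLocality": "Springfield",
    },
    "telephone": "+1-555-0100",
    "openingHours": "Mo-Fr 08:00-17:00",
}

# Embedded in a page as a JSON-LD script block:
snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(local_business, indent=2)
    + "\n</script>"
)
```

This is the kind of markup a crawler can consume without any heuristics at all, which is part of why it survived while the broader semantic web didn't.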


Yes.


AIs are magic to me. I've always thought the pattern-recognition ability of humans was pretty unique and hard to replicate; we use it when scanning the slop on websites to do some kind of data extraction. In my head I was part of the semantic web camp, but you're right: if machines can seemingly make sense of the slop, then why bother?


agree so much. Projects that aim to build a data resource and then let AI use that resource are missing the point. The AI is the data resource.

Some projects claim that knowledge graphs or other data assets can help the AI retrieve 'true' knowledge. Personally, I believe the better approach is to develop methods that allow AIs to create their own data assets; the weights in their networks are one of those assets.

The question of truth is still a very hard one. How do you tell an AI that some knowledge is more trustworthy than other knowledge? People have this issue too though.


While the issue of "truth" is interesting and important, it is also fairly orthogonal to the task of simply extracting what a given page or bit of content claims. (Perhaps not 100% orthogonal in the absolute limit, but generally so.)

As absolutely hard as I have gone against the semantic web community at times over the past few years, I do not in the slightest hold a failure to "determine truth" against them. I consider them to have been tilting at windmills as it is; criticizing them for failing to conquer that windmill, which humanity has been jousting with since the dawn of recorded history (and probably beyond), would be a degree of cruelty I couldn't entertain. :)


If you're relying on a stochastic process like network weights to encode truth then I have some oil to sell you.


What projects are trying to use knowledge graphs to retrieve truth? I've been playing around with that approach. How do you encode your own "truth" that may be different from another's?


> The semantic web has failed

literally by no metric is this true other than tech bros saying it on HN. The entire internet is powered by websites using semantic markup and clients querying it.



