It is more than the spam war... a lot of content is just no longer produced and exposed to the open web (think about how much content goes into TikTok, Discord, etc., and you will never get that into your search results). Google has less useful content to index, and algorithms can't fix that. There is more spam only because spam is the only open content still being added at scale.
The winners of this battle will be the places where content is generated (or curated) - and Reddit is perhaps the most important content hub generator (it's not just an aggregator anymore - comments about some news are often more interesting than the news itself). Indexers (and language models) are useless without content to scrape.
I can't believe that. There are still so many personal blogs by real people. They just don't show up in the first 3 pages unless you query for them very specifically.
Seems most queries result in something like 30% SEO spam sites, 30% quora, 30% reddit, 10% other.
Edit: I don't disagree that Discord/YouTube/other closed gardens have taken open searchable data away, but it's not like there's no authentic searchable data left at all. Perhaps Google also needs to learn to search those closed gardens better.
Google flourished because it could find forums (and blogs) and mine those, but much of that content has disappeared into Facebook and Discord (and YouTube - we must not discount how many things that would have been easily parseable blogs are now buried in livestreams and videos).
Discord is probably the worst of all. I'm not a gamer, and I hate how much tech content is now locked behind private Discord channels. Even Facebook is more discoverable than that.
Even when you are already on Discord, searching and trying to read old conversations is awful, because that's not at all what Discord was made to do.
So I've been working on a side project to make the content of a YouTube channel I watch more discoverable as text. I've had great results by scraping the YouTube transcription and running it through a few passes of GPT-3.5 with prompts telling it to essentially act as an editor. The original transcription was often terrible in spots - whole phrases or multiple words mistranscribed throughout. In almost all cases, GPT-3.5 was able to clean them up and restore the original meaning by understanding the context of the monologue and fixing obviously incorrect words or phrases.
I've watched through a sample of about 20 of the 3,000 videos I'm working through, and the corrected transcription really did an amazing job of restoring the original meaning of spoken words that were hard to understand in the original machine transcription.
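The commenter doesn't share any code, so here's a minimal sketch of what that kind of pipeline might look like. The package choices (`youtube-transcript-api`, the `openai` client), the prompt wording, and all function names are my assumptions, not the commenter's actual setup:

```python
# Hypothetical sketch of a transcript-cleanup pipeline: fetch the
# auto-generated YouTube transcript, split it into chunks that fit in
# one GPT-3.5 editing pass, and have the model fix mistranscriptions.
# Requires: pip install youtube-transcript-api openai

def chunk_transcript(segments, max_chars=3000):
    """Join raw transcript segments into chunks small enough for one
    editing pass, splitting only at segment boundaries."""
    chunks, current, size = [], [], 0
    for seg in segments:
        text = seg["text"].strip()
        if size + len(text) > max_chars and current:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(text)
        size += len(text) + 1  # +1 for the joining space
    if current:
        chunks.append(" ".join(current))
    return chunks

# Invented prompt, for illustration only.
EDITOR_PROMPT = (
    "You are an editor. The following is an auto-generated YouTube "
    "transcript with mistranscribed words. Fix obvious transcription "
    "errors using context, without changing the speaker's meaning."
)

def clean_transcript(video_id, client):
    """Fetch a video's auto-transcript and run each chunk through the model."""
    from youtube_transcript_api import YouTubeTranscriptApi
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    cleaned = []
    for chunk in chunk_transcript(segments):
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": EDITOR_PROMPT},
                {"role": "user", "content": chunk},
            ],
        )
        cleaned.append(resp.choices[0].message.content)
    return "\n".join(cleaned)
```

Usage would be something like `clean_transcript("dQw4w9WgXcQ", OpenAI())`; chunking at segment boundaries keeps each pass within the model's context window without cutting sentences in half.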
That is exactly where LLMs are useful. (People thinking of them as "AI", meaning AGI, are just so wrong. Writing legal briefs??) Using them ex post facto to fix up transcripts and make them available and searchable is great.
>we must not discount how many things that would have been easily parseable blogs are now buried in livestreams and videos
on the flip side, would that content have been created at all if its creators weren't financially motivated by streaming/video to produce it?
there's a lot of discussion here about internet communities, but this comment raises the question of why blogs started to die down to begin with. At least with Reddit you get clout if you share stuff (useless clout, but sometimes you just want a pat on the back).
Blogs are parallel to research papers in a sense. They're useless without peer review unless you're already intimately familiar with the source material and able to critically evaluate the contents.
So blogs are more useful when they're aggregated through a site like Reddit, where users have already done the vetting on whether the linked page is valuable. Reddit comments add invaluable context to pages - noting when the content has become dated or inaccurate due to external changes, etc. Sites like Brian Krebs's blog are the exception, as the author is well known and respected. But general blogs? It takes time to earn that community respect.
Then beyond that, how often have you gone hunting for something obscure only to run across 3 or more blog pages that look entirely unique but have the exact same article pasted into them? It isn't that the contents are bad/wrong/inaccurate, but rather: who do you trust? How much effort are you going to put into finding which blog was the original, written by the expert, and which ones are bots copying the info?
>where users have already done the vetting on whether the linked page is valuable.
and ironically enough, if you post your own blog on Reddit to be critiqued, there's a good chance it gets removed for "self promotion". Funny how that "vetting" works, huh? So you're back to "how do I make my blog discoverable so it can be peer reviewed", and we're at square one again.
>How much effort are you going to put in to finding which blog was the original, written by the expert and which ones are bots copying the info?
A lot, if it's important. As it is, I already have to do that muckraking on Reddit to see who is trying to understand (or even read) the article and who just wants to soapbox their tangential pet rant. Tracing a source back is child's play in comparison.
For me YouTube is always on top: instead of the text pages where I can read the answer in a few seconds, Google pushes me their video platform, probably in the hope of making money. I am logged in, so I do not understand how those geniuses working at Google can think that videos in a language I do not know might be more relevant than text content.
> For me YouTube is always on top: instead of the text pages where I can read the answer in a few seconds, Google pushes me their video platform, probably in the hope of making money.
To be fair, I have the same problem with DuckDuckGo.
I wish I could blacklist sites from my search results. YouTube and Pinterest are not helpful for the things I look for.
How great is your wish? If you host your own instance of Whoogle, which proxies Google search results, you can set one of its environment variables to block particular websites from the results.
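As a sketch of what that looks like: I believe Whoogle reads a `WHOOGLE_CONFIG_BLOCK` environment variable holding a comma-separated list of domains to drop from results, but check the project's README for the current variable names before relying on this.

```shell
# Hedged sketch: run Whoogle with certain domains filtered out of results.
# The variable name is from memory of the Whoogle README; verify it upstream.
docker run -d -p 5000:5000 \
  -e WHOOGLE_CONFIG_BLOCK="youtube.com,pinterest.com" \
  benbusby/whoogle-search
```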
yeah, the GP really reads like it's regurgitating the notes of someone who attended an internal Googs meeting on why they are ranking news higher, repeated as a mantra
> a lot of content is just no longer produced and exposed to the open web (think about how much content goes into tiktok, discord, etc and you will never get that into your search results)
I see this all the time when trying to find information about old computers. So many of the good vintage computing resources are locked in social services or mailing lists that the information never shows up in search engines.
It feels a lot like the days when information was balkanized between AOL, GEnie, CompuServe, American PeopleLink, Delphi, etc.
Search engines were supposed to fix that and make all the world's information discoverable. They didn't.
There certainly is content - often content I could find two years ago but now cannot.
That's because the web is full of juvenile sub-normie content such as GeeksforGeeks (if you consider programming topics, for example). Highly SEO'd juvenile stuff shadows the very specific queries.