
> Yes, very good point. Self-driving maximalists who believe that self-driving will be solved with more data need to realize that ChatGPT was trained with ALL the data possible and is still deficient. This defect is probably inherent to existing neural net models and a leap forward of some sort is necessary to solve this.

This is the thing that bugs me about GPT-4, which everyone says is a lot better. Did they fix the underlying issues, or does it just have more data?

If it's the latter, that means if it's forced to operate outside of its "domain" it's going to produce rubbish again - and heaven knows where the limits of its current domain are.

These AIs need to not catastrophically fail if they are missing information.

IMHO in order for AI to be truly useful, we need to be able to trust it. I can't trust something that produces rubbish wherever it's out of its depth instead of just saying "I don't know."



I used GPT-4 for an interview problem from leetcode out of curiosity. It got it right, very quickly, yay!

Then I asked it to modify it by eliminating one of the constraints on the problem. It did a very convincing "Ah, if we need [that] we need to do [this]" and output a new version... that didn't actually work right.

I pointed out the specific edge case; it said "you are correct, for that sort of case we have to modify it" and then spit out exactly the same code as the last attempt.

The most interesting thing to me there isn't that it got it wrong - it's that spitting out exactly the same output without realizing it, while saying that you are going to do something different, is the clearest demonstration I've seen from it that it doesn't "understand" in human-like ways.

Extremely powerful and useful, but VERY important for users to know where it runs into the wall. Since it often won't tell you on its own.


These models are designed to produce a _plausible_ text output for a given prompt. Nothing more.

They are not designed to produce a _correct_ text output to a question or request, even if sometimes the output is correct. These proverbial stopped clocks might be correct more than twice a day, but that's just the huge training set speaking.


Are you taking the RLHF into account when you say so?


Well, I wasn't, but if you look at the topmost comment of this thread [0], you'll see that the level of human reinforcement being demonstrated only reinforces my point.

[0] https://news.ycombinator.com/item?id=36013017


Taking RLHF into account: it's not actually generating the most plausible completion, it's generating one that's worse.


Wow, reading this thread dispelled any doubt I might have had about the hedonic treadmill.

Can you imagine having this conversation a year ago? And already there are pronouncements all over this thread that the current problems are 'intrinsic' to the approach. I'm not as readily convinced that the improvement is slowing down. Regularization is a powerful thing.


I was confused by the term (see https://en.wikipedia.org/wiki/Hedonic_treadmill), but it refers to the concept of humans basically adapting to anything, so that the "new normal" can be an Overton Window away or more.

Couple that with some Corn Pone Opinions (Twain), constantly moving goal posts (a fallacy), and grand proclamations made without any evidence; all of that is proof that we are living in interesting times.

Not to be a fence-sitter, but things are moving so quickly that it is impossible to make predictions amid the current level of chaos. Anyone who makes predictions right now is suspect.


I didn't say anything about whether or not I expect it to get better (translation from English to code doesn't seem like an insurmountable task based on what these do so far), but I think that cuts both ways.

For every "leap to a conclusion that some things will never be fixed" there's a "leap to a conclusion that this is already some sort of more general intelligence it is."

And that's really key to my main point. The only way to avoid either of those is to actually use the things and see what does and doesn't work. That's a million times more interesting than just unrealistic hype or hate comments.


ChatGPT is quite good for known problems from before 2022, since those questions made it into the training set. It's quite bad for new interview questions, though.


I find GPT-4 to be very useful almost daily. I can often spot hallucinations quickly, and they are otherwise easy enough to verify. If I can get a single new perspective or piece of relevant information from an interaction with it, then that is very valuable.

It would be significantly more useful if it were more grounded in reality though… I agree with you there.


How do you know you spot the hallucinations, and that you're not just catching the less-good ones while accepting convincing half-truths? It may be that your subject is just that clear-cut, and you've been careful. But what I worry about is that people won't be, and will just accept the pretty-much-correct details that don't really matter that much, until they accrete into a mass of false knowledge, like the authoritative errors quoted in Isidore of Seville's encyclopedia and similar medieval works.


I think it's enormously useful as a tool paired with a human who has decent judgment. I think it would be useless on its own. I'm constantly impressed by how useful it is, but I'm also constantly mystified by people who claim to be getting this feeling of talking to a "real" intelligence; it doesn't feel that way to me at all.


On the contrary, the "hallucinations" are often very hard to spot without expert knowledge. The output is often plausible but wrong, as shown by Knuth's questions.


> IMHO in order for AI to be truly useful, we need to be able to trust it. I can't trust something that produces rubbish wherever it's out of its depth instead of just saying "I don't know."

I wholeheartedly agree. What we have now is a very capable and convincing liar.


> what we have now is a very capable and convincing liar.

I think things might get even wilder once companies start allowing advertisers to influence chat results like they do with search. Imagine a capable and convincing liar who has an ulterior motive when it talks to you.


It cannot tell the truth, because it does not have the context or understanding of what is true or false.

It is less a liar (who intends to mislead) and more a fantastic bullshitter who just talks and sounds convincing.


> IMHO in order for AI to be truly useful, we need to be able to trust it.

A common response to this by AI advocates is to point out that humans lie all the time; as long as the AI lies less than humans (debatable at this point anyway), it's an improvement.

I think what that forgets is the importance of context. We all know humans are perfectly capable of lying, but we don't generally expect that of software. If your compiler lied about your code being valid, I doubt the general response would be "meh, it's only done that once, I've lied far more than that"


> A common response to this by AI advocates is to point out that humans lie all the time

That’s true. But when someone lies frequently, we stop trusting them.


The other difference is that over time we build up a network of people we consider to be knowledgeable and honest. Current LLMs can never match that because their output is controlled guessing.


> A common response to this by AI advocates is to point out that humans lie all the time; as long as the AI lies less than humans (debatable at this point anyway), it's an improvement.

This is also Elon Musk's justification for self-driving cars: "They make fewer mistakes than humans and are therefore safer."

It's true that self-driving cars avoid many of the mistakes of human drivers, but they also invent whole new categories of fatal mistakes that humans rarely make. And that's why Musk's argument is garbage.


I don't even think they make fewer mistakes than humans, period: the comparisons usually count all driving, including driving done by incapacitated humans (drunk or extremely tired drivers cause the bulk of the "mistakes", but humans can, to some extent, control whether they drive in that state).


If the goal is to reduce the number of fatal mistakes, why is that argument garbage?


Because it's unacceptable to replace a perfectly good driver in control of their vehicle with a vehicle that might just randomly kill them.

Traffic accidents don't happen randomly at all. If you are not too tired, drunk or using any substances, and not speeding, your chances of causing a serious traffic accident are minuscule.

These are all things you can control (one way or another). You can also adjust your driving to how you are feeling (eg take extra looks around you when you are a bit tired).


This feels like the trolley problem applied at scale. Will you deploy a self-driving system that is perfect and stops all fatal accidents but kills one randomly selected person every day?


Nope: there is no moral justification to potentially kill a person not participating in the risky activity of driving just so we could have other people be driven around.

Would you sign up for such a system if you could volunteer to participate in it, with the random killings now restricted to those who've signed up for it, including you?

In all traffic accidents, there is some irresponsibility that led to the event, apart from natural disasters that couldn't have been predicted. A human or ten is always to blame.

Not to mention that the problems are hardly equivalent. For instance, a perfect system designed to stop all accidents would likely have crawled to a stop: stationary vehicles have pretty low chances of accidents. I can't think of anyone who would vote to increase their chances of dying without any say in it, and especially not as some computer-generated lottery.


> Would you sign up for such a system if you could volunteer to participate in it, with the random killings now restricted to those who've signed up for it, including you?

I mean, we already have. You volunteer to participate in a system where ~40k people die in the US every year by engaging in travel on public roadways. If self-driving reduces that to 10k, that's a win. You're not really making any sense.


But none of that is random.

Eg. NYC (population estimate 8.3M) had 273 fatalities in 2021 (easy to find full year numbers for): https://www.triallaw1.com/data-shows-2021-was-the-deadliest-...

USA (population estimate 335M) had 42,915 (estimated) according to https://www.nhtsa.gov/press-releases/early-estimate-2021-tra...

USA-wide rate is 1 in 7,800 people dying in traffic accidents yearly, whereas NYC has a rate of 1 in 30,000. I am sure it's even lower for subway riders vs drivers. Even drivers, somebody doing 4k miles a year has different chances than somebody doing 40k. People usually adapt their driving style after having kids which also reduces the chances of them being in a collision.
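
For what it's worth, the arithmetic behind those two rates is just population divided by fatalities. A quick sketch using the rough figures cited above (all of them estimates, not exact counts):

  // Rough per-capita traffic fatality rates from the figures cited above (all estimates).
  const nycFatalities = 273;          // NYC traffic deaths, 2021
  const nycPopulation = 8_300_000;    // NYC population estimate
  const usFatalities  = 42_915;       // US traffic deaths, 2021 (NHTSA early estimate)
  const usPopulation  = 335_000_000;  // US population estimate

  console.log(`NYC: 1 in ${Math.round(nycPopulation / nycFatalities)}`);  // ~1 in 30,000
  console.log(`US:  1 in ${Math.round(usPopulation / usFatalities)}`);    // ~1 in 7,800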

Basically, your life choices and circumstances influence your chances of dying in a traffic accident.

At the extreme, you can go live on a mountaintop, produce your own food and not have to get in contact with a vehicle at all (and some cultures even do).

FWIW, I responded to a rhetorical question about killings being random: they are not random today, even if there is a random element to them!

If you want to sign up for a completely random and expected chance of death that you can't influence at all, good luck! I don't.


In traffic incidents, human drivers are rarely held accountable. It is notoriously difficult to get a conviction for vehicular manslaughter. It is almost always ruled an accident, and insurance pays rather than the human at fault.

Traffic fatalities often kill others, not just the car occupants. Thus, if a self-driving system causes half as many fatalities as a human, shouldn't the moral imperative be to increase self-driving and eventually ban human driving?


> If you are not too tired, drunk or using any substances, and not speeding, your chances of causing a serious traffic accident are minuscule.

You realize that like.. other people exist, right?


You realize that I said "causing"?

For people to die in a traffic accident, there needs to be a traffic accident. Those are usually caused by impaired humans, who are therefore very often involved in them (basically, almost every accident has at least one party of the sort), whereas non-impaired people are involved far less often.

This is a discussion of chances and probabilities: not being impaired significantly reduces your chance of being in a traffic accident since being impaired significantly increases it. I am not sure what's unclear about that?


More importantly humans have ways to detect deception from other humans, be it through body language or other cues. With only text it is very hard to determine whether the model is lying to you or not.


Even in text, there is more context. For example, I am more likely to trust the wikipedia article about a deeply technical topic than an article about politics or a celebrity, because the technical article is far more likely to only be edited by people who are actually very knowledgeable on the topic, and there is very little incentive to lie (in general, there are exceptions).


> If your compiler lied about your code being valid, I doubt the general response would be "meh, it's only done that once, I've lied far more than that"

Any language with an unsound type system will do this occasionally. This probably includes a majority of all code being written today: C, Java, and TypeScript are all unsound.
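
To make that concrete, here's a classic illustrative sketch of TypeScript's unsoundness (my own example, not anything from the parent): arrays are treated covariantly, so the compiler reports this program as valid even though the last line fails at runtime.

  class Animal { name = "animal"; }
  class Dog extends Animal { bark() { return "woof"; } }

  const dogs: Dog[] = [new Dog()];
  const animals: Animal[] = dogs;  // accepted: Dog[] is assignable to Animal[]
  animals.push(new Animal());      // also accepted, but sneaks a plain Animal into dogs
  dogs[1].bark();                  // type-checks as a call on a Dog, throws at runtime

Java has the same hole with covariant arrays (it surfaces as an ArrayStoreException at runtime), so "the compiler said my code was valid" is already a weaker guarantee than it sounds.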


I suspect he posited trust in juxtaposition to reliability, rather than veracity.


I've been thinking about this lately, and it seems to me that what these models are very good at is generating text that has the right structure. But of all the permutations with the right structure, only a few actually contain useful and correct information, and the model only hits on those by chance.

And, since the real value in communication is the information contained, that puts a fairly low ceiling on the value of their output. If it can't be trusted without careful review by someone who really understands the subject and can flag mistakes, then it can never truly replace people in any role where correctness matters, and that's most of the roles with a lot of economic value.


If that were the case, outputs would be consistently nonsense - the number of possible variations of text like "colorless green ideas sleep furiously" is so much larger than the meaningful subset, the probability of hitting the latter by chance would be zero for all practical purposes.


Only if the words were chosen simply at random in sequence, and of course they're not chosen that simplistically. They're constrained by the attention model, so they do much better than that, but they're still random. You can control the degree of randomness with the temperature knob.
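
For what it's worth, here is a minimal sketch of what that knob does (my own illustration, with an invented function name, not any particular model's actual code): the model's next-token scores are divided by the temperature before being turned into probabilities, so low temperatures concentrate the sampling on the most likely token and high temperatures flatten it toward uniform randomness.

  // Illustrative temperature-scaled sampling over next-token scores (logits).
  function sampleToken(logits: number[], temperature: number): number {
    const scaled = logits.map(l => l / temperature);  // T < 1 sharpens, T > 1 flattens
    const max = Math.max(...scaled);
    const exps = scaled.map(s => Math.exp(s - max));  // numerically stable softmax
    const total = exps.reduce((a, b) => a + b, 0);
    const probs = exps.map(e => e / total);
    let r = Math.random();                            // this is where the randomness lives
    for (let i = 0; i < probs.length; i++) {
      r -= probs[i];
      if (r <= 0) return i;
    }
    return probs.length - 1;
  }

With scores like [2.0, 1.0, 0.1], a temperature of 0.2 picks the first token almost every time, while a temperature of 2.0 spreads the picks across all three; either way, the output is still a draw from a distribution, which is the parent's point.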


This part about "constrained by the attention model" is doing a lot of implicit work here to dodge the question of why GPT-4 can verifiably reason about things in text.


It also demonstrably is either flat out wrong about a lot of things or completely invents things that don't exist. It's a random process that sometimes generates content with actual informational value but the randomness is inherent in the algorithm.


> And, since the real value in communication is the information contained, that puts a fairly low ceiling on the value of their output. ...then it can never truly replace people in any role where correctness matters and that's most of the roles with a lot of economic value.

I think the thrust of your argument is correct: tasks where correctness matters are inherently less suited to AI automation. But I think that's more a matter of trying to use an LLM for a job that it is the wrong tool for. I think there are many economically valuable roles that are outside of that limited zone, and there will be a lot of people using AI for what AI is good at while the rest of us complain about the limitations when trying to use it for what it isn't good at. (I do a lot of that too.)

Which is probably a waste of time and energy that could be better spent learning how to effectively use an LLM rather than trying to push it in directions that it is incapable of going.

I haven't played much with LLMs yet, so I personally don't have a great sense for what it is good at, and I haven't come across anyone else with a good rundown of the space either. But some things are becoming clear.

LLMs are good at the "blank page" problem, where you know what you want to do but are having a hard time getting started with it. An LLM-generated starting point need not be correct to be useful, and in fact being incorrect can be an advantage since the point is what it triggers in the human's brain.

LLMs are good at many parts of programming that humans are weak at. Humans tend to need to have a certain level of familiarity and comfort with a framework or tool in order to even begin to be productive in it, and we won't use more advanced features or suitable idioms until we get into it enough. An LLM's training data encompasses both the basic starting points as well as more sophisticated uses. So it can suggest idiomatic solutions to problems up front, and since the human is deciding whether and how to incorporate them, correctness is only moderately important. An incorrect but idiomatic use of a framework is close to a correct idiomatic use, while a human-generated correct but awkward use can be very far away from a correct idiomatic use.

Image generation seems similar. My impression is that Midjourney produces good looking output but is fairly useless when you need to steer it to something that is "correct" with respect to a goal. It's great until you actually need to use it, then you have to throw it out. Stable diffusion produces lower quality output but is much more steerable towards "correctness", which requires human artistic intervention.

So there seems to be a common theme. Something like: LLMs are highly useful but require a human to steer and provide "correctness", whatever that might mean in a particular domain.


I agree. I think they will be useful for a lot of things and in some domains you can probably get away with using their output verbatim. But I also think that a lot of people are getting caught up in the hype right now and we're going to see them get used without enough supervision in areas where they really need it.


If AI "lies" less than the top Google hit on the prompt, then it's progress.


Google doesn't really "lie", though; it gives you the source and allows you to make a decision about its authenticity instead of masking it.


Moreover, Google doesn't cite false sources or obfuscate what link you're visiting, or claim a page says something it doesn't.


You forgot the sarcasm tag.


We get multiple hits from Google (though not always ranked by merit). We can scan a few and we often find forum style threads containing valuable elaboration or criticism of the primary material.


> IMHO in order for AI to be truly useful, we need to be able to trust it

Disagree, but perhaps we have different ideas of "useful". I think automated systems including AI can be very useful, but executive decisions yielded by nondeterministic processes (such as AI) must be signed off by a human, and usage should be mindful of inherent limitations. This includes cross-checking factual claims with sources and verifying that produced code works - just as you would (I hope) with a forum comment or Stack Overflow answer before publishing it as fact or pushing it to production.

So I'd rather say: in order for AI to be truly useful, we need to be able to work with it while never trusting it. Let go of unsupervised execution.


> Did they fix the underlying issues or does it just have more data?

IIRC they do have slightly more data, but that's not the primary cause of improvement; the key factor is simply more parameters and more training. No significant actions have been taken to "fix the underlying issues" - you should assume that any major differences between GPT-2 (which is horrible in comparison to GPT-3) and GPT-4 are emergent behavior from the model having more horsepower.


Unfortunately trusting something with capabilities that generalize isn't an easy thing to do.



