Thank you for the recommendation! That I couldn't sign up using a form and I had to "talk to their team" was a turn-off for my (extremely extroverted) self.
The text rendering is quite impressive, but is it just me or do all these generated 'realistic' images have a distinctly uncanny feel to them? I can't quite put my finger on what it is, but they just feel off to me.
I agree. They make me nauseous. The same kind of light nausea as car sickness.
I assume our brains are attuned to things we don't notice consciously, and reject even very mild errors. I've stared at the picture a bit now and the finger holding the balloon is weird. The out-of-place snowman feels weird. If you follow the background blur around, it isn't at the same depth everywhere. Everything that reflects has reflections of things I can't see in the scene.
I don't feel good staring at it now, so I had to stop.
Qwen has always suffered from its subpar RoPE implementation, and Qwen 2 seems to suffer from it as well. The uncanny feel comes down to the sparsity of text-to-image tokens, and the higher you go in resolution, the worse it gets. It's why you can't take the higher end of the megapixel numbers seriously, no matter the model. At the moment there is no model that can go to 4K without problems; you will always get high-frequency artifacts.
I’m always surprised when people bother to point out more-subtle flaws in AI images as “tells”, when the “depth-of-field problem” is so easily spotted, and has been there in every AI image ever since the earliest models.
The blur isn't correct though. The amount of blur is wrong for the distance, zoom, etc. So the depth of field is really wrong even if it conforms to "subject crisp, background blurred".
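The "right" amount of blur isn't a matter of taste either; for a real lens it falls straight out of the focal length, aperture and distances. Here's a minimal sketch of that thin-lens relation (Python, with made-up illustrative numbers, not tied to any particular model or image) in case anyone wants to play with it:

    # Thin-lens blur-disk size: for a given focal length, f-number and focus
    # distance, the blur of a background object is fully determined, so "how
    # blurred" and "how far away" can't be chosen independently.
    def blur_disk_mm(f_mm, f_number, focus_mm, obj_mm):
        """Diameter (mm, on the sensor) of the blur circle for an object at
        obj_mm when the lens is focused at focus_mm."""
        aperture = f_mm / f_number                 # entrance-pupil diameter
        magnification = f_mm / (focus_mm - f_mm)   # magnification at the focus plane
        return aperture * magnification * abs(obj_mm - focus_mm) / obj_mm

    # e.g. a 50 mm f/1.8 lens focused on a subject 1.5 m away, background at 4 m:
    print(blur_disk_mm(50, 1.8, 1500, 4000))       # ~0.6 mm blur circle on the sensor

A generator that treats blur as a look rather than as a consequence of those numbers has no reason to respect this relation, which is presumably why it reads as wrong even when the overall "subject crisp, background blurred" pattern is there.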
My personal mechanistic understanding of diffusion models is that, "under the hood", the core thing they're doing, at every step and in every layer, is a kind of apophenia — i.e. they recognize patterns/textures they "know" within noise, and then they nudge the noise (least-recognizable pixels) in the image toward the closest of those learned patterns/textures, "snapping" those pixels into high-activation parts of their trained-in texture-space (with any text-prompt input just adding a probabilistic bias toward recognizing/interpreting the noise in certain parts of the image as belonging to certain patterns/textures.)
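To make that mental model concrete, here is a deliberately crude toy sketch (Python/NumPy; purely illustrative, not how any real diffusion model is implemented) of what "recognize a known texture in the noise and snap the pixels toward it" could look like:

    import numpy as np

    # A toy caricature of the "apophenia" view above: keep a tiny dictionary of
    # learned "brush presets" (fixed-size patches) and, at each step, nudge every
    # patch of the noisy image part of the way toward its nearest preset.
    # The patch size, step size and random "presets" are all made up.
    rng = np.random.default_rng(0)
    PATCH = 8
    presets = rng.normal(size=(16, PATCH, PATCH))      # stand-ins for learned textures

    def denoise_step(img, strength=0.2):
        out = img.copy()
        for y in range(0, img.shape[0] - PATCH + 1, PATCH):
            for x in range(0, img.shape[1] - PATCH + 1, PATCH):
                patch = img[y:y+PATCH, x:x+PATCH]
                # "recognize" the closest known texture...
                nearest = min(presets, key=lambda p: np.sum((p - patch) ** 2))
                # ...then "snap" the pixels part of the way toward that archetype
                out[y:y+PATCH, x:x+PATCH] = patch + strength * (nearest - patch)
        return out

    noise = rng.normal(size=(64, 64))
    for _ in range(20):
        noise = denoise_step(noise)     # the image drifts toward tiled archetypes

In a real model the "presets" are implicit in the weights and the nudging happens in a learned feature space, but the snap-toward-the-archetype behaviour is the part I'm trying to point at.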
I like to think of these patterns/textures that diffusion models learn as "brush presets", in the Photoshop sense of the term: a "brush" (i.e. a specific texture or pattern), but locked into a specific size, roughness, intensity, rotation angle, etc.
Due to the way training backpropagation works (and presuming a large-enough training dataset), each of these "brush presets" that a diffusion model learns, will always end up learned as a kind of "archetype" of that brush preset. Out of a collection of examples in the training data where uses of that "brush preset" appear with varying degrees of slightly-wrong-size, slightly-wrong-intensity, slightly-out-of-focus-ness, etc, the model is inevitably going to learn most from the "central examples" in that example cluster, and distill away any parts of the example cluster that are less shared. So whenever a diffusion model recognizes a given one of its known brush presets in an image and snaps pixels toward it, the direction it's moving those pixels will always be toward that archetypal distilled version of that brush preset: the resultant texture in perfect focus, and at a very specific size, intensity, etc.
This also means that diffusion models learn brushes at distinctively-different scales / rotation angles / etc as entirely distinct brush presets. Diffusion models have no way to recognize/repair toward "a size-resampled copy of" one of their learned brush presets. And due to this, diffusion models will never learn to render details small enough that the high-frequency components of their recognizable textural detail would be lost below the Nyquist floor (which is why they suck so much at drawing crowds, tiny letters on signs, etc.) And they will also never learn to recognize or reproduce visual distortions like moire or ringing, that occur when things get rescaled to the point that beat-frequencies appear in their high-frequency components.
Which means that:
- When you instruct a diffusion model that an image should have "low depth-of-field", what you're really telling it is that it should use a "smooth-blur brush preset" to paint in the background details.
- And even if you ask for depth-of-field, everything in what a diffusion model thinks of as the "foreground" of an image will always have this surreal perfect focus, where all the textures are perfectly evident.
- ...and that'll be true, even when it doesn't make sense for the textures to be evident at all, because in real life, at the distance the subject is from the "camera" in the image, the presumed textures would actually be so small as to be lost below the Nyquist floor at anything other than a macro-zoom scale.
These last two problems combine to create an effect that's totally unlike real photography, but is actually (unintentionally) quite similar to how digital artists tend to texture video-game characters for "tactile legibility." Just like how you can clearly see the crisp texture of e.g. denim on Mario's overalls (because the artist wanted to make it feel like you're looking at denim, even though you shouldn't be able to see those kinds of details at the scaling and distance Mario is from the camera), diffusion models will paint anything described as "jeans" or "denim" as having a crisply-evident denim texture, despite that being the totally wrong scale.
It's effectively a "doll clothes" effect — i.e. what you get when you take materials used to make full-scale clothing, cut tiny scraps of those materials to make a much smaller version of that clothing, put them on a doll, and then take pictures far closer to the doll, such that the clothing's material textural detail is visibly far larger relative to the "model" than it should be. Except, instead of just applying to the clothing, it applies to every texture in the scene. You can see the pores on a person's face, and the individual hairs on their head, despite the person standing five feet away from the camera. Nothing is ever aliased down into a visual aggregate texture — until a subject gets distant enough that the recognition maybe snaps over to using entirely different "brush preset" learned specifically on visual aggregate textures.
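To put a rough number on the "pores and hairs at five feet" point, here's a quick back-of-the-envelope check, where every figure (lens, sensor, feature sizes) is just an assumed round number for illustration, not something taken from any of these images:

    # Assumed setup: 50 mm lens, full-frame sensor 36 mm / 6000 px wide,
    # subject about 1.5 m (~5 ft) from the camera.
    FOCAL_MM   = 50
    SUBJECT_MM = 1500
    PIXEL_MM   = 36 / 6000                               # ~0.006 mm pixel pitch

    magnification = FOCAL_MM / (SUBJECT_MM - FOCAL_MM)   # thin-lens approximation

    def pixels_across(feature_mm):
        """How many sensor pixels a feature of this physical size spans."""
        return feature_mm * magnification / PIXEL_MM

    print(pixels_across(0.07))   # a ~0.07 mm hair -> ~0.4 px
    print(pixels_across(0.05))   # a ~0.05 mm pore -> ~0.3 px

Anything that lands under about two pixels per detail can only ever show up as an aggregate texture in a real photo, never as crisp individual hairs or pores, which is exactly the aliasing-down that these models never do.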
Right, prompting for depth of field will never work (with current models) because the model treats it as a style rather than knowing on some level how light and lenses behave. The model needs to know this, and then we can prompt it with the lens and zoom and it will naturally do the rest, like how you prompt newer video models without saying "make the ball roll down the hill".
I spent more than an hour writing the above comment, with my own two human hands, spending real thinking time on inventing some (AFAIK) entirely-novel educational metaphors to contribute something unique to the discussion. And you're going to ignore it out-of-hand because, what, you think "long writing" is now something only AIs do?
Kindly look at my commenting history on HN (or on Reddit, same username), where I've been writing with exactly this long and rambling overly-detailed "should have been a blog post" style going on 15+ years now.
Then, once you're convinced that I'm human, maybe you'll take this advice:
A much more useful heuristic for noticing textual AI slop than "it's long and wordy" (or "it contains em-dashes"), is that, no matter how you prompt them, LLMs are constitutionally incapable of writing run-on sentences (like this one!)
Basically every LLM base model at this point, has been RLHFed by feedback from a (not necessarily native-English-speaking, not necessarily highly literate) userbase. And that has pushed the models toward a specific kind of "writing for readability", that aims for a very low lowest-common-denominator writing style... but in terms of grammar/syntax, rather than vocabulary. These base models (or anything fine-tuned from them) will consistently spew out these endless little atomic top-level sentences — one thought per sentence, or sometimes even several itty-bitty sentences per thought (i.e. the "Not x. Not y. Just z." thing) — that can each be digested individually, with no mental state-keeping required.
It's a very inhuman style of writing. No real human being writes like LLMs do, because it doesn't match the way human beings speak or think. (You can edit prose after-the-fact to look like an LLM wrote it, but I dare you to try writing that way on your first draft. It's next to impossible.)
Note how the way LLMs write, is exactly the opposite of the way I write. My writing requires a high level of fluency with English-language grammar and syntax to understand! Which makes it actually rather shitty as general-purpose prose. Luckily I'm writing here on HN for an audience that skews somewhat older and more educated than the general public. But it's still not a style I would subject anyone to if I bothered to spend any time editing what I write after I write it. My writing epitomizes the aphorism "I wrote you a long letter because I didn't have the time to write you a short one." (It's why these are just HN comments in the first place; if I had the time to clean them up, then I'd make them into blog posts!)
Apologies, I did jump the gun here. There have been more and more lazy LLM replies on HN lately, and yours raised a flag in my mind because I can't remember someone commenting that deeply while also agreeing with me (normally if it's a lengthy response it's because they are arguing against my point).
There are some enlightening points here about LLM writing style for me. That trying to write like an LLM is impossible (at least for a non-trivial length of text) is such a good point. Run-on sentences as another hint that it's not an LLM is also useful. Thanks!
Which is pretty amusing, because it's the exact opposite of the problem BFL had with the original Flux model: every single image looked like it was taken at 200mm f/4.
For me the only model that can really generate realistic images is nano banana pro (also known as gemini-3-pro-image). Other models are closing the gap, but this one is pretty meh for realistic images in my opinion.
The examples I saw of z-image look much more realistic than Nano Banana Pro, which is likely using Imagen 4 (plus editing) internally, which isn't very realistic. But Nano Banana Pro has obviously much better prompt alignment than something like z-image.
In your example, z-image and Nano Banana Pro look basically equally photorealistic to me. Perhaps the NBP image looks a bit more real because it resembles an unstaged smartphone shot with wide angle. Anyway, the difference is very small. I agree the lighting in Flux.2 Pro looks a bit off.
But anyway, realistic environments like a street cafe are not well suited to testing for photorealism. You have to use somewhat more fantastical environments.
I don't have access to z-image, but here are two examples with Nano Banana Pro:
These are terribly unrealistic. Far more so than the Flux.2 Pro image above.
> Also Imagen 4 and Nano Banana Pro are very different models.
No, Imagen 4 is a pure diffusion model. Nano Banana Pro is a Gemini scaffold which uses Imagen to generate an initial image, then Gemini 3 Pro writes prompts to edit the image for much better prompt alignment. The prompts above are very simple, so there is little for Gemini to alter, so they look basically identical to plain Imagen 4. Both pictures (especially the first) have the signature AI look of Imagen 4, which is different from other models like Imagen 3.
By the way, here is GPT Image 1.5 with the same prompts:
The first is very fake and the second is a strong improvement, though still far from the excellent cafe shots above (fake studio lighting, unrealistic colors etc).
>Nano Banana Pro is a Gemini scaffold which uses Imagen to generate an initial image, then Gemini 3 Pro writes prompts to edit the image for much better prompt alignment.
First of all, how would you know the architecture details of gemini-3-pro-image? Second, how could the model modify the image if Gemini itself is just rewriting the prompt (like the old ChatGPT + DALL-E setup)? Imagen 4 is just a text-to-image model, not an editing one, so that doesn't make sense. Nano Banana Pro can edit images (like the ones you provide).
> I disagree, the nano banana pro result is in a completely different league.
I strongly disagree. But even if you are right, the difference between the cafe shots and the Atlantis shots is clearly much, much larger than the difference between the different cafe shots. The Atlantis shots are super unrealistic. They look far worse than the cafe shots of Flux.2 Pro.
> Why? It's the perfect settings in my opinion
Because it's too easy obviously. We don't need an AI to make fake realistic photos of realistic environments when we can easily photograph those ourselves. Unrealistic environments are more discriminative because they are much more likely to produce garbage that doesn't look photorealistic.
I'm definitely using Nano Banana Pro, and your picture has the same strong AI look to it that is typical of NBP / Imagen 4.
> First of all, how would you know the architecture details of gemini-3-pro-image? Second, how could the model modify the image if Gemini itself is just rewriting the prompt (like the old ChatGPT + DALL-E setup)? Imagen 4 is just a text-to-image model, not an editing one, so that doesn't make sense. Nano Banana Pro can edit images (like the ones you provide).
There were discussions about it previously on HN. Clearly NBP is using Gemini reasoning, and clearly the style of NBP strongly resembles Imagen 4 specifically. There is probably also a special editing model involved, just like in Qwen-Image-2.0.
Still, the vast majority of models fail at delivering an image that looks real. I want realism for realistic settings; if it can't do that, then what's the point? Of course you can always pay for people and equipment to make the perfect photo for you, haha.
If the z-image turbo image looks as good as the nano banana pro one to you, you are probably so used to slop that any model that doesn't produce obvious artifacts like super shiny skin is immediately indistinguishable from a real image to you (like the nano banana pro one, which to me looks as real as a real photo). And yes, I'm ignoring the fact that in the z-image-turbo one the cup is too large and the bag is inside the chair. Z-image is good (in particular given its size) but not as good.
It seems you are ignoring the fact that the NBP Atlantis pictures look much, much worse than the z-image picture of the cafe. They look far more like AI slop. (Perhaps the Atlantis prompt would look even worse with z-image, I don't know.)
I generated my own using your prompt and posted it in the previous comment. You haven't posted a z-image one of Atlantis. I'm not at home to try, but I have trained LoRAs for z-image (it's a relatively lightweight model); I know the model, and it's not as good as nano banana pro. Use what you prefer.
> I generated my own using your prompt and posted it in the previous comment.
Yes, and it has a very unrealistic AI look to it. That was my point.
> You haven't posted a z-image one of Atlantis.
Yes, I don't doubt that it might well be just as unrealistic or even worse. I also just tried the Atlantis prompts in Grok (no idea what image model they use internally) and they look somewhat more realistic, though not on cafe level.
> may make provision for the provider of a relevant VPN service to apply to any person seeking to access its service in or from the UK age assurance which is highly effective at correctly determining whether or not that person is a child
"The law we made is like super duper good!!"
> Children may also turn to VPNs, which would then undermine the child safety gains of the Online Safety Act
> may make provision for the provider of a relevant VPN service to apply to any person seeking to access its service in or from the UK age assurance which is highly effective at correctly determining whether or not that person is a child
I think you're reading it wrong. Regulations may have a provision that allows providers to apply age assurance [systems ?] if the age assurance is highly effective at determining age.
I'm always surprised how ambiguous the writing is for this kind of stuff. Maybe that's the point. If the regulations don't include the provision ('may' makes it optional), does that mean they need to demand ID?
IMO, highly effective = our buddies' tech that we declare highly effective. The whole ID push around the world is big tech trying to set up government mandated services that you're going to be forced to pay for, either directly or via taxes.
The end game is probably digital IDs with digitally signed requests for everything you do. And, of course, corrupt individuals and criminals will somehow be able to get as many digital IDs as they want.
That money should be spent on education. We're being robbed.
> I'm always surprised how ambiguous the writing is for this kind of stuff.
That's because--
(a) The actual legislators vaguely realize that they're too lazy and stupid to get anything right in detail, so they delegate to the regulatory apparatus.
(b) Neither the legislators nor the regulators are ever quite sure what they can politically get away with actually demanding, or how fast they can politically get away with moving, so they want both the ability to grab anything that looks like they can hold it, and the ability to deny that they ever meant to ask for anything that's blowing back too hard on them.
(c) Both the legislators and the regulators want to be able to threaten various actors with draconian actions that are at least possibly authorized under that kind of vague language, in order to get concessions that they are not authorized to demand (and that would be too hot politically to give them authorization to demand).
Crusader Kings is a franchise where I could really see LLMs shine. One of the main current criticisms of the game is that there's a lack of events, and that they often don't really feel relevant to your character.
An LLM could potentially make events far more tailored to your character, and could actually respond to things happening in the world far more than the game currently does. It could really create some cool emergent gameplay.
In general you are right; I expect something like this to appear in the future, and it would be cool.
But isn't the criticism rather that there are too many (as you say, repetitive and not relevant) events? It's not like there are cool stories emerging from the underlying game mechanics anymore ("grand strategy"); instead, players have to click through these boring predetermined events again and again.
You get too many events, but there aren't actually that many different events written, so you repeat the same ones over and over again. Eventually it just turns into the player clicking on the 'optimal' choice without actually reading the event.
You could mod the game with more varied events, which were of course AI generated to begin with. Bit of an inception scenario where AI plays an AI modded game.
The other option is to have an AI play another AI which is working as an antagonist, trying to make the player fail. More global plagues! More scheming underlings! More questionable choices for relaxation! Bit of an arms race there.
Honestly I prefer Crusader Kings II, if for no other reason than that the UI is so brilliantly, insanely obtuse while also being very good looking.
The industrial revolution was constrained by access to the means of production, leaving only those with capital able to actually produce, which led to new economic situations.
What are the constraints with LLMs? Will an Anthropic, Google, OpenAI, etc, constrain how much we can consume? What is the value of any piece of software if anyone can produce everything? The same applies to everything we're suddenly able to produce. What is the value of a book if anyone can generate one? What is the value of a piece of art, if it requires zero skill to generate it?
I'm fairly certain it was before that. As someone living in the Netherlands, I'd always get warned to allow at least 30-60 minutes of transfer time for each connection in Germany when travelling internationally, as the expectation was that the train would be (extremely) late.
While it is true that many problems were already visible 10 years ago, it is also true that during the pandemic more trains were on time, because having very few passengers speeds up boarding/offboarding at stations enormously. So the pandemic somehow delayed the already inevitable fall into the abyss.
The second line. The video description for me says the following:
"HAWAIʻI VOLCANOES NATIONAL PARK - An incredible sight at the summit of Kilauea volcano on Saturday morning, as Episode 38 erupted enormous lava fountains across the caldera, destroying one of the webcams that was live streaming the event.
All images and video are courtesy the U.S. Geological Survey. A synthesized text-to-video voiceover was used in the narration for this story."