
This is utterly fascinating.

To be clear -- it stores a low-res version in the output file, uses neural networks to predict the full-res version, then encodes the difference between the predicted full-res version and the actual full-res version, and stores that difference as well. (Technically, multiple iterations of this.)
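Roughly, a sketch of that idea (not the paper's actual code -- the upsampling predictor and the entropy coder here are hypothetical stand-ins):

    import numpy as np

    def downsample(img):
        """Average 2x2 blocks to halve the resolution (assumes even dimensions)."""
        h, w = img.shape
        return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)).round()

    def encode(img, levels, predict_upsample, entropy_encode):
        """Store the coarsest image plus, per level, the residual against the net's prediction."""
        pyramid = [img]
        for _ in range(levels):
            pyramid.append(downsample(pyramid[-1]))
        coarse_to_fine = pyramid[::-1]
        bitstream = [entropy_encode(coarse_to_fine[0])]     # lowest-res image, stored directly
        for low, high in zip(coarse_to_fine, coarse_to_fine[1:]):
            residual = high - predict_upsample(low)         # what the network got wrong
            bitstream.append(entropy_encode(residual))      # cheap when predictions are good
        return bitstream

Decoding reverses it: decode the coarsest image, run the same predictor, add the decoded residual, and repeat up the scales.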

I've been wondering when image and video compression would start utilizing standard neural network "dictionaries" to achieve greater compression, at the (small) cost of requiring a local NN file that encodes all the standard image "elements".

This seems like a great step in that direction.



First author here. First of all thanks so much for the interest in SReC! It was a pleasant surprise seeing my research on top of Hacker News. Answering a few questions from reading the comments:

How is this lossless? The entropy coder is what makes this technique lossless. The neural network predicts a probability distribution over pixels, and the entropy coder can find a near optimal mapping of pixel values to bits based on those probabilities (near optimal according to Shannon’s entropy).
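As a back-of-the-envelope sketch (not the actual SReC code) of how predicted probabilities turn into bits:

    import numpy as np

    def ideal_code_length_bits(probs, actual):
        """probs: (num_pixels, 256) predicted distributions; actual: (num_pixels,) true pixel values."""
        p = probs[np.arange(len(actual)), actual]   # probability the model assigned to what actually occurred
        return float(-np.log2(p).sum())             # Shannon bound; a good entropy coder gets close to this

The coder can always represent the true value, so nothing is lost; an unconfident or wrong prediction just costs more bits.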

On practicality of the method, I don’t expect SReC to replace PNGs anytime soon. The current implementation is not efficient for client-side single-image decoding because of the cost of loading a neural network into memory. However, for decoding many high-quality images, this is efficient because the memory cost is amortized. Additionally, the model size can be reduced with a more efficient architecture and pruning/quantization. Finally, as neural networks become more popular in image-related applications, I think the hardware and software support to run neural nets client-side efficiently will get better. This project in its current form is just a proof of concept that we can get state-of-the-art compression rates using neural networks. The previous practical neural network-based approach (L3C) was not able to beat FLIF on Open Images.

For a detailed explanation of how SReC works and results, please refer to our paper: https://arxiv.org/abs/2004.02872.

Btw, SReC is pronounced “Shrek”, because both ogres and neural nets have layers ;).


Hi, and thanks for your interesting work.

Isn't the quality of the prediction heavily influenced by how common the encoded content is?


Yes. However, we only care about compressing natural images and they are not a big subset of the space of all possible images. In practice, we find that neural networks are quite good at making predictions on pixel values, especially when we frame the problem in terms of super-resolution.


Even though the implementation details are far from trivial, the general idea is fairly typical. Most advanced compression algorithms work the same way.

- Using the previously decoded data, try to predict what's next, and the probability of being right

- Using an entropy coder, encode the difference between what is predicted and the actual data. The predicted probability is used to decide how many bits to assign to each possible value. The higher the probability, the fewer bits a "right" answer costs and the more bits a "wrong" answer costs.

Decoding works by "replaying" what the encoder did.
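A minimal sketch of that loop (the arithmetic coder itself is abstracted behind hypothetical encode_symbol/decode_symbol calls; here the symbol is coded directly against the predicted distribution, which is equivalent to the difference-coding framing, as noted further down the thread):

    def compress(data, predictor, coder):
        history = bytearray()
        for byte in data:
            probs = predictor(history)         # distribution over the next byte, from the decoded context
            coder.encode_symbol(byte, probs)   # cheap if probs[byte] is high, expensive if low
            history.append(byte)               # the encoder extends the context with the true byte
        return coder.finish()

    def decompress(bitstream, length, predictor, coder):
        history = bytearray()
        coder.start(bitstream)
        for _ in range(length):
            probs = predictor(history)         # identical prediction: same model, same already-decoded prefix
            byte = coder.decode_symbol(probs)  # recovers exactly the byte the encoder saw
            history.append(byte)
        return bytes(history)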

The most interesting part is the prediction. So much so that some people think of compression as a better test for AI than the Turing test. You are basically asking the computer to solve one of those sequence-based IQ tests.

And of course neural networks are one of the first things we tend to think of when we want to implement an AI, and unsurprisingly they are not an uncommon approach for compression. For instance, the latest PAQ compressors use neural networks.

Of course, all that is the general idea. How to do it in practice is where the real challenge is. Here, the clever part is to "grow" the image from low to high resolution, which kind of reminds me of wavelet compression.


> encode the difference between what is predicted and the actual data

Minor nitpick: In the idealized model there is no single prediction that you can take the difference with. There is just a probability distribution and you encode the actual data using this distribution.

Taking the difference between the most likely prediction and the actual data is just a very common implementation strategy.


How is "the most likely prediction" not a "single prediction that you can take the difference with"?


What if there are two (nearly) equally likely predictions? What if there are N, and they are not close together?

"A single prediction that you can take the difference with" makes some assumptions about the shape of your distribution, at least under any reasonable model of "code the difference" where, e.g., probability decreases as the difference gets larger. These are often very good assumptions, to be fair.


Anyone interested in this approach to lossless compression should visit (and try to win!) the Hutter Prize website [1][2]. The goal is to compress 1 GB of English Wikipedia.

[1] http://prize.hutter1.net/

[2] https://en.wikipedia.org/wiki/Hutter_Prize


I don't think anyone has yet applied neural net approaches to the Hutter Prize with success.

The trained neural network weights tend to be very large and hard to compress, and the Hutter Prize requires their size to be counted too.

To win the Hutter Prize, you'd probably need some kind of training-during-inference system.



I wonder if a larger initial dataset (say 5G or 10G) might lead to better overall % compression.


The actual difference here is the encoding, into a representation of a dataset annotated by thousands of people. Sounds like a basis of knowledge or even understanding.

I bet this scales way better than any other method on large datasets.


Indeed very fascinating.

Reminds me of doing something similar, albeit a thousand times dumber, in ~2004 when I had to find a way to "compress" interior automotive audio data, indicator sounds, things like that. At some point, instead of using traditional compression, I synthesized a wave function and only stored its parameters and the delta from the actual wave, which achieved great compression ratios. It was expensive to compress but virtually free to decompress. And as a side effect my student mind was forever blown by the beauty of it.
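Something in the spirit of this (not the original code; scipy's curve_fit is just a stand-in for whatever fitting was actually used):

    import numpy as np
    from scipy.optimize import curve_fit

    def sine(t, amp, freq, phase, offset):
        return amp * np.sin(2 * np.pi * freq * t + phase) + offset

    def compress(signal, sample_rate):
        t = np.arange(len(signal)) / sample_rate
        params, _ = curve_fit(sine, t, signal, p0=[1.0, 440.0, 0.0, 0.0])
        residual = signal - sine(t, *params)    # near-zero if the fit is good, so it compresses well
        return params, residual                 # store four numbers plus a small delta

    def decompress(params, residual, sample_rate):
        t = np.arange(len(residual)) / sample_rate
        return sine(t, *params) + residual      # virtually free: evaluate the function and add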


It's a really cool idea, but I don't know if this would ever be a practical method for image compression. First of all, you could never change the neural network without breaking the compression, so you can't ever "update" it. Like: what if you figure out a better network? Too bad! I mean, I guess you could, but then you need to version the files and keep copies of all the networks you've ever used, and this gets messy quick.

And speaking of storing the networks: I don't know that you would ever want to pay the memory hit that it would take to store the entire network in memory just to decompress images or video, nor the performance hit the decompression takes. The trade-off here is reduced drive space in exchange for massively increased RAM and CPU/GPU time. I don't know any case where you'd want to make that trade-off, at least not at this magnitude.

Again though: it's an awesome idea. I just don't know that's ever going to be anything other than a cool ML curiosity.


> First of all, you could never change the neural network without breaking the compression, so you can't ever "update" it. Like: what if you figure out a better network? Too bad!

Isn’t this just a special version of a problem any type of compression will always have? There’s all kinds of ways you can imagine improving on a format like JPEG, but the reason it’s useful is because it’s locked down and widely supported.


Usual compression standards are mostly adaptive, estimating statistical models of the input from implicit prior distributions (e.g. the probability of A followed by B begins at p(A)p(B)), reasonable assumptions (e.g. scanlines in an image follow the same distribution), and small, fixed tables and rules (e.g. the PNG filters): not only a low volume of data, but data that can only change as part of a major change of algorithm.

A neural network that models upscaling is, on the other hand, not only inconveniently big, but also completely explicit (inviting all sorts of tweaking and replacement) and adapted to a specific data set (further demanding specialized replacements for performance reasons).

Among the applications that are able to store and process the neural network, which is no small feat, I don't think many would be able to amortize the cost of a tailored neural network over a large, fixed set of very homogeneous images.

The imagenet64 model is over 21 MB: saving 21 MB over PNG size, at 4.29 vs 5.74 bpp (table 2a in the article), requires a set of more than 83 MB of perfectly imagenet64-like PNG images, which is a lot. Compressing the image datasets used for neural network experiments, which are large and stable, with a custom upscaling model is the most likely good application (with the side benefit of producing useful and interesting downscaled images for free in addition to compressing the originals).
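Spelling out that break-even arithmetic (bpp figures from table 2a):

    png_bpp, srec_bpp = 5.74, 4.29
    savings_fraction = (png_bpp - srec_bpp) / png_bpp     # ~0.25: SReC files are about 25% smaller
    model_size_mb = 21
    break_even_mb = model_size_mb / savings_fraction      # ~83 MB of PNGs before the model pays for itself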


Even if it's not useful for general-purpose compression, it may still be useful in a more restricted domain. In text compression, Brotli can be found in Chrome with a dictionary that is tuned for HTTP traffic. And in audio compression, LPCnet is a research codec that used Wavenet (neural nets for speech synthesis) to compress speech to 1.6kb/s (prior discussion from 2019 at https://news.ycombinator.com/item?id=19520194).


For a standard network, you're right there would only be one version. So you just make sure it's very carefully put together. (If a massively better one comes along, then you just make it a new file format.)

And as for performance/resources -- great point. But what about video, where the space/bandwidth improvements become drastically more important?

Since h.264 and h.265 already have dedicated hardware, would it be reasonable to assume that a chip dedicated to this would handle it just fine?

And that if you've already got hardware for video, then of course you'd just re-use it for still images?


>(If a massively better one comes along, then you just make it a new file format.)

I guess you could have versioning of your file format, and some sort of organization that standardized it.


Then you get the layperson who doesn’t understand that and asks why their version 42 .imgnet won’t open in a program only supporting up to 10 (but they don’t know their image is v42 and the program only supports v10). It’s easier to understand different formats than different versions.


You'd think the program that only supports up to v10 should be able to emit an error saying that.


I think the idea is the network is completely trained and encoded along with the image and delta data. A new network would just require retraining and storing that new network along with the image data. It doesn't use a global network for all compressions.


I don't think this would work, the size of the network would likely dominate the size of the compressed image.


Wouldn't the network be part of the decoder?


Yes, and this is why you couldn't update the network. Still, much like how various compression algos have "levels," this standard could be more open in this regard, adding new networks (sort of what others above refer to as versions), and the image could just specify which network it uses. Maybe have a central repo from which the decoder could pull a network it doesn't have (i.e. I make a site and encode all 1k images on it using my own network; you pull the network to your browser once so you can decode all 1k images). And even support a special mode where the image explicitly includes the network to be used for decoding it along with the image data (this could make sense for very large images, as well as for specialized/demonstration/test purposes).
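As a sketch of what such a container might look like (every field name here is invented):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class NNImageFile:
        format_version: int
        network_id: str                     # e.g. a hash of the weights, resolvable via a central repo
        embedded_weights: Optional[bytes]   # only set in the self-contained "special mode"
        payload: bytes                      # entropy-coded pixel data

    def resolve_network(f, local_cache, fetch):
        """Prefer a cached network, then an embedded one, then pull from the repo and cache it."""
        if f.network_id in local_cache:
            return local_cache[f.network_id]
        if f.embedded_weights is not None:
            return f.embedded_weights
        weights = fetch(f.network_id)       # e.g. one download per site/network, reused for all its images
        local_cache[f.network_id] = weights
        return weights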

All in all, a very interesting idea.


I wonder what the security implications of all this are; it sounds dangerous to just run any old network. I suppose if it's sandboxed enough, with very strongly defined inputs and outputs, then the worst that could happen is you get garbled imagery?


They include the trained models under the "model weights" section. The ImageNet model is ~20 MB, the Open Images one is ~17 MB.

Now this might be prohibitive for images over the web, but it'd be interesting whether it might be applicable to images with huge resolutions for printing, where single images are hundreds of megabytes.


Why can't you update it?

There could be a release of a new model every 6 months or something (although even that is probably too often, the incremental improvement due to statistical changes in the distribution of images being compressed isn't likely to change much over time), and you just keep a copy of all the old models (or lazily download them like msft foundation c++ library versions when you install an application).

The models themselves aren't very large.


I don't know why this comment was downvoted - it's a legitimate question.

One scenario I can picture is the Netflix app on your TV. Firstly, they create a neural network trained on the video data in their library and ship it to all their clients while they are idle. They could then stream very high-quality video at lower bandwidth than they currently use and, assuming decoding can be done quickly enough, provide a great experience for their users. Any updates to the neural network could be rolled out gradually and in the background.


Google used to do something called SDCH (Shared Dictionary Compression for HTTP), where a delta compression dictionary was downloaded to Chrome.

The dictionary had to be updated from time to time to keep a good compression rate as the Google website changed over time. There was a whole protocol to handle verifying what dictionary the client had and such.


Not just that, but you could take a page out of the "compression" book and treat the NN as a sort of dictionary in that it is part of the compressed payload. Maybe not the whole NN, but perhaps deltas from a reference implementation, assuming the network structure remains the same and/or similar.


> I don't know that you would ever want to pay the memory hit that it would take to store the entire network in memory just to decompress images or video, nor the performance hit the decompression takes.

The big memory load wouldn't necessarily be a problem for the likes of Youtube and Netflix - they could just have dedicated machines which do nothing else but decoding. The performance penalty could be a killer though.


If you've got a big enough image you can include the model parameters with the image.


There is already a startup that makes a video compression codec based on ML - http://www.wave.one/video-compression - I am personally following their work because I think it's pretty darn cool.


There's also TVeon

https://tveon.com


It's an old idea really, or a collection of old ideas with an NN twist. Not really clear how much that latter bit brings to the table, but interesting to think about.

The "dictionary" approach was roughly what vector quantization was all about. The idea of turning lossy encoders into lossless by also encoding the error is a old one too, but somewhat derailed by focus on embedable codecs with an ideal of each additional bit read will improve your estimate.

I think the potential novelty here is really in the unfortunately-named-but-too-late-now super-resolution aspects. You could do the same sort of thing ages ago with, say, IFS projection, or wavelet (and related) trees, or VQ dictionaries with a resolution bump, but they were limited by the training a bit (although this approach might have some overtraining issues that make it worse for particular applications).


Great explanation. If it is and stays lossless, it would make an awesome photo archiving and browsing tool.

Browse thumbnails, open original. Without any processes to generate / keep in sync these files.


The JPEG 2000 standard allows for multiscale resolutions by using wavelet transforms instead of the discrete cosine transform.


The majority of photos you already have most likely contain thumbnail and larger preview images embedded in the EXIF header.

Raw images typically contain an embedded, full-sized JPEG version of the image as well.

All of these are easily extracted with `exiftool -b -NameOfBinaryTag $file > thumb.jpg`.

I've found while making PhotoStructure that the quality of these embedded images is surprisingly inconsistent, though. Some makes and models do odd things, like handle rotation inconsistently, add black bars to the image (presumably to fit the camera display whose aspect ratio is different from the sensor's), render the thumb with a color or gamma shift, or apply low-quality reduction algorithms (apparent due to nearest-neighbor jaggies).

I ended up having to add a setting that lets users ignore these previews or thumbnails (to choose between "fast" and "high quality").


The point is to have originals available at a good compression rate. Having a thumbnail in the original sucks, as I don’t want lossy compression on my originals.


Here's a great book on this: http://mattmahoney.net/dc/dce.html

The author was using neural networks for compression a long time ago, before it was a thing again.

EDIT: Oh, somebody mentioned it already (but it's really good, free & totally worth reading)


My interpretation: Create and distribute a library of all possible images (except ones which look like random noise or are otherwise unlikely to ever be needed). When you want to send an image, find it in the library and send its index instead. Use advanced compression (NNs) to reduce the size of the library.


Of the papers at Mahoney's page [0], "Fast Text Compression with Neural Networks" dates to 2000; people have been applying these techniques for decades.

[0] http://mattmahoney.net/dc/


So you could say it precomputes a function (and its inverse) which allows computing a very space-efficient, information-dense difference between a large image and its thumbnail?


This technique has also been used in the ogg-opus audio codec.



