How does Shodan work? Like, how do they know if something is exposed to the inter...

achillean · on June 24, 2021

Here is an overview of what Shodan is:

https://help.shodan.io/the-basics/what-is-shodan

The scanning algorithm is mostly just this:

1. Generate a random IPv4 address

2. Select a random port from a list of ~2k ports

3. Check the random IP on the random port

4. Store the result of the check

5. GOTO 1

The above loop runs endlessly, and because the IPv4 address space is fairly small, it doesn't take long to check everything.
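For illustration only, the five steps above could be sketched in Python roughly like this. This is not Shodan's code: the short `PORTS` list, the plain TCP connect in `check`, and the in-memory `results` list are all assumptions made for the example, and the loop is bounded by `iterations` so that it terminates.

```python
import ipaddress
import random
import socket

# Hypothetical stand-in for the real list of ~2k ports, which the thread does not give.
PORTS = [21, 22, 23, 80, 443, 8080]

def random_ipv4(rng):
    """Step 1: pick a uniformly random 32-bit IPv4 address."""
    return ipaddress.IPv4Address(rng.getrandbits(32))

def check(ip, port, timeout=1.0):
    """Step 3: assumed check, a plain TCP connect (the thread does not say how the check works)."""
    try:
        with socket.create_connection((str(ip), port), timeout=timeout):
            return True
    except OSError:
        return False

def scan(iterations, rng=None, checker=check):
    """Run the loop a fixed number of times; the loop described above never stops."""
    rng = rng or random.Random()
    results = []
    for _ in range(iterations):
        ip = random_ipv4(rng)                      # step 1
        port = rng.choice(PORTS)                   # step 2
        is_open = checker(ip, port)                # step 3
        results.append((str(ip), port, is_open))   # step 4
    return results                                 # step 5 would be "GOTO 1"
```

Passing a fake `checker` (for example `lambda ip, port: False`) lets you try the loop without sending any traffic. Only probe networks you are authorized to scan.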

hnick · on June 24, 2021

If everyone somehow magically switched to the much larger IPv6 address space would that be a big problem for you?

achillean · on June 24, 2021

We also crawl IPv6 but it's a very different and more complicated algorithm. We would still end up crawling a sizable chunk but there are obviously unknowns in how biased our dataset would be (i.e. we might index mostly cloud servers and fewer residential devices).

hnick · on June 25, 2021

That's what I thought. You might have to maintain a map of ISPs and such, but even then it'd be hard to find all the clients under them.

29athrowaway · on June 24, 2021

Is there a way to opt-out from the scans?

zootboy · on June 24, 2021

They do a monthly scan, with additional spot checks available on-demand:

https://help.shodan.io/the-basics/on-demand-scanning

achillean · on June 24, 2021

We actually scan on average once a week. I used that language to be ultra conservative but I'll need to change it. For the past 8+ years we've been doing weekly scans.

exikyut · on June 24, 2021

In case you see this, the most interesting question I think I could possibly ask is, what does the current real-world impact of IPv6 appear to be practically speaking?

Abstractly and intuitively, IPv6's massiveness would seem to put an end to the interesting closed loop of address space vs backhaul capacity that has developed around v4. I can't help but wonder, though: with some providers leasing out ginormous blocks of address space according to fairly predictable patterns (and customers just using the first v6 address that pops out, if at all), will it be possible to steer v6 scans using a mix of statistics, machine learning, and Perl if statements? :)
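To make the "predictable patterns" idea concrete, here is a hypothetical Python sketch that lists the first few low-numbered addresses in each /64 of a provider block. It is purely a guess at what pattern-steered target generation could look like, not a description of how anyone actually scans IPv6. The function name `candidate_addresses` and its parameters are invented for the example, and `2001:db8::/32` is the reserved documentation range.

```python
import ipaddress
from itertools import islice

def candidate_addresses(prefix, low_hosts=3, max_subnets=4):
    """Yield the first `low_hosts` addresses (::1, ::2, ...) of the first
    `max_subnets` /64 subnets inside `prefix` (which must be /64 or shorter)."""
    net = ipaddress.IPv6Network(prefix)
    # subnets() lazily walks the /64s in order, so large blocks are fine.
    for subnet in islice(net.subnets(new_prefix=64), max_subnets):
        for host in range(1, low_hosts + 1):
            yield subnet.network_address + host

if __name__ == "__main__":
    # Documentation prefix only; nothing here is a real target.
    for addr in candidate_addresses("2001:db8::/62", low_hosts=2):
        print(addr)
```

A real hitlist would presumably weight subnets and host patterns by observed data rather than simply taking the lowest numbers.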

The other thing I'm idly curious about is how you actually scan on a regular basis. Broadly speaking about long-term viability, I guess the TL;DR probably boils down to coordination and careful nurturing of reputation similar to what the large-scale email providers maintain. But from a technical perspective, I do wonder if/how much things like peering, and BGP, and noise-cancelling routing (if you will), etc, come into the picture - and how big the links are :D

I would be very happy to coincidentally discover writeups touching on these questions anytime. Thanks for reading :)

29athrowaway · on June 24, 2021

They go through the entire IP address range scanning specific ports.

exikyut · on June 24, 2021

The TL;DR is that IPv4 means there are only about 4.3 billion IP addresses, which modern 1-10Gbps backhaul links can ticker-tape through in anywhere from a few minutes to under an hour. IPv6's 2^128 (about 3.4 x 10^38) IP address limit will sadly make broad scanning utterly infeasible going forward until users have 1Tbit connections or so :(
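A back-of-the-envelope check of those numbers, as a rough sketch. It assumes one minimum-size Ethernet frame per probe (84 bytes on the wire including preamble and inter-frame gap), one probe per address on a single port, and a link that is fully saturated; `sweep_seconds` is a name made up for this example.

```python
# Assumption: each probe is a minimum-size Ethernet frame, which occupies
# 84 bytes on the wire (64-byte frame + 8-byte preamble + 12-byte gap).
WIRE_BITS_PER_PROBE = 84 * 8

SECONDS_PER_YEAR = 365 * 24 * 3600

def sweep_seconds(address_bits, link_bps):
    """Seconds to send one probe to every address in a 2**address_bits space."""
    probes_per_second = link_bps / WIRE_BITS_PER_PROBE
    return 2 ** address_bits / probes_per_second

if __name__ == "__main__":
    for label, bps in [("1 Gbps", 1e9), ("10 Gbps", 1e10), ("1 Tbps", 1e12)]:
        ipv4_minutes = sweep_seconds(32, bps) / 60
        one_64_years = sweep_seconds(64, bps) / SECONDS_PER_YEAR
        ipv6_years = sweep_seconds(128, bps) / SECONDS_PER_YEAR
        print(f"{label}: IPv4 {ipv4_minutes:.1f} min, "
              f"one /64 {one_64_years:.3g} years, "
              f"all of IPv6 {ipv6_years:.3g} years")
```

Under these assumptions a full IPv4 sweep of one port takes about 48 minutes at 1 Gbps and about 4.8 minutes at 10 Gbps. A single IPv6 /64 subnet holds 2^64 addresses and would take centuries even at 1 Tbps, and the full 2^128 space is far beyond that.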

But for now we can do this with the v4 parts that are left: https://www.youtube.com/watch?v=nX9JXI4l3-E

Also, some crazy person good-haxed a bunch of routers and modems back in 2012 and made http://census2012.sourceforge.net/paper.html (without access to fast connections and before the advent of masscan and other straightforward tools, too). Of note is that the "Unallocated" grey areas in the analysis images are a curious illustration of how much less full the IPv4 internet was just ~9 years ago.