raggi 20 hours ago

It's amusing that Xe managed to turn what was historically mostly a joke/shitpost into an actually useful product. They did always say timing was everything.

I am kind of surprised how many sites seem to want/need this. I get the slow git pages problem for some of the git servers that are super deep, lack caches, serve off slow disks, etc.

Unesco surprised me some, the sub-site in question is pretty big, it has thousands of documents of content, but the content is static - this should be trivial to serve, so what's going on? Well it looks like it's a poorly deployed Wordpress on top of Apache, with no caching enabled, no content compression, no HTTP 2/3. It would likely be fairly easy to get this serving super cheap on a very small machine, but of course doing so requires some expertise, and expertise still isn't cheap.

Sure you could ask an LLM, but they still aren't good at helping when you have no clue what to ask - if you don't even really know the site is slower than it should be, why would you even ask? You'd just hear about things getting crushed and reach for the furry defender.

  • adrian17 11 hours ago

    > but of course doing so requires some expertise, and expertise still isn't cheap

    Sure, but at the same time, the number of people with the expertise to set up Anubis (not that it's particularly hard, but I mean: even being aware that it exists) is surely even lower than the number of people with Wordpress administration experience, so I'm still surprised.

    If I were to guess, the reasons for not touching Wordpress were unrelated, like: not wanting to touch a brittle instance, or organization permissions, or maybe the admins just assumed that WP is configured well already.

  • jtbayly 19 hours ago

    My site that I’d like this for has a lot of posts, but there are links to a faceted search system based on tags that produces an infinite number of possible combinations and pages for each one. There is no way to cache this, and the bots don’t respect the robots file, so they just constantly request URLs, getting the posts over and over in different numbers and combinations. It’s a pain.

  • mrweasel 10 hours ago

    > I am kind of surprised how many sites seem to want/need this.

    The AI scrapers are not only poorly written, they also go out of their way to do cache busting. So far I've seen a few solutions: CloudFlare, requiring a login, Anubis, or just insane amounts of infrastructure. Some sites have reported 60% of their traffic coming from bots now; for smaller sites it's probably much higher.

  • cedws 11 hours ago

    PoW anti-bot/scraping/DDOS was already being done a decade ago, I’m not sure why it’s only catching on now. I even recall a project that tried to make the PoW useful.

    • xena 10 hours ago

      Xe here. If I had to guess in two words: timing and luck. As the G-man said: the right man in the wrong place can make all the difference in the world. I was the right shitposter in the right place at the right time.

      And then the universe blessed me with a natural 20. Never had these problems before. This shit is wild.

      • underdeserver 10 hours ago

        Squeeze that lemon as far as it'll go mate, god speed and may the good luck continue.

gyomu 21 hours ago

If you’re confused about what this is - it’s to prevent AI scraping.

> Anubis uses a proof-of-work challenge to ensure that clients are using a modern browser and are able to calculate SHA-256 checksums

https://anubis.techaro.lol/docs/design/how-anubis-works

This is pretty cool, I have a project or two that might benefit from it.

  • x3haloed 18 hours ago

    I’ve been wondering to myself for many years now whether the web is for humans or machines. I personally can’t think of a good reason to specifically try to gate bots when it comes to serving content. Trying to post content or trigger actions could obviously be problematic under many circumstances.

    But I find that when it comes to simple serving of content, human vs. bot is not usually what you’re trying to filter or block on. As long as a given client is not abusing your systems, then why do you care if the client is a human?

    • xboxnolifes 18 hours ago

      > As long as a given client is not abusing your systems, then why do you care if the client is a human?

      Well, that's the rub. The bots are abusing the systems. The bots are accessing the contents at rates thousands of times faster and more often than humans. The bots also have access patterns unlike your expected human audience (downloading gigabytes or terabytes of data multiples times, over and over).

      And these bots aren't some being with rights. They're tools unleashed by humans. It's humans abusing the systems. These are anti-abuse measures.

      • immibis 11 hours ago

        Then you look up their IP address's abuse contact, send an email and get them to either stop attacking you or get booted off the internet so they can't attack you.

        And if that doesn't happen, you go to their ISP's ISP and get their ISP booted off the Internet.

        Actual ISPs and hosting providers take abuse reports extremely seriously, mostly because they're terrified of getting kicked off by their own ISP. And there's no end to that - there's just a chain of ISPs from them to you, and you might end up convincing your ISP or some intermediary to block traffic from them. However, as we've seen recently, rules don't apply if enough money is involved. But I'm not sure if these shitty interim solutions come from ISPs ignoring abuse when money is involved, or from people not knowing that abuse reporting is taken seriously to begin with.

        Anyone know if it's legal to return a never-ending stream of /dev/urandom based on the user-agent?

        • sussmannbaka 11 hours ago

          Please, read literally any article about the ongoing problem. The IPs are basically random, come from residential blocks, requests don’t reuse the same IP more than a bunch of times.

          • immibis 6 hours ago

            Are you sure that's AI? I get requests that are overtly from AI crawlers, and almost no other requests. Certainly all of the high-volume crawler-like requests overtly say that they're from crawlers.

            And those residential proxy services cost their customer around $0.50/GB up to $20/GB. Do with that knowledge what you will.

        • zinekeller 11 hours ago

          > Then you look up their IP address's abuse contact, send an email and get them to either stop attacking you or get booted off the internet so they can't attack you.

          You would be surprised at how many ISPs will not respond. Sure, Hetzner will respond, but these abusers are not using Hetzner at all. If you actually study the problem, these are residential ISPs in various countries (including the US and Europe, mind you). At best the ISP will respond to their customers one by one and scan their computers (by which point the abusers have already switched to another IP block), and at worst the ISP literally has no capability to control this because they cannot trace their CGNATted connections (short of blocking connections to your site, which is definitely nuclear).

          > And if that doesn't happen, you go to their ISP's ISP and get their ISP booted off the Internet.

          Again, the IP blocks are rotated, so by the time they respond you need to do the whole reporting rigmarole again. Additionally, these ISPs would instead suggest blackholing these requests or using a commercial solution (i.e. Cloudflare or something else), because at the end of the day these residential ISPs are national entities that would quite literally trigger geopolitical concerns if you disconnected them.

          • immibis 6 hours ago

            These the same residential providers that people complain cut them off for torrenting? You think they wouldn't cut off customers who DDoS?

            • op00to 5 hours ago

              They’re not cutting you off for torrenting because they think it’s the right thing to do. They’re cutting you off for torrenting because it costs them money if rights holders complain.

        • bayindirh 10 hours ago

          When I was migrating my server, and checking logs, I have seen a slew of hits in the rolling logs. I reversed the IP and found a company specializing in "Servers with GPUs". Found their website, and they have "Datacenters in the EU", but the company is located elsewhere.

          They're certainly positioning themselves for providing scraping servers for AI training. What will they do when I say that one of their customers just hit my server with 1000 requests per second? Ban the customer?

          Let's be rational. They'll laugh at that mail and delete it. Bigger players use "home proxying" services which use residential blocks for egress, and make one request per host. Some people are cutting whole countries off with firewalls.

          Playing by the old rules won't get you anywhere, because all these gentlemen took their computers and went to work elsewhere. Now all we have are people who think they need no permission because what they do is awesome, anyway (which it is not).

          • immibis 6 hours ago

            A startup hosting provider you say - who's their ISP? Does that company know their customer is a DDoS-for-hire provider? Did you tell them? How did they respond?

            At the minimum they're very likely to have a talk with their customer "keep this shit up and you're outta here"

        • mrweasel 4 hours ago

          > Then you look up their IP address's abuse contact, send an email

          Good luck with that. Have you ever tried? AWS and Google have abuse mails. Do you think they read them? Do you think they care? It is basically impossible to get AWS to shut down a customer's systems, regardless of how much you try.

          I believe ARIN has an abuse email registered for a Google subnet, with the comment that they believe it's correct, but no one answered the last time they tried it, three years ago.

      • bbor 16 hours ago

        Well, that's the meta-rub: if they're abusing, block abuse. Rate limits are far simpler, anyway!

        In the interest of bringing the AI bickering to HN: I think one could accurately characterize "block bots just in case they choose to request too much data" as discrimination! Robots of course don't have any rights so it's not wrong, but it certainly might be unwise.

        • inejge 16 hours ago

          > Rate limits are far simpler, anyway!

          Not when the bots are actively programmed to thwart them by using far-flung IP address carousels, request pacing, spoofed user agents and similar techniques. It's open war these days.

          • parineum 15 hours ago

            Request pacing sounds intentionally unabusive.

            • j16sdiz 12 hours ago

              They are not bringing down your server, but they are taking 80%+ of your bandwidth budget. Does this count as abuse?

              • immibis 11 hours ago

                Are you at a hoster with extortionately expensive bandwidth, such as AWS, GCP, or Azure?

              • ithkuil 12 hours ago

                Isn't that what a rate limiter would address?

                • mkl 11 hours ago

                  Not when the traffic is coming from 10s of thousands of IP addresses, with very few requests from each one: https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...

                  • KronisLV 10 hours ago

                    That very much reads like the rant of someone who is sick and tired of the state of things.

                    I’m afraid that it doesn’t change anything in and of itself, and some sort of solution to only allow the users that you’re okay with is what’s direly needed all across the web.

                    Though reading about the people trying to mine crypto on a CI solution, it feels that sometimes it won’t just be LLM scrapers that you need to protect against but any number of malicious people.

                    At that point, you might as well run an invite only community.

                    • bayindirh 10 hours ago

                      SourceHut implemented Anubis, and it works so well. I almost never see the waiting screen, and afterwards it whitelists me for a very long time, so I can work without any limitations.

                      • KronisLV 10 hours ago

                        That’s great to hear and Anubis seems cool!

                        I just worry about the idea of running public/free services on the web, due to the potential for misuse and bad actors, though making things paid also seems sensible, e.g. what was linked: https://man.sr.ht/ops/builds.sr.ht-migration.md

            • rollcat 13 hours ago

              It's called DDoS. DDoS is abusive.

    • praptak 15 hours ago

      The good thing about proof of work is that it doesn't specifically gate bots.

      It may have some other downsides - for example, I don't think Google is possible in a world where everyone requires proof of work (some may argue that's a good thing) - but again, it doesn't specifically gate bots. It gates mass scraping.

    • t-writescode 18 hours ago

      > I personally can’t think of a good reason to specifically try to gate bots

      There's been numerous posts on HN about people getting slammed, to the tune of many, many dollars and terabytes of data from bots, especially LLM scrapers, burning bandwidth and increasing server-running costs.

      • ronsor 18 hours ago

        I'm genuinely skeptical that those are all real LLM scrapers. For one, a lot of content is in CommonCrawl and AI companies don't want to redo all that work when they can get some WARC files from AWS.

        I'm largely suspecting that these are mostly other bots pretending to be LLM scrapers. Does anyone even check if the bots' IP ranges belong to the AI companies?

        • 20after4 16 hours ago

          For a long time there have been spammers scraping in search of email addresses to spam. There are all kinds of scraper bots with unknown purpose. It's the aggregate of all of them hitting your server, potentially several at the same time.

          When I worked at Wikimedia (so ending ~4 years ago) we had several incidents of bots getting lost in a maze of links within our source repository browser (Phabricator) which could account for > 50% of the load on some pretty powerful Phabricator servers (Something like 96 cores, 512GB RAM). This happened despite having those URLs excluded via robots.txt and implementing some rudimentary request throttling. The scrapers were using lots of different IPs simultaneously and they did not seem to respect any kind of sane rate limits. If googlebot and one or two other scrapers hit at the same time it was enough to cause an outage or at least seriously degrade performance.

          Eventually we got better at rate limiting and put more URLs behind authentication but it wasn't an ideal situation and would have been quite difficult to deal with had we been much more resource-constrained or less technically capable.

        • t-writescode 18 hours ago

          No matter the source, the result is the same, and these proof-of-work systems may be something that can help "the little guy" with their hosting bill.

          • ronsor 2 hours ago

            If a bot claims to be from an AI company, but isn't from the AI company's IP range, then it's lying and its activity is plain abuse. In that case, you shouldn't serve them a proof of work system; you should block them entirely.

        • userbinator 16 hours ago

          Also suspect those working on "anti-bot" solutions may have a hand in this.

          What better way to show the effectiveness of your solution, than to help create the problem in the first place.

          • zaphar 7 hours ago

            Why? When there are hundreds of hopeful AI/LLM scrapers more than willing to do that work for you, what possible reason would you have to do it yourself? Typical, common human behavior is perfectly capable of explaining this. No reason to reach for some kind of underhanded conspiracy theory when simple incompetence and greed are more than adequate to explain it.

        • anonym29 17 hours ago

          >Does anyone even check if the bots' IP ranges belong to the AI companies?

          Sounds like a fun project for an AbuseIPDB contributor. Could look for fake Googlebots / Bingbots, etc, too.

    • gbear605 18 hours ago

      The issue is not whether it’s a human or a bot. The issue is whether you’re sending thousands of requests per second for hours, effectively DDOSing the site, or if you’re behaving like a normal user.

    • laserbeam 16 hours ago

      The reason is: bots DO spam you repeatedly and increase your network costs. Humans don’t abuse the same way.

    • starkrights 13 hours ago

      Example problem that I’ve seen posted about a few times on HN: LLM scrapers (or at least, an explosion of new scrapers) mindlessly crawling every single HTTP endpoint of a hosted git service instead of just cloning the repo, entirely ignoring robots.txt.

      The point of this is that there has recently been a massive explosion in the number of bots that blatantly, aggressively, and maliciously ignore and attempt to bypass (mass IP/VPN switching, user agent swapping, etc.) anti-abuse gates.

    • mieses 11 hours ago

      There is hope for misguided humans.

namanyayg 21 hours ago

"It also uses time as an input, which is known to both the server and requestor due to the nature of linear timelines"

A funny line from his docs

  • xena 20 hours ago

    OMG lol I forgot that I left that in. Hilarious. I think I'm gonna keep it.

    • didgeoridoo 20 hours ago

      I didn’t even blink at this, my inner monologue just did a little “well, naturally” in a Redditor voice and kept reading.

    • mkl 20 hours ago

      BTW Xe, https://xeiaso.net/pronouns is 404 since sometime last year, but it is still linked to from some places like https://xeiaso.net/blog/xe-2021-08-07/ (I saw "his" above and went looking).

      • xena 20 hours ago

        I'm considering making it come back, but it's just gotten me too much abuse so I'm probably gonna leave it 404-ing until society is better.

        • cendyne 18 hours ago

          That's what route-specific Anubis is for.

          • frontalier 15 hours ago

            parent is referring to a different kind of abuse

            • 1oooqooq 11 hours ago

              or you're just not cranking up the required proof-of-work effort enough.

AnonC 17 hours ago

Those images on the interstitial page(s) while waiting for Anubis to complete its check are so cute! (I’ve always found all the art and the characters in Xe’s blog very beautiful)

Tangentially, I was wondering how this would impact common search engines (not AI crawlers) and how this compares to Cloudflare’s solution to stop AI crawlers, and that’s explained on the GitHub page. [1]

> Installing and using this will likely result in your website not being indexed by some search engines. This is considered a feature of Anubis, not a bug.

> This is a bit of a nuclear response, but AI scraper bots scraping so aggressively have forced my hand.

> In most cases, you should not need this and can probably get by using Cloudflare to protect a given origin. However, for circumstances where you can't or won't use Cloudflare, Anubis is there for you.

[1]: https://github.com/TecharoHQ/anubis/

  • snvzz 9 hours ago

    >Those images on the interstitial page(s) while waiting for Anubis to complete its check are so cute!

    Love them too, and abhor knowing that someone is bound to eventually remove them because they'll be found "problematic" in one way or another.

roenxi 17 hours ago

I like the idea but this should probably be something that is pulled down into the protocol level once the nature of the challenge gets sussed out. It'll ultimately be better for accessibility if the PoW challenge is closer to being part of TCP than implemented in JavaScript individually by each website.

prologic 20 hours ago

I've read about Anubis, cool project! Unfortunately, as pointed out in the comments, it requires your site's visitors to have Javascript™ enabled. This is totally fine for sites that require Javascript™ anyway to enhance the user experience, but not so great for static sites and such that require no JS at all.

I built my own solution that blocks these "Bad Bots" at the network level. I block the entirety of several large "Big Tech / Big LLM" networks at the ASN (BGP) level by utilizing MaxMind's database and a custom WAF and reverse proxy I put together.

  • xyzzy_plugh 18 hours ago

    A significant portion of the bot traffic TFA is designed to handle originates from consumer/residential space. Sure, there are ASN games being played alongside reputation fraud, but it's very hard to combat. A cursory investigation of our logs showed these bots (which make ~1 request from a given residential IP) are likely in ranges that our real human users occupy as well.

    Simply put you risk blocking legitimate traffic. This solution does as well but for most humans the actual risk is much lower.

    As much as I'd love to not need JavaScript and to support users who run with it disabled, I've never once had a customer or end user complain about needing JavaScript enabled.

    It is an incredible vocal minority who disapprove of requiring JavaScript, the majority of whom, upon encountering a site for which JavaScript is required, simply enable it. I'd speculate that, even then, only a handful ever release a defeated sigh.

    • prologic 18 hours ago

      This is true. I had some bad actors from the ComCast Network at one point. And unfortunately also valid human users of some of my "things". So I opted not to block the ComCast ASN at that point.

      • prologic 18 hours ago

        I would be interested to hear of any other solutions that guarantee to either identify or block non-human traffic. In the "small web" and self-hosting, we typically don't really want crawlers and other similar software hitting our services, because often the software is buggy in the first place (example: Runaway Claude Bot) or we don't want our sites indexed by them at all.

      • xyzzy_plugh 18 hours ago

        Exactly. We've all been down this rabbit hole, collectively, and that's why Anubis has taken off. It works shockingly well.

        • prologic 15 hours ago

          I was planning on building a Caddy module for Anubis actually. Is anyone else interested in this?

          • vinibrito 5 hours ago

            Yes, I would! I love Caddy's set and forget nature, and with this it wouldn't be different. Especially if it could be triggered conditionally, for example based on server load or a flood being detected.

  • jadbox 19 hours ago

    How do you know it's an LLM and not a VPN? How do you use this MaxMind's database to isolate LLMs?

    • prologic 19 hours ago

      I don't distinguish actually. There are two things I do normally:

      - Block Bad Bots. There's a simple text file called `bad_bots.txt`
      - Block Bad ASNs. There's a simple text file called `bad_asns.txt`

      There's also another for blocking IP(s) and IP ranges called `bad_ips.txt`, but it's often more effective to block a much larger range of IPs (at the ASN level).

      To give you a concrete idea, here are some examples:

        $ cat etc/caddy/waf/bad_asns.txt
        # CHINANET-BACKBONE No.31,Jin-rong Street, CN
        # Why: DDoS
        4134

        # CHINA169-BACKBONE CHINA UNICOM China169 Backbone, CN
        # Why: DDoS
        4837

        # CHINAMOBILE-CN China Mobile Communications Group Co., Ltd., CN
        # Why: DDoS
        9808

        # FACEBOOK, US
        # Why: Bad Bots
        32934

        # Alibaba, CN
        # Why: Bad Bots
        45102

        # Why: Bad Bots
        28573
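
      For a rough illustration of the idea (not the actual WAF described above), here is a minimal Go sketch of ASN-level blocking as HTTP middleware, assuming MaxMind's GeoLite2-ASN database and the github.com/oschwald/geoip2-golang reader:

        package main

        import (
            "log"
            "net"
            "net/http"

            "github.com/oschwald/geoip2-golang"
        )

        // blockedASNs mirrors the bad_asns.txt idea: deny whole networks by ASN.
        var blockedASNs = map[uint]bool{4134: true, 4837: true, 9808: true, 32934: true, 45102: true, 28573: true}

        // asnBlocker rejects requests whose source IP belongs to a blocked ASN.
        func asnBlocker(db *geoip2.Reader, next http.Handler) http.Handler {
            return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                host, _, _ := net.SplitHostPort(r.RemoteAddr)
                if rec, err := db.ASN(net.ParseIP(host)); err == nil && blockedASNs[rec.AutonomousSystemNumber] {
                    http.Error(w, "forbidden", http.StatusForbidden)
                    return
                }
                next.ServeHTTP(w, r)
            })
        }

        func main() {
            db, err := geoip2.Open("GeoLite2-ASN.mmdb") // database path is an assumption
            if err != nil {
                log.Fatal(err)
            }
            defer db.Close()
            log.Fatal(http.ListenAndServe(":8080", asnBlocker(db, http.FileServer(http.Dir("./public")))))
        }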

  • runxiyu 13 hours ago

    Do you have a link to your own solution?

    • prologic 9 hours ago

      Not yet unfortunately. But if you're interested, please reach out! I currently run it in a 3-region GeoDNS setup with my self-hosted infra.

mentalgear 10 hours ago

Seems like a great idea, but it'd be nice if the project had a simple description (and didn't use so much anime, as it gives an unprofessional impression).

This is what it actually does: instead of only letting the provider bear the cost of content hosting (traffic, storage), the client also bears a cost when accessing, in the form of computation. Basically it runs additional expensive computation on the client, which makes accessing thousands of your webpages at a high rate expensive for crawlers.

> Anubis uses a proof of work in order to validate that clients are genuine. The reason Anubis does this was inspired by Hashcash, a suggestion from the early 2000's about extending the email protocol to avoid spam. The idea is that genuine people sending emails will have to do a small math problem that is expensive to compute, but easy to verify such as hashing a string with a given number of leading zeroes. This will have basically no impact on individuals sending a few emails a week, but the company churning out industrial quantities of advertising will be required to do prohibitively expensive computation.
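
To get a concrete feel for the Hashcash-style scheme described in that quote, here is a minimal Go sketch (a toy illustration, not Anubis's actual implementation): finding a nonce whose hash has enough leading zeroes takes many attempts, while verifying the result takes a single hash.

  package main

  import (
      "crypto/sha256"
      "encoding/hex"
      "fmt"
      "strings"
  )

  // solve searches for a nonce such that SHA-256(challenge + nonce) starts with
  // `difficulty` hex zeroes. Finding it takes many hashes; checking it takes one.
  func solve(challenge string, difficulty int) (int, string) {
      prefix := strings.Repeat("0", difficulty)
      for nonce := 0; ; nonce++ {
          sum := sha256.Sum256([]byte(fmt.Sprintf("%s%d", challenge, nonce)))
          if hash := hex.EncodeToString(sum[:]); strings.HasPrefix(hash, prefix) {
              return nonce, hash
          }
      }
  }

  // verify redoes a single hash to confirm the client's claimed work.
  func verify(challenge string, nonce, difficulty int) bool {
      sum := sha256.Sum256([]byte(fmt.Sprintf("%s%d", challenge, nonce)))
      return strings.HasPrefix(hex.EncodeToString(sum[:]), strings.Repeat("0", difficulty))
  }

  func main() {
      nonce, hash := solve("example-challenge", 4) // ~16^4 hashes on average
      fmt.Println("nonce:", nonce, "hash:", hash, "ok:", verify("example-challenge", nonce, 4))
  }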

tripdout 21 hours ago

The bot detection takes 5 whole seconds to solve on my phone, wow.

  • bogwog 21 hours ago

    I'm using Fennec (a Firefox fork on F-Droid) and a Pixel 9 Pro XL, and it takes around ~8 seconds at difficulty 4.

    Personally, I don't think the UX is that bad since I don't have to do anything. I definitely prefer it to captchas.

  • Hakkin 21 hours ago

    Much better than infinite Cloudflare captcha loops.

    • gruez 20 hours ago

      I've never had that, even with something like Tor Browser. You must be doing something extra suspicious like a user agent spoofer.

      • praisewhitey 19 hours ago

        Firefox with Enhanced Tracking Protection turned on is enough to trigger it.

        • aaronmdjones 16 hours ago

          You need to whitelist challenges.cloudflare.com for third-party cookies.

          If you don't do this, the third-party cookie blocking that strict Enhanced Tracking Protection enables will completely destroy your ability to access websites hosted behind CloudFlare, because it is impossible for CloudFlare to know that you have solved the CAPTCHA.

          This is what causes the infinite CAPTCHA loops. It doesn't matter how many of them you solve, Firefox won't let CloudFlare make a note that you have solved it, and then when it reloads the page you obviously must have just tried to load the page again without solving it.

          https://i.imgur.com/gMaq0Rx.png

          • genewitch 11 hours ago

            You're telling me cloudflare has to store something on my computer to let them know I passed a captcha?

            This sounds like "we only save hashed minutiae of your biometrics"

            • aaronmdjones 34 minutes ago

              > You're telling me cloudflare has to store something on my computer to let them know I passed a captcha?

              Yes?

              HTTP is stateless. It always has been and it always will be. If you want to pass state between page visits (like "I am logged in to account ..." or "My shopping cart contains ..." or "I solved a CAPTCHA at ..."), you need to be given, and return back to the server on subsequent requests, cookies that encapsulate that information, or encapsulate a reference to an identifier that the server can associate with that information.

              This is nothing new. Like gruez said in a sibling comment; this is what session cookies do. Almost every website you ever visit will be giving you some form of session cookie.
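
              As a toy illustration of that point (not CloudFlare's actual mechanism), here is a minimal Go handler that issues a clearance cookie and only recognizes the visitor again if the browser sends it back:

                package main

                import (
                    "log"
                    "net/http"
                    "time"
                )

                // handler issues a clearance cookie on the first visit and only
                // recognizes the visitor again if the browser returns that cookie.
                // Block the cookie and every request looks like a first visit.
                func handler(w http.ResponseWriter, r *http.Request) {
                    if _, err := r.Cookie("clearance"); err == nil {
                        w.Write([]byte("welcome back\n"))
                        return
                    }
                    http.SetCookie(w, &http.Cookie{
                        Name:     "clearance",
                        Value:    "opaque-token", // in practice a signed, expiring token
                        Expires:  time.Now().Add(30 * time.Minute),
                        HttpOnly: true,
                    })
                    w.Write([]byte("first visit: cookie issued\n"))
                }

                func main() {
                    http.HandleFunc("/", handler)
                    log.Fatal(http.ListenAndServe(":8080", nil))
                }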

            • zaphar 6 hours ago

              Then don't visit the site. Cloudflare is in the loop because the owner of the site wanted to buy not build a solution to the problems that Cloudflare solves. This is well within their rights and a perfectly understandable reason for Cloudflare to be there. Just as you are perfectly within your rights to object and avoid the site.

              What is not within your rights is to require the site owner to build their own solution to your specs to solve those problems or to require the site owner to just live with those problems because you want to view the content.

            • gruez 6 hours ago

              >You're telling me cloudflare has to store something on my computer to let them know I passed a captcha?

              You realize this is the same as session cookies, which are used on nearly every site, even those where you're not logging in?

              >This sounds like "we only save hashed minutiae of your biometrics"

              A randomly generated identifier is nowhere close to "hashed minutiae of your biometrics".

              • genewitch 28 minutes ago

                the idea that cloudflare doesn't know who i am without a cookie is insulting.

        • gruez 19 hours ago

          The infinite loop or the challenge appearing? I've never had problems with passing the challenge, even with ETP + RFP + ublock origin + VPN enabled.

          • cookiengineer 18 hours ago

            Cloudflare is too stupid to realize that carrier-grade NATs are very common in Germany. So there's that: sharing an IP with literally 20,000 people around me doesn't make me suspicious when it's them who trigger that behavior.

            Your assumption is that anyone at cloudflare cares. But guess what, it's a self fulfilling prophecy of a bot being blocked, because not a single process in the UX/UI allows any real user to complain about it, and therefore all blocked humans must also be bots.

            Just pointing out the flaw of bot blocking in general, because you seem to be absolutely unaware of it. Success rate of bot blocking is always 100%, and never less, because that would imply actually realizing that your tech does nothing, really.

            Statistically, the ones really using bots can bypass it easily.

            • gruez 6 hours ago

              >Cloudflare is too stupid to realize that carrier grade NATs exist a lot in Germany. So there's that, sharing an IP with literally 20000 people around me doesn't make me suspicious when it's them that trigger that behavior.

              Tor and VPNs arguably have the same issue. I use both and haven't experienced "infinite loops" with either. The same can't be said of google, reddit, or many other sites using other security providers. Those either have outright bans, or show captchas that require far more effort to solve than clicking a checkbox.

            • viraptor 8 hours ago

              If you want to try fighting it, you need to find someone with CF enterprise plan and bot management working, then get blocked and get them to report that as wrong. Yes it sucks and I'm not saying it's a reasonable process. Just in case you want to try fixing the situation for yourself.

            • xena 11 hours ago

              Honestly it's a fair assumption on bot filtering software that no more than like 8 people will share an IPv4. This is going to make IP reputation solutions hard. Argh.

      • xena 20 hours ago

        Apparently user-agent switchers don't work for fetch() requests, which means that Anubis can't work for people that do that. I know of someone who set up a version of Brave from 2022 with a user agent saying it's Chrome 150 and then complained about it not working for them.

      • megous 20 hours ago

        Proper response here is "fuck cloudflare", instead of blaming the user.

        • gruez 6 hours ago

          It's well within your rights to go out of your way to be suspicious (eg. obfuscating your user-agent). At the same time sites are within their rights to refuse service to you, just like banks can refuse service to you if you show up wearing a balaclava.

  • oynqr 21 hours ago

    Lucky. Took 30s for me.

  • nicce 21 hours ago

    For me it is like 0.5s. Interesting.

pabs3 17 hours ago

It works to block users who have JavaScript disabled, that is for sure.

  • udev4096 12 hours ago

    Exactly, it's a really poor attempt to make it appealing to the larger audience. Unless they roll out a version for no-JS, they are the same as "AI" scrapers in enshittifying the web.

throwaway150 21 hours ago

Looks cool. But please help me understand. What's to stop AI companies from solving the challenge, completing the proof of work and scrape websites anyway?

  • crq-yml 21 hours ago

    It's a strategy to redefine the doctrine of information warfare on the public Internet from maneuver (leveraged and coordinated usage of resources to create relatively greater effects) towards attrition (resources are poured in indiscriminately until one side capitulates).

    Individual humans don't care about a proof-of-work challenge if the information is valuable to them - many web sites already load slowly through a combination of poor coding and spyware ad-tech. But companies care, because that changes their ability to scrape from a modest cost of doing business into a money pit.

    In the earlier periods of the web, scraping wasn't necessarily adversarial because search engines and aggregators were serving some public good. In the AI era it's become belligerent - a form of raiding and repackaging credit. Proof of work as a deterrent was proposed to fight spam decades ago(Hashcash) but it's only now that it's really needed to become weaponized.

  • marginalia_nu 21 hours ago

    The problem with scrapers in general is the asymmetry of compute resources involved in generating versus requesting a website. You can likely make millions of HTTP requests with the compute required in generating the average response.

    If you make it more expensive to request documents at scale, you make this type of crawling prohibitively expensive. On a small scale it really doesn't matter, but if you're casting an extremely wide net and re-fetching the same documents hundreds of times, yeah it really does matter. Even if you have a big VC budget.

    • Nathanba 18 hours ago

      Yes but the scraper only has to solve it once and it gets cached too right? Surely it gets cached, otherwise it would be too annoying for humans on phones too? I guess it depends on whether scrapers are just simple curl clients or full headless browsers but I seriously doubt that Google tier LLM scrapers rely on site content loading statically without js.

      • ndiddy 16 hours ago

        AI companies have started using a technique to evade rate limits where they will have a swarm of tens of thousands of scraper bots using unique residential IPs all accessing your site at once. It's very obvious in aggregate that you're being scraped, but when it's happening, it's very difficult to identify scraper vs. non-scraper traffic. Each time a page is scraped, it just looks like a new user from a residential IP is loading a given page.

        Anubis helps combat this because even if the scrapers upgrade to running automated copies of full-featured web browsers that are capable of solving the challenges (which means it costs them a lot more to scrape than it currently does), their server costs would balloon even further because each time they load a page, it requires them to solve a new challenge. This means they use a ton of CPU and their throughput goes way down. Even if they solve a challenge, they can't share the cookie between bots because the IP address of the requestor is used as part of the challenge.
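
        As a rough sketch of that last property (not Anubis's exact scheme), the challenge can be derived from the client IP plus a server-side secret and a coarse expiry window, so a solved challenge cannot be reused from a different IP and ages out on its own:

          package main

          import (
              "crypto/hmac"
              "crypto/sha256"
              "encoding/hex"
              "fmt"
              "time"
          )

          // challengeFor derives the PoW challenge from the client IP, a server
          // secret, and a coarse time window, so a solved challenge cannot be
          // reused from another IP and expires on its own.
          func challengeFor(clientIP string, secret []byte, now time.Time) string {
              week := now.Truncate(7 * 24 * time.Hour).Unix()
              mac := hmac.New(sha256.New, secret)
              fmt.Fprintf(mac, "%s|%d", clientIP, week)
              return hex.EncodeToString(mac.Sum(nil))
          }

          func main() {
              secret := []byte("server-side secret") // placeholder
              a := challengeFor("203.0.113.7", secret, time.Now())
              b := challengeFor("198.51.100.9", secret, time.Now())
              fmt.Println(a != b) // different IPs get different challenges: true
          }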

        • Nathanba 16 hours ago

          Tens of thousands of scraper bots for a single site? Is that really the case? I would have assumed that maybe 3-5 bots send lets say 20 requests per second in parallel to scrape. Sure, they might eventually start trying different ips and bots if their others are timing out but ultimately it's still the same end result: All they will realize is that they have to increase the timeout and use headless browsers to cache results and the entire protection is gone. But yes, I think for big bot farms it will be a somewhat annoying cost increase to do this. This should really be combined with the cloudflare captcha to make it even more effective.

          • marginalia_nu 12 hours ago

            A lot of the worst offenders seem to be routing the traffic through a residential botnet, which means that the traffic really does come from a huge number of different origins. It's really janky and often the same resources are fetched multiple times.

            Saving and re-using the JWT cookie isn't that helpful, as you can effectively rate limit using the cookie as identity, so to reach the same request rates you see now they'd still need to solve hundreds or thousands of challenges per domain.

          • Hasnep 15 hours ago

            If you're sending 20 requests per second from one IP address you'll hit rate limits quickly, that's why they're using botnets to DDoS these websites.

        • vhcr 16 hours ago

          Until someone writes the proof of work code for GPUs and it runs 100x faster and cheaper.

          • marginalia_nu 10 hours ago

            A big part of the problem with these scraping operations is how poorly implemented they are. They can get a lot cheaper gains by simply cleaning up how they operate, to not redundantly fetch the same documents hundreds of times, and so on.

            Regardless of how they solve the challenges, creating an incentive to be efficient is a victory in itself. GPUs aren't cheap either, especially not if you're renting them via a browser farm.

          • runxiyu 13 hours ago

            Anubis et al. are also looking into alternative algorithms. There seems to be consensus that SHA-256 PoW is not appropriate

            • genewitch 11 hours ago

              There's lots of other ones but you want hashes that use lots of RAM, stuff like scrypt used to be the go-to but I am sure there are better, now.

      • Hakkin 17 hours ago

        It sets a cookie with a JWT verifying you completed the proof-of-work along with metadata about the origin of the request, the cookie is valid for a week. This is as far as Anubis goes, once you have this cookie you can do whatever you want on the site. For now it seems like enough to stop a decent portion of web crawlers.

        You can do more underneath Anubis using the JWT as a sort of session token though, like rate limiting on a per proof-of-work basis, if a client using X token makes more than Y requests in a period of time, invalidate the token and force them to generate a new one. This would force them to either crawl slowly or use many times more resources to crawl your content.
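
        A minimal sketch of that per-token rate limiting idea (something you would layer on top yourself, not part of Anubis): count requests per token and revoke it once it exceeds a budget, forcing the client to re-solve the proof of work.

          package main

          import (
              "fmt"
              "sync"
              "time"
          )

          // tokenLimiter counts requests per challenge token and revokes a token
          // once it exceeds maxReqs in a window, forcing a fresh proof of work.
          type tokenLimiter struct {
              mu      sync.Mutex
              counts  map[string]int
              revoked map[string]bool
              maxReqs int
          }

          func newTokenLimiter(maxReqs int, window time.Duration) *tokenLimiter {
              l := &tokenLimiter{counts: map[string]int{}, revoked: map[string]bool{}, maxReqs: maxReqs}
              go func() { // reset the counters every window
                  for range time.Tick(window) {
                      l.mu.Lock()
                      l.counts = map[string]int{}
                      l.mu.Unlock()
                  }
              }()
              return l
          }

          // Allow returns false once a token is revoked; the caller should then
          // discard the token and serve a new challenge.
          func (l *tokenLimiter) Allow(token string) bool {
              l.mu.Lock()
              defer l.mu.Unlock()
              if l.revoked[token] {
                  return false
              }
              l.counts[token]++
              if l.counts[token] > l.maxReqs {
                  l.revoked[token] = true
                  return false
              }
              return true
          }

          func main() {
              l := newTokenLimiter(100, time.Minute)
              fmt.Println(l.Allow("some-jwt")) // true until the budget is exhausted
          }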

      • FridgeSeal 15 hours ago

        It seems a good chunk of the issue with these modern LLM scrapers is that they are doing _none_ of the normal “sane” things. Caching content, respecting rate limits, using sitemaps, bothering to track explore depth properly, etc.

    • charcircuit 19 hours ago

      If you make it prohibitively expensive almost no regular user will want to wait for it.

      • xboxnolifes 18 hours ago

        Regular users usually aren't hopping through 10 pages per second. A regular user is usually 100 times slower than that.

        • pabs3 17 hours ago

          I tend to get blocked by HN when opening lots of comment pages in tabs with Ctrl+click.

          • xboxnolifes 14 hours ago

            Yes, HN has a fairly strict slow down policy for commenting. But, that's irrelevant to the context.

            • pabs3 7 hours ago

              I meant to say article pages not comment pages, but ack.

  • ndiddy 21 hours ago

    This makes it much more expensive for them to scrape because they have to run full web browsers instead of limited headless browsers without full Javascript support like they currently do. There's empirical proof that this works. When GNOME deployed it on their Gitlab, they found that around 97% of the traffic in a given 2.5 hour period was blocked by Anubis. https://social.treehouse.systems/@barthalion/114190930216801...

    • dragonwriter 19 hours ago

      > This makes it much more expensive for them to scrape because they have to run full web browsers instead of limited headless browsers without full Javascript support like they currently do. There's empirical proof that this works.

      It works in the short term, but the more people that use it, the more likely that scrapers start running full browsers.

      • sadeshmukh 18 hours ago

        Which are more expensive - you can't run as many especially with Anubis

      • SuperNinKenDo 16 hours ago

        That's the point. An individual user doesn't lose sleep over using a full browser - that's exactly how they use the web anyway - but for an LLM scraper or similar, this greatly increases costs on their end and thereby partially rebalances the power/cost imbalance. At the very least, it encourages the scrapers to externalise costs less, by not rescraping things over and over again just because they're too lazy and the weight of doing so is borne by somebody else. It's an incentive correction for the commons.

  • perching_aix 21 hours ago

    Nothing. The idea instead is that at scale the expense of solving the challenges becomes too great.

  • userbinator 16 hours ago

    This is basically the DRM wars again. Those who have vested interests in mass crawling will have the resources to blast through anything, while the legit users get subjected to more and more draconian measures.

    • SuperNinKenDo 16 hours ago

      I'll take this over a Captcha any day.

      • userbinator 14 hours ago

        CAPTCHAs don't need JS, nor does asking a question that an LLM can't answer but a human can.

        Proof-of-work selects for those with the computing power and resources to do it. Bitcoin and all the other cryptocurrencies show what happens when you place value on that.

  • ronsor 19 hours ago

    I know companies that already solve it.

    • creata 18 hours ago

      Why is spending all that CPU time to scrape the handful of sites that use Anubis worth it to them?

      • vhcr 16 hours ago

        Because it's not a lot of CPU, you only have to solve it once per website, and the default policy difficulty of 16 for bots is worthless because you can just change your user agent so you get a difficulty of 4.

    • wredcoll 17 hours ago

      I mean... knowing how to solve it isn't the trick, it's doing it a million times a minute for your firehose scraper.

      • udev4096 12 hours ago

        Anubis adds a cookie named `within.website-x-cmd-anubis-auth` which scrapers can reuse so they don't have to solve it more than once. Just have a fleet of servers whose sole purpose is to extract the cookie after solving the challenges and make sure all of them stay valid. It's not a big deal.

cookiengineer 18 hours ago

I am currently building a prototype of what I call the "enigma webfont" where I want to implement user sessions with custom seeds / rotations for a served and cached webfont.

The goal is to make web scraping unfeasible because of computational costs for OCR. It's a cat and mouse game right now and I want to change the odds a little. The HTML source would be effectively void without the user session, meaning an OTP like behavior could also make web pages unreadable once the assets go uncached.

This would allow to effectively create a captcha that would modify the local seed window until the user can read a specified word. "Move the slider until you can read the word Foxtrott", for example.

I sure would love to hear your input, Xe. Maybe we can combine our efforts?

My tech stack is go, though, because it was the only language where I could easily change the webfont files directly without issues.
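
For what it's worth, here is a toy Go sketch of only the text-substitution half of that idea, with a hypothetical per-session seed. Generating the matching per-session font, whose cmap encodes the inverse mapping so the page still renders correctly, is the hard part and is omitted; and a plain substitution like this is of course still vulnerable to frequency analysis.

  package main

  import (
      "fmt"
      "math/rand"
      "strings"
  )

  // sessionMap builds a per-session substitution over a small glyph alphabet
  // from a seed. The served HTML gets the forward mapping; the per-session
  // webfont's cmap would encode the inverse so the text renders correctly.
  func sessionMap(seed int64) map[rune]rune {
      alphabet := []rune("abcdefghijklmnopqrstuvwxyz")
      shuffled := append([]rune(nil), alphabet...)
      rand.New(rand.NewSource(seed)).Shuffle(len(shuffled), func(i, j int) {
          shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
      })
      m := make(map[rune]rune, len(alphabet))
      for i, r := range alphabet {
          m[r] = shuffled[i]
      }
      return m
  }

  // encode applies the substitution to the text before it is served.
  func encode(text string, m map[rune]rune) string {
      return strings.Map(func(r rune) rune {
          if out, ok := m[r]; ok {
              return out
          }
          return r
      }, text)
  }

  func main() {
      m := sessionMap(42) // hypothetical per-session seed
      fmt.Println(encode("hello world", m))
  }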

  • lifthrasiir 17 hours ago

    Aside from the obvious accessibility issue, wouldn't that be a substitution cipher at best? Enough corpus would make its cryptanalysis much easier.

    • cookiengineer 14 hours ago

      Well, the idea is basically the same as using AES-CBC. CBC is useless most of the time because of static rotations, but it makes cracking it more expensive.

      With the enigma webfont idea you can even just select a random seed for each user/cache session. If you map the URLs based on e.g. SHA512 URLs via the Web Crypto API, there's no cheap way of finding that out without going full in cracking mode or full in OCR/tesseract mode.

      And cracking everything first, wasting gigabytes of storage for each amount of rotations and seeds...well, you can try but at this point just ask the admin for the HTML or dataset instead of trying to scrape it, you know.

      In regards to accessibility: that's sadly the compromise I am willing to make if it's a technology that makes my specific projects human-eyes-only (literally). I am done bearing the costs for hundreds of idiots that are too damn stupid to clone my website from GitHub, let alone that they violate every license in each of their jurisdictions. If 99% of traffic is bots, it's essentially DDoSing on purpose.

      We have standards for data communication, it's just that none of these vibe coders gives a damn about building semantically correct HTML and parsers for RDF, microdata etc.

      • lifthrasiir 13 hours ago

        No, I was talking about generated fonts themselves; each glyph would have an associated set of control points which can be used to map a glyph to the correct letter. No need to run the full OCR, you need a single small OCR job per each glyph. You would need quite elaborate distortions to avoid this kind of attack, and such distortions may affect the reading experience.

        • cookiengineer 5 hours ago

          I am not sure how this would help, and I don't think I understood your argument.

          The HTML is garbage without a correctly rendered webfont that is specific to the shifts and replacements in the source code itself. The source code does not contain the source of the correct text, only the already shifted text.

          Inside the TTF/OTF files themselves each letter is shifted, meaning that the letters only make sense once you know the seed for the multiple shifts, and you cannot map 1:1 the glyphs in the font to anything in the HTML without it.

          The web browser here is pretty easy to trick, because it will just replace the glyphs available in the font, and fallback to the default font if they aren't available. Which, by concept, also allows partial replacements and shifts for further obfuscation if needed, additionally you can replace whole glyph sequences with embedded ligatures, too.

          The seed can therefore be used as an instruction mapping, instead of only functioning as a byte sequence for a single static rotation. (Hence the reference to enigma)

          How would control points in the webfont files be able to map it back?

          If you use multiple rotations like in Enigma, then that is essentially the seed (e.g. 3, 74, 8, 627, whatever shifts applied after each other). The only attack I know about would be alphabet statistical analysis, but that won't work once the characters include special characters outside the ASCII range, because you won't know where words start or end.

    • creata 17 hours ago

      There's probably something horrific you can do with TrueType to make it more complex than a substitution cipher.

      • cookiengineer 14 hours ago

        The hint I want to give you is: unicode and ligatures :) they're awesome in the worst sense. Words can be ligatures, too, btw.

      • lifthrasiir 17 hours ago

        GSUB rules are inherently local, so for example the same cryptanalysis approach should work for space-separated words instead of letters. A polyalphabetic cipher would work better but that means you can't ever share the same internal glyph for visually same but differently encoded letters.

  • rollcat 12 hours ago

    The problem isn't as much that the websites are scraped (search engines have been doing this for over three decades), it's the request volume that brings the infrastructure down and/or costs up.

    I don't think mangling the text would help you, they will just hit you anyway. The traffic patterns seem to indicate that whoever programmed these bots, just... <https://www.youtube.com/watch?v=ulIOrQasR18>

    > I sure would love to hear your input, Xe. Maybe we can combine our efforts?

    From what I've gathered, they need help in making this project more sustainable for the near and far future, not to add more features. Anubis seems to be doing an excellent job already.

deknos 13 hours ago

I wish, there was also an tunnel software (client+server) where

* the server appears on the outside as an https server/reverse proxy
* the server supports self-signed certificates or letsencrypt
* when a client goes to a certain (sub)site or route, http auth can be used
* after http auth, all traffic tunneled over that subsite/route is protected against traffic analysis, for example like obfsproxy does it

Does anyone know something like that? I am tempted to ask xeiaso to add such features, but i do not think his tool is meant for that...

  • rollcat 12 hours ago

    Your requirements are quite specific, and HTTP servers are built to be generic and flexible. You can probably put something together with nginx and some Lua, aka OpenResty: <https://openresty.org/>

    > his

    I believe it's their.

    • deknos 11 hours ago

      ups, yes, sorry, their.

  • immibis 11 hours ago

    Tor's Webtunnel?

    • deknos 11 hours ago

      but i do not want to go OVER tor, i just want a service over clearnet? or is this something else? do you have an URL?

      • immibis 6 hours ago

        I presume the protocol can be separated from Tor itself and I also presume this standalone thing doesn't exist yet.

        In any situation, you're going to need some custom client code to route your traffic through the tunnel you opened, so I'm not sure why the login page that opens the tunnel needs to be browser-compatible?

udev4096 12 hours ago

PoW captchas are not new. What's different with Anubis? How can it possibly prevent "AI" scrapers if the bots have enough compute to solve the PoW challenge? AI companies have quite a lot of GPUs at their disposal and I wouldn't be surprised if they used it for getting around PoW captchas

  • relistan 12 hours ago

    The point is to make it expensive to crawl your site. Anyone determined to do so is not blocked. But why would they be determined to do so for some random site? The value to the AI crawler likely does not match the cost to crawl it. It will just move on to another site.

    So the point is not to be faster than the bear. It’s to be faster than your fellow campers.

    • genewitch 11 hours ago

      Why not have them hash pow for btc then?

      • sprremix 4 hours ago

        Why must everything involve $'s?

        • genewitch 27 minutes ago

          because there's a lot of rhetoric about how this "balances the imbalance between serving a request and making that request" and if we're having them do sha256, why not have them do sha256(sha256(data+random nonce)) and potentially earn the site owner some money?

pabs3 17 hours ago

Recently I heard of a site blocking bot requests with a message telling the bot to download the site via Bittorrent instead.

Seems like a good solution to the badly behaved scrapers, and I feel like the web needs to move away from the client-server model towards a swarm model like Bittorrent anyway.

  • seba_dos1 16 hours ago

    Even if these stupid bots would just learn to clone git repos instead of crawling through GitLab UI pages it would already be helpful.

snvzz 9 hours ago

My Amiga 1200 hates these tools.

It is really sad that the worldwide web has been taken to the point where this is needed.

babuloseo 18 hours ago

Nice will try to deploy to my sites after I eat some mac and cheese

matt3210 17 hours ago

A package which includes the cool artwork would be awesome

  • xena 16 hours ago

    You mean with the art assets extracted?

      $ mkdir -p ./tmp/anubis/static && anubis --extract-resources=./tmp/anubis/static

perching_aix 21 hours ago

> Sadly, you must enable JavaScript to get past this challenge. This is required because AI companies have changed the social contract around how website hosting works. A no-JS solution is a work-in-progress.

Will be interested to hear of that. In the meantime, at least I learned of JShelter.

Edit:

Why not use the passage of time as the limiter? I guess it would still require JS though, unless there's some hack possible with CSS animations, like request an image with certain URL params only after an animation finishes.

This does remind me how all of these additional hoops are making web browsing slow.

Edit #2:

Thinking even more about it, time could be made a hurdle by just.. slowly serving incoming requests. No fancy timestamp signing + CSS animations or whatever trickery required.

I'm also not sure if time would make at-scale scraping as much more expensive as PoW does. Time is money, sure, but that much? Also, the UX of it I'm not sold on, but could be mitigated somewhat by doing news website style "I'm only serving the first 20% of my content initially" stuff.

So yeah, will be curious to hear the non-JS solution. The easy way out would be a browser extension, but then it's not really non-JS, just JS compartmentalized, isn't it?

Edit #3:

Turning reasoning on for a moment, this whole thing is a bit iffy.

First of all, the goal is that a website operator would be able to control the use of information they disseminate to the general public via their website, such that it won't be used specifically for AI training. In principle, this is nonsensical. The goal of sharing information with the general public (so, people) involves said information eventually traversing through a non-technological medium (air, as light), to reach a non-technological entity (a person). This means that any technological measure will be limited to before that medium, and won't be able to affect said target either. Put differently, I can rote copy your website out into a text editor, or hold up a camera with OCR and scan the screen, if scale is needed.

So in principle we're definitely hosed, but in practice you can try to hold onto the modality of "scraping for AI training" by leveraging the various technological fingerprints of such activity, which is how we get to at-scale PoW. But then this also combats any other kind of at-scale scraping, such as search engines. You could whitelist specific search engines, but then you're engaging in anti-competitive measures, since smaller third party search engines now have to magically get themselves on your list. And even if they do, they might be lying about being just a search engine, because e.g. Google may scrape your website for search, but will 100% use it for AI training then too.

So I don't really see any technological modality that would be able properly discriminate AI training purposed scraping traffic for you to use PoW or other methods against. You may decide to engage in this regardless based on statistical data, and just live with the negative aspects of your efforts, but then it's a bit iffy.

Finally, what about the energy consumption shaped elephant in the room? Using PoW for this is going basically exactly against the spirit of wanting less energy to be spent on AI and co. That said, this may not be a goal for the author.

The more I think about this, the less sensible and agreeable it is. I don't know man.

  • Philpax 20 hours ago

    > First of all, the goal is that a website operator would be able to control the use of information they disseminate to the general public via their website, such that it won't be used specifically for AI training.

    This isn't the goal; the goal is to punish/demotivate poorly-behaved scrapers that hammer servers instead of moderating their scraping behaviour. At least a few of the organisations deploying Anubis are fine with having their data scraped and being made part of an AI model.

    They just don't like having their server being flooded with non-organic requests because the people making the scrapers have enough resources that they don't have to care that they're externalising the costs of their malfeasance on the rest of the internet.

    • perching_aix 20 hours ago

      Ah, thanks for the clarification. I guess then it pulling a double duty against all scraping in general is not a flaw either.

  • abetusk 20 hours ago

    Time delay, as you proposed, is easily defeated by concurrent connections. In some sense, you're sacrificing latency without sacrificing throughput.

    A bot network can make many connections at once, waiting until the timeout to get the entirety of their (multiple) request(s). Every serial delay you put in is a minor inconvenience to a bot network, since they're automated anyway, but a degrading experience for good faith use.

    Time delay solutions get worse for services like posting, account creation, etc. as they're sidestepped by concurrent connections that can wait out the delay to then flood the server.

    Requiring proof-of-work costs the agent something in terms of resources. The proof-of-work certificate allows for easy verification (in terms of compute resources) relative to the amount of work to find the certificate in the first place.

    A small resource tax on agents has minimal effect on everyday use but has compounding effect for bots, as any bot crawl now needs resources that scale linearly with the number of pages that it requests. Without proof-of-work, the limiting resource for bots is network bandwidth, as processing page data is effectively free relative to bandwidth costs. By requiring work/energy expenditure to requests, bots now have a compute as a bottleneck.

    As an analogy, consider if sending an email would cost $0.01. For most people, the number of emails sent over the course of a year could easily cost them less than $20.00, but for spam bots that send email blasts of up to 10k recipients, this now would cost them $100.00 per shot. The tax on individual users is minimal but is significant enough so that mass spam efforts are strained.

    It doesn't prevent spam, or bots, entirely, but the point is to provide some friction that's relatively transparent to end users while mitigating abusive use.

  • marginalia_nu 21 hours ago

    You basically need proof-of-work to make this work. Idling a connection is not computationally expensive, so is not a deterrent.

    It's a shitty solution to an even shittier reality.

    • xena 20 hours ago

      Main author of Anubis here:

      Basically what they said. This is a hack, and it's specifically designed to exploit the infrastructure behind industrial-scale scraping. They usually have a different IP address do the scraping for each page load _but share the cookies between them_. This means that if they use headless chrome, they have to do the proof of work check every time, which scales poorly with the rates I know the headless chrome vendors charge for compute time per page.

      • ArinaS 12 hours ago

        Is there any particular date/time you'll introduce a no-JS solution?

        And are you going to support older browsers? I tested Anubis with https://www.browserling.com with its (I think) standard configuration at https://git.xeserv.us/xe/anubis-test/src/branch/main/README.... and apparently it doesn't work with Firefox versions before 74 and Chromium versions before 80.

        I wonder if it works with something like Pale Moon.

        • xena 12 hours ago

          It will be sooner if I can get paid enough to be able to quit my day job.

      • vhcr 16 hours ago

        I used to have an ISP that would load balance your connection between different providers, this meant that pretty much every single request would use a different IP. I know it's not that common, but that would mean real users would find pages using anubis unusable.

      • lifthrasiir 19 hours ago

        Do you think that, if this behavior of Anubis gets well known and Anubis cookies are specifically handled to avoid pathological PoW checks, Anubis would need a significant rework? Because if that's indeed true, this hack wouldn't last much longer, and I have no further idea how to avoid user-visible annoyances.

        • solid_fuel 19 hours ago

          Well, if they rework things so that requests all originate from the same IP address or a small set of addresses, then regular IP-based rate limits should work fine right?

          The point is just to stop what is effectively a DDoS because of shitty web crawlers, not to stop the crawling entirely.

          • lifthrasiir 18 hours ago

            > Well, if [...], then regular IP-based rate limits should work fine right?

            I'm not sure. IP-based rate limits have a well-known issue with shared public IPs, for example. Technically they are also more resource-intensive than cryptographic approaches (but I don't think that's a big issue in IPv4).

          • dharmab 17 hours ago

            > then regular IP-based rate limits should work fine right?

            These are also harmful to human users, who are often behind CGNAT and may be sharing a pool of IPs with many thousands of other ISP subscribers.

      • specialist 16 hours ago

        > Weigh the soul of incoming HTTP requests using proof-of-work to stop AI crawlers

        Based on the comments here, it seems like many people are struggling with the concept.

        Would calling Anubis a "client-side rate limiter" be accurate (enough)?

1oooqooq 11 hours ago

tries to block abusive companies using infinite money glitch from clueless investors, by making every request cost a few fractions of a cent more.

... yeah, that will totally work.

apt-apt-apt-apt 18 hours ago

Since Anubis is related to AI, the part below read as contradictory at first. As if too many donations would cause the creator to disappear off to Tahiti along with the product development.

"If you are using Anubis .. please donate on Patreon. I would really love to not have to work in generative AI anymore..."

userbinator 16 hours ago

Yes, it just worked to stop me, an actual human, from seeing what you wanted to say... and I'm not interested enough to find a way around it that doesn't involve cozying up to Big Browser. At least CloudFlare's discrimination can be gotten around without JS.

Wouldn't it be ironic if the amount of JS served to a "bot" costs even more bandwidth than the content itself? I've seen that happen with CF before. Also keep in mind that if you anger the wrong people, you might find yourself receiving a real DDoS.

If you want to stop blind bots, perhaps consider asking questions that would easily trip LLMs but not humans. I've seen and used such systems for forum registrations to prevent generic spammers, and they are quite effective.

  • userbinator 14 hours ago

    Looks like I struck a nerve. Big Browser, hello ;-)