Sat, 03 May 2025 20:16:32 -0700
Andy from private IP, post #11564496
👍
/all
Emergency maintenance: an unbelievable number of bots are crawling this site and are clogging up my log file
This is ridiculous. 95% of the traffic today is bots with forged user-agent strings pretending to be various operating systems. They are executing search queries that each take 750 milliseconds, which makes the machine work unreasonably hard. I'm going to work on this right now, so there may be some errors; sorry about that.
#Programming #Technology
Sat, 03 May 2025 20:55:47 -0700
Andy from private IP
Reply #13033143
Done. I temporarily disabled the post ID and reply ID links, which is where the traffic was coming from. This is insane. I shouldn't have to pay for A.I.
systems to learn by aggressively crawling my site pretending to be real users. This is the same problem Wikipedia had, on a smaller scale.
Sat, 03 May 2025 21:05:47 -0700
Andy from private IP
Reply #18363684
I also improved something else that should slightly increase the performance of the site. I have been copying the whole web server log every five minutes, then
reading the last 1000 lines of the file to get the active user list. That is very inefficient with a log file that is growing larger by the day and is now 160
megabytes. So I've set it to simply copy the last 2000 lines of the log file every five minutes, which is a much smaller task. I'm testing it now and it works
perfectly. No joke, I am the greatest lawyer-programmer in the world.
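For the curious, the change boils down to something like this (the paths here are placeholders, not my real ones):

# old approach: copy the whole 160 MB log, then read the tail of the copy
cp /var/log/httpd/access.log /tmp/access-working.log
tail -n 1000 /tmp/access-working.log > /path/to/active-users-input.txt

# new approach: skip the full copy and pull only the last 2000 lines
tail -n 2000 /var/log/httpd/access.log > /path/to/active-users-input.txt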
Sun, 04 May 2025 01:38:10 -0700
Wily from private IP
Reply #15454681
Those must be the 327 guests. I really doubt there are more than 10 lurkers here at any time, if that.
Sun, 04 May 2025 07:31:15 -0700
Andy from private IP
Reply #18966924
There are usually about 100 guests on the site on a particular day. I already filter out bots and spiders when counting the guests, but that requires the
letters "bot" or "spider" to appear in the user-agent string. When they give a fake user-agent string, I cannot do that and will have to come up with another
method. Either way, suddenly spiking to 300+ guests can only be explained by automated processes. They are probably training A.I. systems on user-generated content because it is authentic and represents valuable training data for LLMs. I disabled the most-abused feature they were using on here, so maybe the site becomes less interesting to these offenders.
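The current count is roughly this shape (a simplified sketch; the real script differs, and this greps the whole log line rather than isolating the user-agent field):

# count distinct client IPs in the recent slice of the log, skipping
# anything that self-identifies as a bot or spider
tail -n 2000 /var/log/httpd/access.log \
  | grep -iv -e 'bot' -e 'spider' \
  | awk '{print $1}' | sort -u | wc -l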
Sun, 04 May 2025 10:42:23 -0700
phosita from private IP
Reply #12740141
Sounds like maybe you need to do some logfile management in the same way that, e.g., syslog gets rotated:
foo_log.txt
foo_log.0.txt.gz
foo_log.1.txt.gz
[ad infinitum]
Then later if you ever need to reconstitute the entire log (or, for that matter, only certain regions thereof) you can zcat the chunks together >>
some_output.txt
For getting current users, you could roughly do:
tail -f your_webserver_log.txt | some_daemon.sh > recent_users.txt
...wherein some_daemon.sh just does a window function on what it gets on stdin.
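Rough, untested sketch of what I mean by some_daemon.sh (it writes the file itself instead of relying on the redirect above, and it assumes the client IP is the first field of each log line):

#!/bin/sh
# keep a rolling window of the last 2000 log lines seen on stdin and
# rewrite recent_users.txt with the distinct client IPs in that window
BUF=$(mktemp)
trap 'rm -f "$BUF" "$BUF.new"' EXIT
while IFS= read -r line; do
    printf '%s\n' "$line" >> "$BUF"
    tail -n 2000 "$BUF" > "$BUF.new" && mv "$BUF.new" "$BUF"
    awk '{print $1}' "$BUF" | sort -u > recent_users.txt
done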
Or you could stop fucking with plain text files and use a database. :)
Sun, 04 May 2025 10:44:51 -0700
Andy from private IP
Reply #16230194
@phositaTest Thanks, I don't have log rotation configured at the moment. I
might rotate the logs once a year for compliance purposes. I'm happy with the current setup, but the daemon idea is interesting. I'll take a look at that :)
Sun, 04 May 2025 13:06:31 -0700
phosita from private IP
Reply #17889646
Rando thoughts:
On the forged user-agent strings: (a) if you ever move the site to Cloudflare then that will be automagically done for you (I know because I used to run wget
against some stuff and it stopped working due to Cloudflare's bot detection); and (b) my intuition is that on the webserver level you can probably find a module
for doing roughly this on your own machine.
Honestly I haven't checked whether you're using Apache but if so then this is probably food for thought. (If you aren't already doing this.)
https://stackoverflow.com/questions/51972679/how-to-block-a-specific-user-agent-in-apache
Also, question: what exactly are you complying with by rotating the slash.law webserver logs annually? Or at all?
Sun, 04 May 2025 13:34:06 -0700
Andy from private IP
Reply #19288102
@phositaTest The retention of logs is so I can look back at my leisure and
determine whether anything unusual is happening. The logs go back to June 2024, so I figure a year is enough time. Compliance, so to speak, is just for
internal purposes.
Sun, 04 May 2025 18:37:54 -0700
phosita from private IP
Reply #13037371
👍
Just for grins I am trying the copy-and-tail maneuver complained about supra. As a sample file I just concatenated /var/log/syslog onto itself a bunch of times
and limited it to 160 MiB.
On what I'd call modest CPU and pretty decent disk:
(a) copying the sample file in place takes < 0.2 sec wall clock.
(b) tailing the last 2,000 lines and grepping for some pattern which will never be found takes 0.006 sec.
I must be missing something. If you copy-and-tail every 5 minutes you will never be more than approximately 5 minutes in arrears, as long as whatever processing you do on the tailed-off 2000 lines doesn't take much time. What ops are you doing on those 2000 lines that take so long as to add unacceptable delay or load?
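For the record, what I timed was roughly this (sample file name and grep pattern made up for the test):

# (a) copy the 160 MiB sample file in place
time cp sample_log.txt sample_log_copy.txt
# (b) tail the last 2,000 lines and grep for a pattern that never matches
time tail -n 2000 sample_log.txt | grep 'pattern-that-will-never-match'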
Coming at the copying operation a bit differently, using ZFS:
step 1: take a snapshot;
step 2: clone the snapshot.
Steps 1 and 2 together take ~140 msec, and I think that's guaranteed atomic in a way /usr/bin/cp isn't.
Adding step 3, destroying the clone, costs ~90 msec. Meanwhile, any processes reading from or writing to the dataset the snapshot was taken from are none the wiser. Creation of the clone, and subsequent reading of the clone, do not consume disk space.
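Concretely, with made-up dataset names, the three steps are roughly:

# step 1: atomic snapshot of the dataset holding the log
zfs snapshot tank/logs@now
# step 2: clone the snapshot so it can be browsed like a normal filesystem
zfs clone tank/logs@now tank/logs_clone
# ...read the clone at leisure...
# step 3: destroy the clone (and the snapshot) when done
zfs destroy tank/logs_clone
zfs destroy tank/logs@now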
Sun, 04 May 2025 21:18:07 -0700
Andy from private IP
Reply #15652421
@phositaTest That's about what I expected. I only perform a few operations on
the last 1,000 lines of the log file, so I was mainly considering the wastefulness of copying the full log file every five minutes. That's about 100 ms of disk
activity that I don't need when all I have to do is tail the last 2000 lines of the file into my web folder and analyze it. Just for the hell of it, I timed
the copy vs. the tail and here are my results on XFS with a RAID 5 of three enterprise SSDs:
[raellic@www-andrewwatters-com JDU]$ time sudo cp /var/log/httpd/slash-ssl-access.log ./foo
real 0m0.088s
user 0m0.008s
sys 0m0.080s
[raellic@www-andrewwatters-com JDU]$ time sudo tail -2000 /var/log/httpd/slash-ssl-access.log | wc -l
2000
real 0m0.018s
user 0m0.006s
sys 0m0.014s
Copying takes 88 ms and tailing takes 18 ms, so tailing the last 2000 lines of the log file is roughly five times as fast as copying the full file. Thus, I can
decrease the user list interval with no penalty-- instead of being five minutes in arrears, I can tail the live log file if I really want to. My original
intention was to create a working copy of the log file and slice it up however I want, but so far it's very straightforward operations that do not require a
working copy.
As you can see, each page on this site is generated in roughly 25 ms to 750 ms. The index page has been holding steady at 440 ms or so, most of which goes to using the sed utility to extract the first couple of lines of each post. If I cached the post titles instead, the index would be around ten times faster. Haven't gotten around to doing that.
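The extraction is basically this pattern, repeated per post (the file layout shown is illustrative, not my real one):

# current approach: on every index load, pull the first couple of lines of
# each post file with sed
sed -n '1,2p' posts/$POSTID.txt
# the caching idea: write the excerpt out once when the post is created,
# then the index just reads the small cached title file
sed -n '1,2p' posts/$POSTID.txt > cache/$POSTID.title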
I'm very happy with what I consider to be a blazing fast interface. I especially like the fact that I don't use any SQL-- or any database, for that matter.
Everything is done in file operations using standard Linux command line tools that I integrated into my own high performance BBS. It has been an interesting
experience so far, and I intend to continue gradually improving my system, which I call MOSAIC.
Sun, 04 May 2025 22:50:44 -0700
phosita from private IP
Reply #17694389
Interesting. I tried on two different nvme destinations: a first which is a ZFS mirror of two HGST nvme enterprise drives, and a second which is just a prosumer single drive with ext4. The former is only a smidge faster than the cp I did earlier. The single nvme + ext4 did the copy in ~100 msec -- not bad for prosumer nothin'-special storage.
Tailing is stupid fast on all my shit. Ahhh, wait a sec -- to be fair I should flush the filesystem caches. Bhahah, dumbass. Ok, ~10 msec after rebooting.
Fairer.
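(For next time: the caches can be dropped without a full reboot, something along the lines of:)

# flush dirty pages to disk, then drop the page cache, dentries, and inodes
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches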
Don't want to go down the disk benching rabbit hole at this hour. I meannnn, I do, but....