Sun, 10 Nov 2024 16:38:11 -0800
Andy from
private IP, post #16994410
A deep dive into the Boyer-Moore string search algorithm, which is critical for
this site
When I was writing the backend for the new JDU, I had a choice between a
database-driven web application like the vast majority of bulletin boards, or a
novel filesystem-based approach that leverages the power of command line tools
and file storage on Linux. Initially, I was concerned that the sheer number of
invocations of the Linux command line tool "grep" would be processor-intensive
and result in long load times. I was wrong.

The Boyer-Moore string search algorithm, which GNU grep uses for fixed-string
searches, is BLINDINGLY FAST. As in, your mind cannot conceive how fast the
algorithm is, even on millions of lines of text. It gets that speed, in part,
by pre-processing the pattern so that it can skip over stretches of text that
could never contain a match. For example, let's say you have the pattern
"i-am-awesome" and you're looking for it in a 100 MB text file. Rather than
slowly reading the file character by character, grep notes that the pattern is
12 characters long and slides a 12-character window along the text. At each
position it compares from the *end* of the window backward: the 12th character
first, then the 11th, and so on. On a mismatch, it doesn't inch forward by one
character; it consults a table built from the pattern during pre-processing to
see how far the window can safely jump, and when the mismatched text character
appears nowhere in the pattern, that jump is the full 12 characters. Most of
the file is never examined at all. This is an unbelievably fast way to match
patterns.
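To make the skip rule concrete, here is a toy sketch of the Horspool variant
of Boyer-Moore (the "bad character" rule only), written as an inline awk
program. This is just an illustration, not grep's actual source code, and
big.txt is a stand-in file name:

    awk -v pat='i-am-awesome' '
    BEGIN {
        m = length(pat)
        # Build the skip table: for each pattern character except the last,
        # record how far the window may jump when that character sits under
        # the window end.
        for (i = 1; i < m; i++) skip[substr(pat, i, 1)] = m - i
    }
    {
        n = length($0); pos = 0
        while (pos <= n - m) {
            j = m
            # Compare from the END of the window, moving backward.
            while (j >= 1 && substr($0, pos + j, 1) == substr(pat, j, 1)) j--
            if (j == 0) { print FNR ": " $0; next }
            c = substr($0, pos + m, 1)
            # Jump by the precomputed skip, or the full pattern length when
            # the character under the window end is not in the pattern.
            pos += (c in skip) ? skip[c] : m
        }
    }' big.txt

Race that against the real thing with "time grep -cF i-am-awesome big.txt" and
watch grep win by a mile; the full Boyer-Moore algorithm adds a second
(good-suffix) rule on top of this, and grep layers plenty of low-level tuning
besides.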
That speed is one reason why every page on this site loads in between 20 and
200 ms, which is better performance than I would get from a large SQL
database.

I've also written a few optimizations, such as caching the number of replies
to each post and updating that number anytime a reply is posted, rather than
searching through the files to count the replies every time someone loads the
index.
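Roughly like this, with made-up paths, since this post doesn't spell out
Mosaic's actual file layout:

    # Hypothetical sketch of the reply-count cache. When a reply is posted,
    # bump the cached counter instead of re-counting files on every index load.
    count_file=posts/16994410/reply_count
    old=$(cat "$count_file" 2>/dev/null || echo 0)   # cached count, 0 if absent
    echo $((old + 1)) > "$count_file"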
I also came up with the technique of reading the last 500 lines of the web
server log and grepping them for registered usernames to see who has been
active on the site recently. That was freaking genius.
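Something along these lines, assuming a usernames file with one name per line
(the log path is a guess):

    # Last 500 requests, matched against registered usernames, deduplicated.
    tail -n 500 /var/log/httpd/access_log \
        | grep -oFf usernames.txt \
        | sort -u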
The load on the server is virtually zero at all times thanks to my techniques
and the design of Mosaic, which is what I am calling this system. There are
many improvements and new features coming down the pipe. I'm sorry for the
boast, but I really am the greatest lawyer-programmer in the world.
https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm
#Programming #Technology
Mon, 11 Nov 2024 08:20:13 -0800
Andy from
private IP
Reply #19466453

The most important command line tools I use on here are grep, cut, and tail.
That's most of what it takes to run this site. Cut is an interesting tool
because it slices a text file on a delimiter and hands you just the fields you
specify. Tail is great because you can take a huge text file, seek straight to
the end, and get the last 500 lines or whatever other number you want. All of
these tools are highly optimized and ideal for my use case of a
filesystem-based BBS. Individually they don't do very much, but combined they
are powerful, as the sketches below show.
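For instance (the file names here are invented for illustration):

    # cut slices each line on a delimiter and keeps only the fields you name:
    cut -d '|' -f 1,3 posts.db

    # tail seeks to the end of even a huge file and reads just the last lines:
    tail -n 500 /var/log/httpd/access_log

    # Combined they do real work: the client address (field 1) of the last
    # 500 requests, counted and ranked:
    tail -n 500 /var/log/httpd/access_log | cut -d ' ' -f 1 | sort | uniq -c | sort -rn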
I also use the find command, which is a very fast way of walking the
filesystem and running grep, cut, and/or tail on the files it turns up.
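For example, to fan grep out over every post file (the directory layout here
is invented):

    # List every post file under posts/ that contains the fixed string.
    # 'find ... -exec cmd {} +' batches file names, much like xargs would.
    find posts/ -type f -name '*.txt' -exec grep -lF 'i-am-awesome' {} +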
Linux is the best platform I have ever used in nearly 40 years of using
computers. When I switched my desktop workstation over to Red Hat Enterprise
Linux, it confirmed for me that RHEL 9 makes Windows look like a toy. RHEL 9
also destroys macOS in the area of customization, although macOS is arguably
easier to live with. It's an ongoing process for me, and I learn something new
almost every day.