Sun, 10 Nov 2024 16:38:11 -0800
Andy from
private IP, post #16994410
A deep dive into the Boyer-Moore string search algorithm, which is critical for
this site
When I was writing the backend for the new JDU, I had a choice between a
database-driven web application like the vast majority of bulletin boards, or a
novel filesystem-based approach that leverages the power of command line tools
and file storage on Linux. Initially, I was concerned that the sheer number of
invocations of the Linux command line tool "grep" would be processor-intensive
and result in long load times. I was wrong.

The Boyer-Moore string search algorithm, which GNU grep uses for fixed-string
searches, is BLINDINGLY FAST. As in, your mind cannot conceive how fast the
algorithm is, even on millions of lines of text. It gets that speed, in part,
by pre-processing the pattern so that it can skip over stretches of text that
could never contain a match. For example, let's say you have the pattern
"i-am-awesome" and you're looking for it in a 100 MB text file. Rather than
slowly reading the file character by character, grep notes that the pattern is
12 characters long and slides a 12-character window along the text. At each
position it compares from the *end* of the window backward: the 12th character
first, then the 11th, and so on. On a mismatch, it doesn't inch forward by one
character; it consults a table built from the pattern during pre-processing to
see how far the window can safely jump, and when the mismatched text character
appears nowhere in the pattern, that jump is the full 12 characters. Most of
the file is never examined at all. This is an unbelievably fast way to match
patterns.
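To make the skip rule concrete, here is a toy sketch of the Horspool variant
of Boyer-Moore (the "bad character" rule only), written as an inline awk
program. This is just an illustration, not grep's actual source code, and
big.txt is a stand-in file name:

    awk -v pat='i-am-awesome' '
    BEGIN {
        m = length(pat)
        # Build the skip table: for each pattern character except the last,
        # record how far the window may jump when that character sits under
        # the window end.
        for (i = 1; i < m; i++) skip[substr(pat, i, 1)] = m - i
    }
    {
        n = length($0); pos = 0
        while (pos <= n - m) {
            j = m
            # Compare from the END of the window, moving backward.
            while (j >= 1 && substr($0, pos + j, 1) == substr(pat, j, 1)) j--
            if (j == 0) { print FNR ": " $0; next }
            c = substr($0, pos + m, 1)
            # Jump by the precomputed skip, or the full pattern length when
            # the character under the window end is not in the pattern.
            pos += (c in skip) ? skip[c] : m
        }
    }' big.txt

Race that against the real thing with "time grep -cF i-am-awesome big.txt" and
watch grep win by a mile; the full Boyer-Moore algorithm adds a second
(good-suffix) rule on top of this, and grep layers plenty of low-level tuning
besides.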
That speed is one reason why every page on this site loads in between 20 and
200 ms, which is better performance than I would get from a large SQL
database.

I've also written a few optimizations, such as caching the number of replies
to each post and updating that number anytime a reply is posted, rather than
searching through the files to count the replies every time someone loads the
index.
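Roughly like this, with made-up paths, since this post doesn't spell out
Mosaic's actual file layout:

    # Hypothetical sketch of the reply-count cache. When a reply is posted,
    # bump the cached counter instead of re-counting files on every index load.
    count_file=posts/16994410/reply_count
    old=$(cat "$count_file" 2>/dev/null || echo 0)   # cached count, 0 if absent
    echo $((old + 1)) > "$count_file"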
I also came up with the technique of reading the last 500 lines of the web
server log and grepping them for registered usernames to see who has been
active on the site recently. That was freaking genius.
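Something along these lines, assuming a usernames file with one name per line
(the log path is a guess):

    # Last 500 requests, matched against registered usernames, deduplicated.
    tail -n 500 /var/log/httpd/access_log \
        | grep -oFf usernames.txt \
        | sort -u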
The load on the server is virtually zero at all times thanks to my techniques
and the design of Mosaic, which is what I am calling this system. There are
many improvements and new features coming down the pipe. I'm sorry for the
boast, but I really am the greatest lawyer-programmer in the world.
https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm
#Programming #Technology
Mon, 11 Nov 2024 08:20:13 -0800
Andy from
private IP
Reply #19466453

The most important command line tools I use on here are grep, cut, and tail.
That's most of what it takes to run this site. Cut is an interesting tool
because it slices a text file on a delimiter and hands you just the fields you
specify. Tail is great because you can take a huge text file, seek straight to
the end, and get the last 500 lines or whatever other number you want. All of
these tools are highly optimized and ideal for my use case of a
filesystem-based BBS. Individually they don't do very much, but combined they
are powerful, as the sketches below show.
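For instance (the file names here are invented for illustration):

    # cut slices each line on a delimiter and keeps only the fields you name:
    cut -d '|' -f 1,3 posts.db

    # tail seeks to the end of even a huge file and reads just the last lines:
    tail -n 500 /var/log/httpd/access_log

    # Combined they do real work: the client address (field 1) of the last
    # 500 requests, counted and ranked:
    tail -n 500 /var/log/httpd/access_log | cut -d ' ' -f 1 | sort | uniq -c | sort -rn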
I also use the find command, which is a very fast way of walking the
filesystem and running grep, cut, and/or tail on the files it turns up.
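For example, to fan grep out over every post file (the directory layout here
is invented):

    # List every post file under posts/ that contains the fixed string.
    # 'find ... -exec cmd {} +' batches file names, much like xargs would.
    find posts/ -type f -name '*.txt' -exec grep -lF 'i-am-awesome' {} +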
Linux is the best platform I have ever used in nearly 40 years of using
computers. When I switched my desktop workstation over to Red Hat Enterprise
Linux, it confirmed for me that RHEL 9 makes Windows look like a toy. RHEL 9
also destroys macOS in the area of customization, although macOS is arguably
easier to live with. It's an ongoing process for me, and I learn something new
almost every day.