String Searching

Imagine you are a Site Reliability Engineer (SRE) managing a massive distributed system. A critical failure occurs, and you need to parse gigabytes of server logs to find the exact timestamp of an OutOfMemoryError. String searching (or pattern matching) is the core algorithmic problem of finding a pattern string within a massive text stream. It’s the engine behind your browser’s Ctrl+F, your IDE’s global search, and massive log analytics platforms.

1. The Problem

Given:

  • Text (T): A longer string of length N.
  • Pattern (P): A shorter string of length M.

Goal: Return the index of the first occurrence of P in T, or -1 if not found.

2. The Naive Approach (Brute Force)

The simplest way to solve this is to check every possible position in the text.

  1. Align the pattern with the beginning of the text.
  2. Compare characters one by one.
  3. If a mismatch occurs, shift the pattern one step to the right.
  4. Repeat until a match is found or the end of the text is reached.

Time Complexity

  • Best Case: O(N) (e.g., Pattern found at start).
  • Worst Case: O(N * M) (e.g., Text=”AAAA…B”, Pattern=”AAAB”).
  • Space Complexity: O(1) (no extra space needed).

3. The Mathematical Anchor: Rabin-Karp & Rolling Hashes

To avoid re-scanning every character, Rabin-Karp uses a mathematical trick: Rolling Hashes.

The Hash Formula

Instead of comparing strings character-by-character, we compare their numerical fingerprint. For a string S of length M, the hash H is: H = (S[0] · BM-1 + S[1] · BM-2 + … + S[M-1] · B0) mod Q (Where B is a base, usually a prime like 31, and Q is a large prime to avoid overflow).

The “Rolling” Part

When we shift the window from [i] to [i+1], we don’t re-calculate the whole hash. We:

  1. Subtract the value of T[i].
  2. Multiply by base B.
  3. Add the value of T[i+M]. This transformation takes O(1) time, allowing us to find matches in O(N) average time.

4. The Mathematical Anchor: KMP & The LPS Array

The Knuth-Morris-Pratt (KMP) algorithm asks: “If I already matched ABAB, and the next char is a mismatch, how much of my progress can I keep?”

The Longest Prefix Suffix (LPS)

We pre-process the pattern to find the longest proper prefix that is also a suffix. P = “ABABC”

  • “A”: 0
  • “AB”: 0
  • “ABA”: 1 (Prefix “A”, Suffix “A”)
  • “ABAB”: 2 (Prefix “AB”, Suffix “AB”)
  • “ABABC”: 0

When a mismatch occurs at index j, we don’t start i over. We look at LPS[j-1] and resume from there. This ensures we never backtrack in the text string, guaranteeing O(N) worst-case performance.


Watch how the brute force algorithm slides the pattern window across the text. Notice how it resets the comparison for every shift.

Text (T)
Pattern (P)
Shift: 0
Comparisons: 0
Click Step to start searching.

6. Implementation

The implementation is straightforward but serves as a baseline for more advanced algorithms like Knuth-Morris-Pratt (KMP) or Rabin-Karp.

Java

public class NaiveSearch {
  public static int search(String text, String pattern) {
    int n = text.length();
    int m = pattern.length();

    // Loop through text
    for (int i = 0; i <= n - m; i++) {
      int j;
      // Loop through pattern
      for (j = 0; j < m; j++) {
        if (text.charAt(i + j) != pattern.charAt(j)) {
          break; // Mismatch found, break inner loop
        }
      }
      // If we reached the end of pattern, we found a match
      if (j == m) {
        return i;
      }
    }
    return -1; // Not found
  }

  public static void main(String[] args) {
    String text = "ABABDABACDABABCABAB";
    String pattern = "ABABCABAB";
    System.out.println("Found at index: " + search(text, pattern));
  }
}

Go

package main

import "fmt"

func Search(text, pattern string) int {
  n := len(text)
  m := len(pattern)

  // Loop through text
  for i := 0; i <= n-m; i++ {
    j := 0
    // Loop through pattern
    for j < m {
      if text[i+j] != pattern[j] {
        break // Mismatch
      }
      j++
    }
    // If j reached m, full match found
    if j == m {
      return i
    }
  }
  return -1
}

func main() {
  text := "ABABDABACDABABCABAB"
  pattern := "ABABCABAB"
  fmt.Println("Found at index:", Search(text, pattern))
}

Limitations

Notice in the visualization that when a mismatch happens, we shift by just one position. We often re-scan characters we’ve already seen. Advanced algorithms optimize this by shifting more than one step based on the mismatch.


7. Deep Dive Strategy Lab: String Searching

Case Study: Log Stream Keyword Detection (PEDALS Framework)

Imagine we are tasked with building the search component for a centralized logging platform (similar to Splunk or ELK). Millions of log lines are streaming in per second, and users can set up alerts to trigger when a specific error keyword (e.g., FATAL_DB_LOCK) appears in the stream.

Process Requirements

  • Core Task: Detect if a pattern string P exists within an incoming log stream text T.
  • Constraint: The stream is massive and continuous. We cannot afford O(N*M) worst-case time, as a cleverly formed or malicious log line could spike CPU usage and delay the entire pipeline.
  • Goal: Find a linear O(N) search algorithm.

Estimate

  • Let’s say the log string T is 10,000 characters long, and our pattern P is 15 characters.
  • A Naive search in the worst case (e.g., T=”AAAAA…B”, P=”AAAB”) would take 10,000 × 15 = 150,000 comparisons per line.
  • With an O(N) algorithm, we only do ~10,000 comparisons. This is a 15x speedup per line, crucial for high throughput.

Data Model

  • The text is represented as a character array (or byte array for ASCII logs).
  • We need an auxiliary data structure to optimize the search. For KMP, this is the LPS (Longest Prefix Suffix) array.

Architecture

We will adopt the KMP Algorithm to ensure strict O(N) bounds.

  1. Pre-process: Analyze the pattern P once to build the LPS array. This tells us exactly how far to safely skip ahead when a mismatch occurs.
  2. Stream Processing: As we read T, we use the LPS array to avoid re-evaluating characters we’ve already matched. We never backtrack the index pointer in T.

Localized Details

Why not just use indexOf() or regular expressions? Built-in library functions often use optimized versions of Naive search (like SIMD instructions to scan bytes quickly). While incredibly fast on average, they can still degrade to O(N*M) on pathological inputs. KMP mathematically guarantees O(N), acting as a safeguard against algorithmic complexity attacks in a public-facing system.

Scale

To scale this across a distributed system:

  • Partition the log stream across multiple worker nodes (e.g., via Kafka).
  • Each worker runs the O(N) search independently on its assigned chunk of logs.
  • If patterns frequently change (dynamic alerts), the O(M) pre-processing cost of KMP is still negligible compared to the massive log volume N.

Intuition Through Analogy

Think of KMP like reading a book with a smart bookmark. In the Naive approach, if you lose your train of thought while looking for a specific phrase, you start over from the very beginning of the page. With KMP, your “bookmark” (the LPS array) tells you exactly which word to jump back to, so you never have to re-read what you’ve already processed.