Idris2Doc : Data.ByteString.Search.Internal.Utils

Data.ByteString.Search.Internal.Utils

(source)
Utilities for string searching algorithms

Definitions

kmpBorders : (bs : ByteString) ->F1s (MArrays (S (lengthbs)) Nat)
  Computes the suffix-oriented KMP border table for a given pattern.

Each entry at index i (0 ≤ i ≤ length pattern) stores the length of the
longest proper prefix of the prefix pattern[0..i-1] that is also a
suffix. This “border” is used to determine how far to backtrack in
pattern matching when a mismatch occurs.

Unlike the standard KMP table, this table is suffix-oriented and
built in a descending, structurally recursive manner.

The table helps efficiently skip positions in the pattern during
substring search, while descending from longer prefixes to shorter ones.

Example: ANPANMAN"

Indices: 0..8

Prefixes: "" "A" "AN" "ANP" "ANPA" "ANPAN" "ANPANM" "ANPANMA" "ANPANMAN"
Borders: 0 0 0 0 1 2 0 1 2

Totality: total
Visibility: export
automaton : (bs : ByteString) ->F1s (MArrays (mult (plus (lengthbs) 1) 256) Nat)
  Builds a deterministic finite automaton (DFA) for pattern matching over a `ByteString`.

The automaton encodes transitions from (state, input byte) → next state,
allowing efficient streaming search for the pattern within input data.

It produces a flattened transition table of size `((length pattern) + 1) * 256`,
where 256 corresponds to all possible byte values (0–255).

States correspond to pattern prefixes:
- State 0: no match (empty prefix)
- State i: matched the first i bytes of the pattern
- State (length pattern): full match

Transition behavior is derived from the KMP border table (`kmpBorders`),
ensuring correct fallback transitions and eliminating redundant backtracking.

Example: "ANPANMAN"

These following equation is used to determine the "flat" index to build the automaton:

flatindex = (state ∗ alphabetsize) + charcode

Where:

state : Range from 0 to length of the input pattern

alphabetsize : All possible input characters (in this case extended ASCII, 8-bit range from 0 to 255)

charcode : Characters are interpreted via its ASCII code ('A' = 65, 'M' = 77, 'N' = 78, 'P' = 80, and so on)

| Flat index | State | Char code | Char | Meaning |
| ---------- | ----- | --------- | ---- | ------------- |
| 65 | 0 | 65 | 'A' | δ(0, 'A') = 1 |
| 321 | 1 | 65 | 'A' | δ(1, 'A') = 1 |
| 334 | 1 | 78 | 'N' | δ(1, 'N') = 2 |
| 577 | 2 | 65 | 'A' | δ(2, 'A') = 1 |
| 592 | 2 | 80 | 'P' | δ(2, 'P') = 3 |
| 833 | 3 | 65 | 'A' | δ(3, 'A') = 4 |
| 1089 | 4 | 65 | 'A' | δ(4, 'A') = 1 |
| 1102 | 4 | 78 | 'N' | δ(4, 'N') = 5 |
| 1345 | 5 | 65 | 'A' | δ(5, 'A') = 1 |
| 1357 | 5 | 77 | 'M' | δ(5, 'M') = 6 |
| 1601 | 6 | 65 | 'A' | δ(6, 'A') = 7 |
| 1857 | 7 | 65 | 'A' | δ(7, 'A') = 1 |
| 1870 | 7 | 78 | 'N' | δ(7, 'N') = 8 |
| 2113 | 8 | 65 | 'A' | δ(8, 'A') = 1 |

Totality: total
Visibility: export
occurrences : (bs : ByteString) ->F1s (MArrays256Int)
  Constructs a lookup table recording the last occurrence of each byte
in the given pattern.

For every byte value, the table stores the index of its last
occurrence within the pattern, excluding the final position.

This information allows for efficient computation of how far the search
window can safely shift after a mismatch.

When a mismatch occurs at pattern position (position in pattern) on byte (b),
the pattern can be shifted right by at least:

(position in pattern) - (last occurrence of b in initial pattern)

If the byte b does not appear anywhere in the pattern, the search
window can shift so that the pattern starts immediately after the
mismatched byte, resulting in a default shift of 1.

This table is typically used in Boyer–Moore–style pattern matching
algorithms to determine optimal skip distances after mismatches.

O((length of pattern) + (alphabet size))

Example: "ANPANMAN"

| Flat index / ASCII | char | value |
| ------------------ | ---- | ----- |
| 65 | 'A' | -6 |
| 77 | 'M' | -5 |
| 78 | 'N' | -4 |
| 80 | 'P' | -2 |

Totality: total
Visibility: export
suffixLengths : (bs : ByteString) ->F1s (MArrays (lengthbs) Int)
  Builds the table of suffix lengths for the given pattern.

The value at index `i` is the length of the longest common suffix
between the entire pattern and the prefix of the pattern ending at `i`.

Typically, most entries are 0. Only when the byte at position `i`
matches the final byte of the pattern can the value be positive.

The final entry (at `patEnd`) equals the pattern length, since the
pattern is identical to itself. In general, `0 <= ar[i] <= i + 1`.

To ensure linear preprocessing, the algorithm avoids the naive
quadratic approach by reusing information from previously identified
suffixes.

When the current index lies within an already known suffix, we align
that suffix with the end of the pattern and check whether it extends
beyond the current position. If so, we reuse the stored suffix length;
otherwise, we extend the suffix explicitly.

If the current index lies outside any known suffix, we compare against
the final byte of the pattern. If this yields a suffix of length > 1,
we enter the “known suffix” case for subsequent indices; otherwise,
we continue scanning normally.

Example : "ANPANMAN"

Raw suffix-lengths array used to compute the good suffix shift table

| i | pat[i] | matches pattern end? | diff = patEnd - i | nextI = i-1 | prevI (dec diff nextI) | ar[i] |
| - | ------ | -------------------- | ----------------- | ----------- | ---------------------- | ----- |
| 0 | A | No | - | - | - | 0 |
| 1 | N | Yes | 6 | 0 | -1 | 2 |
| 2 | P | No | - | - | - | 0 |
| 3 | A | No | - | - | - | 0 |
| 4 | N | Yes | 3 | 3 | 2 | 2 |
| 5 | M | No | - | - | - | 0 |
| 6 | A | No | - | - | - | 0 |
| 7 | N | - | - | - | - | 8 |

Totality: total
Visibility: export
suffixShifts : (bs : ByteString) ->F1s (MArrays (lengthbs) Int)
  Table of suffix-shifts

When a mismatch occurs at pattern position patpos, assumed to be not the
last position in the pattern, the suffix u of length (patend - patpos)
has been successfully matched.
Let c be the byte in the pattern at position patpos.

If the sub-pattern u also occurs in the pattern somewhere *not* preceded
by c, let upos be the position of the last byte in u for the last of
all such occurrences. Then there can be no match if the window is shifted
less than (patend - upos) places, because either the part of the string
which matched the suffix u is not aligned with an occurrence of u in the
pattern, or it is aligned with an occurrence of u which is preceded by
the same byte c as the originally matched suffix.

If the complete sub-pattern u does not occur again in the pattern, or all
of its occurrences are preceded by the byte c, then we can align the
pattern with the string so that a suffix v of u matches a prefix of the
pattern. If v is chosen maximal, no smaller shift can give a match, so
we can shift by at least (patlen - length v).

If a complete match is encountered, we can shift by at least the same
amount as if the first byte of the pattern was a mismatch, no complete
match is possible between these positions.

For non-periodic patterns, only very short suffixes will usually occur
again in the pattern, so if a longer suffix has been matched before a
mismatch, the window can then be shifted entirely past the partial
match, so that part of the string will not be re-compared.
For periodic patterns, the suffix shifts will be shorter in general,
leading to an O(strlen * patlen) worst-case performance.

To compute the suffix-shifts, we use an array containing the lengths of
the longest common suffixes of the entire pattern and its prefix ending
with position pos.

Example: "ANPANMAN"

| idx | suff[idx] | target = patEnd - suff[idx] | value = patEnd - idx | ar after write |
| --- | --------- | --------------------------- | -------------------- | ----------------- |
| 0 | 0 | 7 - 0 = 7 | 7 - 0 = 7 | [6,6,6,6,6,6,8,7] |
| 1 | 2 | 7 - 2 = 5 | 7 - 1 = 6 | [6,6,6,6,6,6,8,7] |
| 2 | 0 | 7 | 7 - 2 = 5 | [6,6,6,6,6,6,8,5] |
| 3 | 0 | 7 | 7 - 3 = 4 | [6,6,6,6,6,6,8,4] |
| 4 | 2 | 7 - 2 = 5 | 7 - 4 = 3 | [6,6,6,6,6,3,8,4] |
| 5 | 0 | 7 | 7 - 5 = 2 | [6,6,6,6,6,3,8,2] |
| 6 | 0 | 7 | 7 - 6 = 1 | [6,6,6,6,6,3,8,1] |

Totality: total
Visibility: export