Data.ByteString.Search.Internal.Utils

(source)

Utilities for string searching algorithms

Definitions

kmpBorders : (bs : ByteString) -> F1 s (MArray s (S (length bs)) Nat)

  Computes the suffix-oriented KMP border table for a given pattern.
  
  Each entry at index i (0 ≤ i ≤ length pattern) stores the length of the
  longest proper prefix of the prefix pattern[0..i-1] that is also a
  suffix. This “border” is used to determine how far to backtrack in
  pattern matching when a mismatch occurs.
  
  Unlike the standard KMP table, this table is suffix-oriented and
  built in a descending, structurally recursive manner.
  
  The table helps efficiently skip positions in the pattern during
  substring search, while descending from longer prefixes to shorter ones.
  
  Example: ANPANMAN"
  
  Indices: 0..8
  
  Prefixes: ""   "A"   "AN"   "ANP"  "ANPA"  "ANPAN"  "ANPANM"  "ANPANMA"  "ANPANMAN"
  Borders:  0    0     0      0      1       2        0         1          2

Totality: total
Visibility: export

automaton : (bs : ByteString) -> F1 s (MArray s (mult (plus (length bs) 1) 256) Nat)

  Builds a deterministic finite automaton (DFA) for pattern matching over a `ByteString`.
  
  The automaton encodes transitions from (state, input byte) → next state,
  allowing efficient streaming search for the pattern within input data.
  
  It produces a flattened transition table of size `((length pattern) + 1) * 256`,
  where 256 corresponds to all possible byte values (0–255).
  
  States correspond to pattern prefixes:
  - State 0: no match (empty prefix)
  - State i: matched the first i bytes of the pattern
  - State (length pattern): full match
  
  Transition behavior is derived from the KMP border table (`kmpBorders`),
  ensuring correct fallback transitions and eliminating redundant backtracking.
  
  Example: "ANPANMAN"
  
  These following equation is used to determine the "flat" index to build the automaton:
  
  flatindex = (state ∗ alphabetsize) + charcode
  
  Where:
  
  state : Range from 0 to length of the input pattern
  
  alphabetsize : All possible input characters (in this case extended ASCII, 8-bit range from 0 to 255)
  
  charcode : Characters are interpreted via its ASCII code ('A' = 65, 'M' = 77, 'N' = 78, 'P' = 80, and so on)
  
  | Flat index | State | Char code | Char | Meaning       |
  | ---------- | ----- | --------- | ---- | ------------- |
  | 65         | 0     | 65        | 'A'  | δ(0, 'A') = 1 |
  | 321        | 1     | 65        | 'A'  | δ(1, 'A') = 1 |
  | 334        | 1     | 78        | 'N'  | δ(1, 'N') = 2 |
  | 577        | 2     | 65        | 'A'  | δ(2, 'A') = 1 |
  | 592        | 2     | 80        | 'P'  | δ(2, 'P') = 3 |
  | 833        | 3     | 65        | 'A'  | δ(3, 'A') = 4 |
  | 1089       | 4     | 65        | 'A'  | δ(4, 'A') = 1 |
  | 1102       | 4     | 78        | 'N'  | δ(4, 'N') = 5 |
  | 1345       | 5     | 65        | 'A'  | δ(5, 'A') = 1 |
  | 1357       | 5     | 77        | 'M'  | δ(5, 'M') = 6 |
  | 1601       | 6     | 65        | 'A'  | δ(6, 'A') = 7 |
  | 1857       | 7     | 65        | 'A'  | δ(7, 'A') = 1 |
  | 1870       | 7     | 78        | 'N'  | δ(7, 'N') = 8 |
  | 2113       | 8     | 65        | 'A'  | δ(8, 'A') = 1 |

Totality: total
Visibility: export

occurrences : (bs : ByteString) -> F1 s (MArray s 256 Int)

  Constructs a lookup table recording the last occurrence of each byte
  in the given pattern.
  
  For every byte value, the table stores the index of its last
  occurrence within the pattern, excluding the final position.  
  
  This information allows for efficient computation of how far the search
  window can safely shift after a mismatch.
  
  When a mismatch occurs at pattern position (position in pattern) on byte (b),
  the pattern can be shifted right by at least:
  
  (position in pattern) - (last occurrence of b in initial pattern)
  
  If the byte b does not appear anywhere in the pattern, the search
  window can shift so that the pattern starts immediately after the
  mismatched byte, resulting in a default shift of 1.
  
  This table is typically used in Boyer–Moore–style pattern matching
  algorithms to determine optimal skip distances after mismatches.
  
  O((length of pattern) + (alphabet size))
  
  Example: "ANPANMAN"
  
  | Flat index / ASCII | char | value |
  | ------------------ | ---- | ----- |
  |        65          | 'A'  |    -6 |
  |        77          | 'M'  |    -5 |
  |        78          | 'N'  |    -4 |
  |        80          | 'P'  |    -2 |

Totality: total
Visibility: export

suffixLengths : (bs : ByteString) -> F1 s (MArray s (length bs) Int)

  Builds the table of suffix lengths for the given pattern.
  
  The value at index `i` is the length of the longest common suffix
  between the entire pattern and the prefix of the pattern ending at `i`.
  
  Typically, most entries are 0. Only when the byte at position `i`
  matches the final byte of the pattern can the value be positive.
  
  The final entry (at `patEnd`) equals the pattern length, since the
  pattern is identical to itself. In general, `0 <= ar[i] <= i + 1`.
  
  To ensure linear preprocessing, the algorithm avoids the naive
  quadratic approach by reusing information from previously identified
  suffixes.
  
  When the current index lies within an already known suffix, we align
  that suffix with the end of the pattern and check whether it extends
  beyond the current position. If so, we reuse the stored suffix length;
  otherwise, we extend the suffix explicitly.
  
  If the current index lies outside any known suffix, we compare against
  the final byte of the pattern. If this yields a suffix of length > 1,
  we enter the “known suffix” case for subsequent indices; otherwise,
  we continue scanning normally.
  
  Example : "ANPANMAN"
  
  Raw suffix-lengths array used to compute the good suffix shift table
  
  | i | pat[i] | matches pattern end? | diff = patEnd - i | nextI = i-1 | prevI (dec diff nextI) | ar[i] |
  | - | ------ | -------------------- | ----------------- | ----------- | ---------------------- | ----- |
  | 0 |    A   |          No          |         -         |      -      |            -           |   0   |
  | 1 |    N   |          Yes         |         6         |      0      |           -1           |   2   |
  | 2 |    P   |          No          |         -         |      -      |            -           |   0   |
  | 3 |    A   |          No          |         -         |      -      |            -           |   0   |
  | 4 |    N   |          Yes         |         3         |      3      |            2           |   2   |
  | 5 |    M   |          No          |         -         |      -      |            -           |   0   |
  | 6 |    A   |          No          |         -         |      -      |            -           |   0   |
  | 7 |    N   |          -           |         -         |      -      |            -           |   8   |

Totality: total
Visibility: export

suffixShifts : (bs : ByteString) -> F1 s (MArray s (length bs) Int)

  Table of suffix-shifts
  
  When a mismatch occurs at pattern position patpos, assumed to be not the
  last position in the pattern, the suffix u of length (patend - patpos)
  has been successfully matched.
  Let c be the byte in the pattern at position patpos.
  
  If the sub-pattern u also occurs in the pattern somewhere *not* preceded
  by c, let upos be the position of the last byte in u for the last of
  all such occurrences. Then there can be no match if the window is shifted
  less than (patend - upos) places, because either the part of the string
  which matched the suffix u is not aligned with an occurrence of u in the
  pattern, or it is aligned with an occurrence of u which is preceded by
  the same byte c as the originally matched suffix.
  
  If the complete sub-pattern u does not occur again in the pattern, or all
  of its occurrences are preceded by the byte c, then we can align the
  pattern with the string so that a suffix v of u matches a prefix of the
  pattern. If v is chosen maximal, no smaller shift can give a match, so
  we can shift by at least (patlen - length v).
  
  If a complete match is encountered, we can shift by at least the same
  amount as if the first byte of the pattern was a mismatch, no complete
  match is possible between these positions.
  
  For non-periodic patterns, only very short suffixes will usually occur
  again in the pattern, so if a longer suffix has been matched before a
  mismatch, the window can then be shifted entirely past the partial
  match, so that part of the string will not be re-compared.
  For periodic patterns, the suffix shifts will be shorter in general,
  leading to an O(strlen * patlen) worst-case performance.
  
  To compute the suffix-shifts, we use an array containing the lengths of
  the longest common suffixes of the entire pattern and its prefix ending
  with position pos.
  
  Example: "ANPANMAN"
  
  | idx | suff[idx] | target = patEnd - suff[idx] | value = patEnd - idx |    ar after write |
  | --- | --------- | --------------------------- | -------------------- | ----------------- |
  |   0 |         0 |                   7 - 0 = 7 |            7 - 0 = 7 | [6,6,6,6,6,6,8,7] |
  |   1 |         2 |                   7 - 2 = 5 |            7 - 1 = 6 | [6,6,6,6,6,6,8,7] |
  |   2 |         0 |                           7 |            7 - 2 = 5 | [6,6,6,6,6,6,8,5] |
  |   3 |         0 |                           7 |            7 - 3 = 4 | [6,6,6,6,6,6,8,4] |
  |   4 |         2 |                   7 - 2 = 5 |            7 - 4 = 3 | [6,6,6,6,6,3,8,4] |
  |   5 |         0 |                           7 |            7 - 5 = 2 | [6,6,6,6,6,3,8,2] |
  |   6 |         0 |                           7 |            7 - 6 = 1 | [6,6,6,6,6,3,8,1] |

Totality: total
Visibility: export