Square: Shortest Unique Prefix
Problem:
Given a list of words, return the shortest unique prefix of each word. For example, given the list:
dog
cat
apple
apricot
fist
Return the list:
d
c
app
apr
f
Simple Hash-based Solution:
It’s easy to develop a simple hash-based solution. Define a std::unordered_map
from string to list of integers (each integer is index of a word in the original input). Add every prefix of every word in the map, and later iterate over the map to find all keys corresponding to singleton lists.
What’s the complexity of this algorithm? Let’s assume that an average length of the string is O(M)
and there are O(N)
strings. You are storing every prefix of every string in an unordered_map
and querying it later. Since there are O(M)
prefixes for every string, the total complexity of the algorithm is O(MN)
. This, of course, assumes that the insert and search through unordered_map
takes constant time.
How about the space complexity? The total character space taken by all prefixes of a string is O(M²). Since there are N
strings, the total space complexity is O(M²N).
We also run a risk of hash collision and unordered_map
search/insert running in non-constant time. Can we improve upon this algorithm?
Your Own Data-structure-based Solution:
In this section, we will define a new data structure, Prefix Tree
, which will
- Save space by cleverly storing prefixes of all strings
- Save time by early stopping in the search of prefixes, and also by avoiding hash collisions.
Here is an outline of how the tree looks like:
- Every character in a string forms a node of the tree.
- The next character to the right in the string forms its child node.
- Thus, every path in the tree from root to a node spells a prefix of the tree.
- Every node contains the index of the word in the array it belongs to.
- If a node belongs to multiple indices, we keep the first of them, and set an additional boolean flag,
repeated
to true.
Prefix tree for a two-word example [apple, apricot]
looks like this:
root
|
'a' (repeated=true)
|
'p' (repeated=true)
|
-----------------------------------
| |
'p' (repeated=false) 'r' (repeated=false)
| |
'l' (repeated=false) 'i' (repeated=false)
| |
'e' (repeated=false) 'c' (repeated=false)
|
'o' (repeated=false)
|
't' (repeated=false)
Here is the struct definition of the Prefix Tree:
Here is how we can construct the PrefixTree from a vector of strings (Let’s assume that the characters in the words only come from lowercase English letters {a, b, ..., z}
):
Finally, here is a BFS-like algorithm to search through the prefix tree, and populate the output shortest-prefix array:
- Insert the
root
of the tree in a BFS queue. - At every iteration, pop the front of the queue (call it
curr
). iterate over all the children ofcurr
. If a child nodeq
belongs to a unique word (repeated=false
), construct the unique prefix corresponding to the path from root to that node, and save it in theword_index
th location in the output. - Else, push the new node back in the queue.
What’s the complexity of this algorithm? The algorithm has two components: (1) Prefix tree creation, and (2) prefix tree traversal. In both components, we visit every character for every word once. For every character, we perform a constant amount of work (updating a few variables, or iterating over a constant-size array). As a result, the complexity of this algorithm is O(MN)
. And here is the kick: we store a constant amount of data per character per word (The Node
struct has O(1)
size)!. As a result, the space complexity of this algorithm is also O(MN)
.
Testing:
Here are a few test cases in increasing order of difficulty:
- Empty word-list
- Singleton list
- A list where every word is a prefix of the next word
- A random list, e.g. the one given in the problem statement
Originally published at https://cppcodingzen.com on September 6, 2020.