Trie树数据结构

本文内容为转载翻译及汇总，相关文章包括：
http://dongxicheng.org/structure/trietree/
https://linux.thai.net/~thep/datrie/datrie.html

Trie树基础

Trie is an efficient indexing method. It is indeed also a kind of deterministic finite automaton (DFA).
Trie树可以利用字符串的公共前缀来节约存储空间。如下图所示，该trie树用10个节点保存了5个字符串{A,C,D}, {A,E,G}, {A,E,L}, {A,E,M}, {K,M,N}：

在该trie树中，字符串A E，{A,E,G}和{A,E,M}的公共前缀是“A E”，因此可以只存储一份“A E”以节省空间。当然，如果系统中存在大量字符串且这些字符串基本没有公共前缀，则相应的trie树将非常消耗内存，这也是trie树的一个缺点。
Trie树的基本性质可以归纳为：
（1）根节点不包含字符，除根节点意外每个节点只包含一个字符。
（2）从根节点到某一个节点，路径上经过的字符连接起来，为该节点对应的字符串。
（3）每个节点的所有子节点包含的字符串不相同。

基本实现

字母树的插入（Insert）、删除（ Delete）和查找（Find）都非常简单，用一个一重循环即可，即第i 次循环找到前i 个字母所对应的子树，然后进行相应的操作。实现这棵字母树，我们用最常见的数组保存（静态开辟内存）即可，当然也可以开动态的指针类型（动态开辟内存）。至于结点对儿子的指向，一般有三种方法：
1、对每个结点开一个字母集大小的数组，对应的下标是儿子所表示的字母，内容则是这个儿子对应在大数组上的位置，即标号；
2、对每个结点挂一个链表，按一定顺序记录每个儿子是谁；
3、使用左儿子右兄弟表示法记录这棵树。三种方法，各有特点。第一种易实现，但实际的空间要求较大；第二种，较易实现，空间要求相对较小，但比较费时；第三种，空间要求最小，但相对费时且不易写。
下面给出动态开辟内存的实现：

#define MAX_NUM 26
enum NODE_TYPE{ //"COMPLETED" means a string is generated so far.
  COMPLETED,
  UNCOMPLETED
};
struct Node {
  enum NODE_TYPE type;
  char ch;
  struct Node* child[MAX_NUM]; //26-tree->a, b ,c, .....z
};
 
struct Node* ROOT; //tree root
 
struct Node* createNewNode(char ch){
  // create a new node
  struct Node *new_node = (struct Node*)malloc(sizeof(struct Node));
  new_node->ch = ch;
  new_node->type == UNCOMPLETED;
  int i;
  for(i = 0; i < MAX_NUM; i++)
    new_node->child[i] = NULL;
  return new_node;
}
 
void initialization() {
//intiazation: creat an empty tree, with only a ROOT
ROOT = createNewNode(' ');
}
 
int charToindex(char ch) { //a "char" maps to an index<br>
return ch - 'a';
}
 
int find(const char chars[], int len) {
  struct Node* ptr = ROOT;
  int i = 0;
  while(i < len) {
   if(ptr->child[charToindex(chars[i])] == NULL) {
   break;
  }
  ptr = ptr->child[charToindex(chars[i])];
  i++;
  }
  return (i == len) && (ptr->type == COMPLETED);
}
 
void insert(const char chars[], int len) {
  struct Node* ptr = ROOT;
  int i;
  for(i = 0; i < len; i++) {
   if(ptr->child[charToindex(chars[i])] == NULL) {
    ptr->child[charToindex(chars[i])] = createNewNode(chars[i]);
  }
  ptr = ptr->child[charToindex(chars[i])];
}
  ptr->type = COMPLETED;
}

Trie树的高级实现

Trie作为一种高效的索引方式，代表一种过度表，行代表状态，列代表过度表标签，在通过二维数组来存储计算时会有内存的浪费，其中大部分的节点只有很少的树枝，使得表中的空值很多，因此需要进行压缩存储。
In general, a DFA is represented with a transition table, in which the rows correspond to the states, and the columns correspond to the transition labels. The data kept in each cell is then the next state to go for a given state when the input is equal to the label.

1.Tripple-Array Trie Tripple-Array 的结构由以下三部分组成：

base:在base中的每一个元素对应着Trie树中的每一个节点，对于一个节点s，base［s］时起始索引对于过度表中某一行的节点s包含于next和check中。
next：该数组和next配合，提供了稀疏向量的存储池
check：该数组和next平行工作，标记了每个网格中next的所有者

Definition 1. For a transition from state s to t which takes character c as the input, the condition maintained in the tripple-array trie is:

check[base[s] + c] = s
next[base[s] + c] = t

根据定义1，确定状态s 并输入字符c

  t := base[s] + c;

  if check[t] = s then
      next state := next[t]
  else
      fail
  endif

2.Double-Array Trie 可以采用双数组（Double-Array）实现。利用双数组可以大大减小内存使用量,在双数组的实现中node被直接通过check和next链接。
Definition 2. For a transition from state s to t which takes character c as the input, the condition maintained in the double-array trie is:

check[base[s] + c] = s
base[s] + c = t

根据定义2，确定状态s 并输入字符c

t := base[s] + c;

  if check[t] = s then
      next state := t
  else
      fail
  endif

前缀树，是一种用于快速检索的多叉树结构

Trie树基础

基本实现

Trie树的高级实现

CATALOG