字典树 & 字符串哈希

10 Jul 2020 2342字 8分
CC BY 4.0 （除特别声明或转载文章外）
如果这篇博客帮助到你，可以请我喝一杯咖啡~

题目

面试题 17.13. Re-Space LCCI

Oh, no! You have accidentally removed all spaces, punctuation, and capitalization in a lengthy document. A sentence like “I reset the computer. It still didn’t boot!” became “iresetthecomputeritstilldidntboot’’. You’ll deal with the punctuation and capitalization later; right now you need to re-insert the spaces. Most of the words are in a dictionary but a few are not. Given a dictionary (a list of strings) and the document (a string), design an algorithm to unconcatenate the document in a way that minimizes the number of unrecognized characters. Return the number of unrecognized characters.

Note: This problem is slightly different from the original one in the book.

Example:

Input: 
dictionary = ["looked","just","like","her","brother"]
sentence = "jesslookedjustliketimherbrother"
Output:  7
Explanation:  After unconcatenating, we got "jess looked just like tim her brother", which containing 7 unrecognized characters.

Note:

0 <= len(sentence) <= 1000
The total number of characters in dictionary is less than or equal to 150000.
There are only lowercase letters in dictionary and sentence.

方法一：Tire + 动态规划

字典树 Tire 。

是一种最大程度利用多个字符串前缀信息的数据结构，它可以在 O(w) 的时间复杂度内判断一个字符串是否是一个字符串集合中某个字符串的前缀。w代表字符串的长度。

简单来说，是一种树形结构，保存的是每个节点往后的生长情况。

感觉第一次这样写树的数据结构，有点怪，可以在类里面直接实现一个树结构？真是学到了！

class Tire{
public:
    Tire* next[26] = {nullptr};
    bool isEnd;

    Tire(){
        isEnd = false;
    }
    void insert(string& s){
        Tire* cur = this;

        for(int i=s.length()-1;i>=0;--i){
            Tire* nt = cur->next[s[i]-'a'];
            if(nt == nullptr){
                cur->next[s[i]-'a'] = new Tire();
            }
            cur = cur->next[s[i]-'a'];
        }
        cur->isEnd = true;
    }

    ~Tire(){
        Tire* cur = this;
        clear(cur);
    }
    void clear(Tire* & cur){
        if(cur == nullptr) return;
        for(int i=0;i<26;++i){
            if(cur->next[i]){
                Tire* t = cur->next[i];
                clear(t);
            }
        }
        delete cur;
    }

};

class Solution{
public:
    int replace(vector<string>& dictonary, string sentence){
        int n = sentence.length(), inf = INT_MAX;
        Tire* root = new Tire();

        for(auto& word: dictonary){
            root->insert(word);
        }

        vector<int> dp(n+1,inf);
        dp[0] = 0;

        for(int i=1;i<=n;++i){
            dp[i] = dp[i-1] + 1;
            Tire* curPos = root;

            for(int j=i;j>=1;--j){
                int t = sentence[j-1] - 'a';
                if(curPos->next[t] == nullptr) break;

                if(curPos->next[t]->isEnd){
                    dp[i] = min(dp[j-1],dp[i]);
                }

                if(dp[i] == 0){
                    break;
                }

                curPos = curPos->next[t];

            }
        }
        return dp[n];
    }


};

字符串哈希

把每个字符串映射成一个值；

取一个 base 进制数，比如 abc，取 3 进制数，012；然后算出该数；但同时又设定一个 mod ；取算出来的数取mod后的结果，因为算出来的数可能会太大以至于超出表示范围。

降低哈希碰撞：选择 base >

(E为字符集合)； mod 为大于 base 的一个质数；当然为了降低哈希碰撞的概率，可以多取几个 mod，比较多个哈希值来判断字符串是否相等；

比如

static constexpr long long _MOD_ = (1LL << 31) - 1;
static constexpr long long _BASE_ = 173;