algorithm - Searching and appending concatenated strings -
i have file contains concatenated strings.
find_or_add(string)
either:
- returns offset of occurrence of string in file (not first)
- adds of tail of string file necessary file contain string (and returns offset of string in file).
pseudocode:
file.init() // file == "" file.find_or_add("cat") // file == "cat", returns 0 file.find_or_add("able") // file == "catable", returns 3 file.find_or_add("table") // file == "catable", returns 2 file.find_or_add("tables") // file == "catables", returns 2 file.find_or_add("spigot") // file == "catablespigot", returns 7 file.find_or_add("pig") // file == "catablespigot", returns 8
what algorithm/structure should looking @ 'summarise' file in memory, , allow required operations in @ o(log n)?
assume file bigger ram.
language not important, can read pseudocode, c, java, python, javascript , haskell.
the suffix array , suffix tree induce memory problems. (they larger text, if cut them @ depth since need store suffixids in structure).
you create set of files represent ids of prefixes. store length 2 prefixes in different file , keep sorted. file contain on average 1/26^2 of suffix ids. have file aa.txt , ab.txt , fort. entries in file keep sorted (suffix array). each time want lookup use load small file, sorted , check. complexity o(n) (you have load file constant controllable fraction of text) can tune prefactor best performance. in 5 gb file example if use length 2 prefixes there have set of 8 mb sized files, prefixlength 3 around 320 kb , fort..
Comments
Post a Comment