Abstract—This scheme considers a text document made up of characters such as letters of the alphabet, punctuation marks, and special characters or symbols. Let the characters that make up the document be $c_1, \ldots, c_n$. Compression takes these characters one at a time. For the character under consideration, first find the position of its last occurrence and the number of digits in that position; then, starting from the beginning of the text file, note every position where the character occurs. Each position found during this search is brought to the same digit length as the position of the last occurrence by padding it with zeroes to the left of its most significant digit, if need be. Concatenate the padded positions of the character and convert the concatenated string into a decimal value. Divide this value successively by 2 until the result is at least one and less than two. Store the resulting quotient $q$ together with the number of divisions carried out, recorded as an index $k$. Decompression reverses these steps: for each character, obtain its quotient $q$, index $k$, and length $l_i$. To recover the decimal value of the concatenated positions, multiply the quotient $q$ by $2^k$. The length $l_i$ of the character is then used to split this value into the positions where the character occurred. This scheme, which is lossless, has a compression ratio that tends to zero when the text file is very large.
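The steps in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: positions are taken to be 1-based (so the first padded position is never all zeroes), the quotient is held as an exact `fractions.Fraction` so that multiplying by $2^k$ recovers the integer without rounding, and the function names `compress_char`, `decompress_char`, `compress`, and `decompress` are hypothetical. The sketch demonstrates only the round trip; it makes no claim about the storage size of $q$.

```python
from fractions import Fraction


def compress_char(text, ch):
    """Return (q, k, li) for one character, or None if it never occurs."""
    # Assumption: positions are 1-based, which keeps the first position nonzero.
    positions = [i for i, c in enumerate(text, start=1) if c == ch]
    if not positions:
        return None
    li = len(str(positions[-1]))  # digit length of the last occurrence
    # Left-pad every position with zeroes to li digits, then concatenate.
    value = int("".join(str(p).zfill(li) for p in positions))
    # Halve until the quotient lies in [1, 2); k counts the divisions.
    q, k = Fraction(value), 0
    while q >= 2:
        q /= 2
        k += 1
    return q, k, li


def decompress_char(q, k, li):
    """Recover the list of positions from (q, k, li)."""
    value = q * 2 ** k  # exact, because q is a Fraction (an assumption)
    assert value.denominator == 1
    digits = str(value.numerator)
    # Restore leading zeroes lost in the integer conversion by padding
    # up to the next multiple of li digits.
    width = -(-len(digits) // li) * li
    digits = digits.zfill(width)
    return [int(digits[i:i + li]) for i in range(0, width, li)]


def compress(text):
    """Map each distinct character to its (q, k, li) triple."""
    return {ch: compress_char(text, ch) for ch in set(text)}


def decompress(table):
    """Rebuild the text by placing each character at its positions."""
    placed = {}
    for ch, (q, k, li) in table.items():
        for p in decompress_char(q, k, li):
            placed[p] = ch
    return "".join(placed[p] for p in sorted(placed))
```

For example, in the text `abracadabra` the character `a` occurs at positions 1, 4, 6, 8, and 11, so $l_i = 2$ and the padded concatenation is `0104060811`. Calling `decompress(compress("abracadabra"))` returns the original string. With ordinary floating-point quotients the round trip would fail once the concatenated value exceeds the float's precision, which is why the sketch uses exact rational arithmetic.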
Index Terms—Compression, compression ratio, decompression, lossless, scheme, text file.
Sunday Eric Adewumi is with the Federal University Lokoja, Nigeria (email:
Cite: Sunday Eric Adewumi, "Character Analysis Scheme for Compressing Text Files," International Journal of Computer Theory and Engineering, vol. 7, no. 5, pp. 362-365, 2015.