Algorithms
to Improve the Efficiency of Data Compression and Caching on Wide-Area Networks
Amar Mukherjee
(Principal Investigator)
Department of Computer Science
School of Electrical Engineering and Computer Science,
University of Central Florida,
Orlando, FL-32816.
Phone: (407) 823-2763
Fax: (407) 823-5419
Email: amar@cs.ucf.edu
http://vlsi.cs.ucf.edu/director.html
http://vlsi.cs.ucf.edu/IDMStar2003rept.html
Nan Zhang, Graduate Research Assistant
Weifeng Sun, Graduate Research Assistant
Ravi Vijaya Satya, Graduate Research Assistant
· Duration: Period of performance of entire project, 10/01/1999-9/30/2002
· No-cost extension : 12/30/2003
· Current Reporting Period: 6/1/2002-5/30/2003
Data compression, lossless text compression, bandwidth, bzip2, PPMD, LIPT, GZIP, M5Zip, StarZip, text transformation.
The goal of this research project is to develop new lossless text compression algorithms and software tools to incorporate compression for archival storage and transmission over the Internet. The approach consists of pre-processing the text to exploit the natural redundancyof English language to obtain an intermediate transformed form via the use of a dictionary and then compressing it using existing compression algorithms. Several classical compression algorithms such as Huffman, arithmetic, LZ-family (gzip and compress) as well as some of the recent algorithms such as Bzip2, PPM family, DMC, YBS, DC, RK, PPMonstr and recent versions of Bzip2 are used as the backend compression algorithms. The performance of our transforms in combination with these algorithms are compared with the original set of algorithms, taking into account both compression, computation and storage overhead. Information theoretic explanation of experimental results are given. The impact of the research on the future of information technology is to develop data delivery systems with efficient utilization of communication bandwidth and conservation of archival storage. We also develop infrastructure software for rapid delivery of compressed data over the Internet and an online compression utility website as a test bench for comparing various kinds of compression algorithms. The site (vlsi.cs.ucf.edu) will be linked to a very well known compression website which contains the Canterbury and Calgary text corpus. The experimental research is linked to educational goals by rapid dissemination of results via reports, conference and journal papers and doctoral dissertation and master's thesis, and transferring the research knowledge into the graduate curriculum via teaching of a graduate level course on data compression.
Objectives and Targeted Activities
The basic philosophy of our compression
algorithm is to transform the text into some intermediate form, which can be
compressed with better efficiency. The transformation is designed to exploit
the natural redundancy of the language. We have developed a class of such
transformations each giving better compression performance over the previous
ones and all of them giving better compression over most of the current and
classical compression algorithms (viz. Huffman, Arithmetic and Gzip (based on
LZ77), Bzip2 (based on Burrows Wheeler Transform), the class of PPM (Partial
Predicate Match) algorithms (such as PPMD), RK, DC, YBS and PPMonstr). We also
measured the execution times needed to produce the pre-processing and its
impact on the total execution time. We developed several transforms ( Star(*)
and LPT) and two variations of LPT called RLPT and SCLPT. We developed four new
transforms called LIPT, ILPT, LIT and NIT, which produce better results in
terms of both compression ratio and execution times. The algorithms use a fixed
amount of storage overhead in the form of a word dictionary for the particular
corpus of interest and must be shared by the sender and receiver of the
compressed files. Typical size of dictionary for the English language is about
1 MB and can be downloaded once along with application programs. If the
compression algorithms are going to be used over and over again, which is true
in all practical applications, the amortized storage overhead is negligibly
small. We also develop efficient data structures to expedite access to the
dictionaries and propose memory management techniques using caching for use in
the context of the Internet technologies. Realizing that certain on-line
algorithms might prefer not to use a pre-assigned dictionary, we have been
developing new algorithms to obtain the transforms dynamically with no
dictionary, with small dictionaries (7947 words and 10000 words) and studying
the effect of the size of the dictionaries on compression performance. We call
this family of algorithms M5zip.
During this reporting period, we
developed a new transform engine StarNT for a multi-corpora text compression
system. StarNT achieves a superior compression ratio than almost all the other
recent efforts based on BWT and PPM. StarNT is a dictionary-based fast lossless
text transform. The main idea is to recode each English word with a
representation of no more than three symbols. This transform not only maintains
most of the original context information at the word level, but also provides
some kind of 'artificial' but strong context. The transform also exploits the
distribution of lengths of the words and frequency to reduce the size of the
transformed text which is provided to a backend compressor. Ternary search tree
is used to store the transform dictionary in the transform encoder. This data
structure provides a very fast transform encoding with a low storage overhead.
Another novel idea of StarNT is to treat the transformed codewords asthe offset
of words in the transform dictionary. Thus the time complexity of O(1) for
searching a word in the dictionary is achieved in the transform decoder.
The transform encoder and transform
decoder share the same dictionary, which is prepared off-line according to the
following rules: 1) Most frequently used words (i.e. 312 words) are listed in
the beginning of the dictionary according to their frequencies. 2) Other words
are sorted according to their lengths and frequencies. Words with longer
lengths are stored after words with shorter lengths. If two words have the same
length, word with higher frequency of occurrence is listed after word with
lower frequency. 3) To achieve better compression performance for backend data
compressor, only letters [a..zA..Z] are used to represent the codeword.
One other angle of study is to adapt
dynamically to domain-specific corpus (viz. biological, physics, computer
science, XML documents, html documents). We experimentally measure the
performance of our proposed algorithms and compare with all other algorithms
using three corpuses: Calgary, Canterbury and Gutenberg corpus. Finally, we
develop an information theory based explanation of the performance of our
algorithms.
The major findings for this reporting
period are as follows. We introduce StarNT, a multi-corpora text compression system,
together with its transform engine StarNT. StarNT achieves a superior
compression ratio than almost all the other recent efforts based on BWT and
PPM. StarNT is a dictionary-based fast lossless text transform. The main idea
is to recode each English word with a representation of no more than three
symbols. This transform not only maintains most of the original context
information at the word level, but also provides some kind of 'artificial' but
strong context. The transform also exploits the distribution of lengths of the
words and frequency to reduce the size of the transformed text which is
provided to a backend compressor. Ternary search tree is used to store the
transform dictionary in the transform encoder. This data structure provides a
very fast transform encoding with a low storage overhead. Another novel idea of
StarNT is to treat the transformed codewords as the offset of words in the
transform dictionary. Thus the time complexity of O(1) for searching a word in
the dictionary is achieved in the transform decoder.
The transform encoder and transform
decoder share the same dictionary, which is prepared off-line according to the
following rules: 1) Most frequently used words (i.e. 312 words) are listed in
the beginning of the dictionary according to their frequencies. 2) Other words
are sorted according to their lengths and frequencies. Words with longer
lengths are stored after words with shorter lengths. If two words have the same
length, word with higher frequency of occurrence is listed after word with
lower frequency. 3) To achieve better compression performance for backend data
compressor, only letters [a..zA..Z] are
used to represent the codeword.
Experimental results show that the
average compression time has improved by orders of magnitude compared to our
previous dictionary based transform LIPT and for large files,viz. 400Kbytes or
more, the compression time is no worse than those obtained by bzip2 and gzip,
and is much faster than PPMD. Meanwhile, the overhead in the decompression phase
is negligible. We draw a significant conclusion that bzip2 in conjunction with
this transform is better than both gzip and PPMD both in time complexity and
compression performance.
One of the key features of StarNT compression
system is to develop domain specific dictionaries and provide tools to develop
such dictionaries. Results from five corpora show that StarZip achieves an
average improvement in compression performance (in terms of BPC) of 13% over
bzip2 -9, 19% over gzip -9, and 10% over PPMD. Further details about the StarNT
transform and experimental results can be found in the attached document.
Professor Amar Mukherjee has been invited to give a number of colloquium talks
on text compression at several universities in California namely University of
California at Santa Cruise, Riverside, Santa Barbara, San Diego, and Davis. He
was also invited to give talks at IBM Almaden Research Center at San Jose,
California, and Oregon State University at Corvallis, Oregon. Professor
Mukherjee was also invited to give research seminars on text compression by the
Indian Statistical Institute and the Indian Institute of Technology, Kharagpur.
We have introduced a graduate level course CAP 5015 entitled "Multimedia Compression on the Internet" (http://www.cs.ucf.edu/courses/cap5015/) based on the research we have been conducting on data compression. The course material includes both text and image compression, including content from research supported by current NSF grant.
Book(s)
of other one-time publications(s):
Amar
Mukherjee and Fauzia Awan, "Text Compression", bibl. 0-12-620861-1
Elsvier Science/Academic Press, (2003).
(Ed. Khalid Sayood, "Lossless Compression Handbook Academic
Press,2003)
In the last decade, we have seen an unprecendent explosion of text information transfer through Emails, Web Browsing, and digital library and information retrieval systems. It is estimated that this growth will be 100% every year. In all of this text data competes for 45% of the total Internet traffic due to downloading of web pages, movements of emails, news groups, forums etc. With continuous increasing use of the Internet the efficient use of available resources, in particular hardware resources, has been a primary concern. One way of improving performance is by compressing text data using better compression methods without much degradation of encoding and decoding times and to ensure that as little data as possible is sent in response to the client’s request. We are developing software to integrate new compression schemes into a network conscious system which is capable of sensing the traffic on the current server on which these algorithms are hosted, and make an appropriate decision on what method should be used to compress the information before transmitting it.
1. I.H. Witten, A. Moffat, and T.C. Bell, Managing Gigabytes, 2nd Edition, Morgan Kaufmann Publishers, 1999.
2. D. Salomon, Data Compression, 2nd Edition, Springer Verlag, 2000.
3. K. Sayood, Introduction to Data Compression, 2nd Edition, Morgan Kaufmann, 2000.
4. Using Compression on Webservers IIS 5.0 http://www.microsoft.com/TechNet/iis/httpcomp.asp
5. Compression for HTML Streams http://www.w3.org/Protocols/HTTP/Performance/Pipeline.html
We developed compressed domain pattern matching algorithms based on sorted context and laid the foundation of a future research proposals in this field. The papers are concerned with searching for pattern in the compressed text without or only partial decompression. The sorted context of the BWT transform gives rise to very efficient binary and q-gram based search strategy for performing exact and approximate pattern matching operations. A SMALL ITR proposal to NSF based on this research foundation is currently under review.