Algorithms to Improve the Efficiency of Data Compression and Caching on Wide-Area Networks

 

Project Award Number 9977336

 

Principal Investigator

 

Amar Mukherjee (Principal Investigator)

Department of Computer Science

School of Electrical Engineering and Computer Science,

University of Central Florida,

Orlando, FL-32816.

Phone: (407) 823-2763

Fax: (407) 823-5419

Email: amar@cs.ucf.edu

http://vlsi.cs.ucf.edu/director.html

 

WWW Page

http://vlsi.cs.ucf.edu/IDMStar2003rept.html

 

List of Supported Students

Nan Zhang, Graduate Research Assistant

Weifeng Sun, Graduate Research Assistant

Ravi Vijaya Satya, Graduate Research Assistant

 

Project Award Information

        Duration: Period of performance of the entire project, 10/01/1999-9/30/2002

        No-cost extension: 12/30/2003

        Current Reporting Period: 6/1/2002-5/30/2003

Keywords

Data compression, lossless text compression, bandwidth, bzip2, PPMD, LIPT, GZIP, M5Zip, StarZip, text transformation.

 

Project Summary

 

The goal of this research project is to develop new lossless text compression algorithms and software tools that incorporate compression for archival storage and for transmission over the Internet. The approach consists of pre-processing the text to exploit the natural redundancy of the English language, obtaining an intermediate transformed form via the use of a dictionary, and then compressing it with existing compression algorithms. Several classical compression algorithms such as Huffman, arithmetic coding and the LZ family (gzip and compress), as well as more recent algorithms such as Bzip2 (including its recent versions), the PPM family, DMC, YBS, DC, RK and PPMonstr, are used as the backend compression algorithms. The performance of our transforms in combination with these algorithms is compared with that of the original algorithms, taking into account compression ratio as well as computation time and storage overhead. An information-theoretic explanation of the experimental results is given. The impact of the research on the future of information technology is to enable data delivery systems with efficient utilization of communication bandwidth and conservation of archival storage. We also develop infrastructure software for rapid delivery of compressed data over the Internet and an online compression utility website that serves as a test bench for comparing various kinds of compression algorithms. The site (vlsi.cs.ucf.edu) will be linked to a well-known compression website which contains the Canterbury and Calgary text corpora. The experimental research is linked to educational goals through rapid dissemination of results via reports, conference and journal papers, doctoral dissertations and master's theses, and by transferring the research knowledge into the graduate curriculum via the teaching of a graduate-level course on data compression.

 
Objectives and Targeted Activities
 

The basic philosophy of our compression approach is to transform the text into some intermediate form which can be compressed with better efficiency. The transformation is designed to exploit the natural redundancy of the language. We have developed a class of such transformations, each giving better compression performance than the previous ones, and all of them giving better compression than most current and classical compression algorithms (viz. Huffman, arithmetic coding, Gzip (based on LZ77), Bzip2 (based on the Burrows-Wheeler Transform), the class of PPM (Prediction by Partial Matching) algorithms such as PPMD, RK, DC, YBS and PPMonstr). We also measured the execution time needed for the pre-processing and its impact on total execution time. We developed several transforms (Star (*) and LPT) and two variations of LPT called RLPT and SCLPT. We then developed four new transforms called LIPT, ILPT, LIT and NIT, which produce better results in terms of both compression ratio and execution time. The algorithms use a fixed amount of storage overhead in the form of a word dictionary for the particular corpus of interest, which must be shared by the sender and receiver of the compressed files. The typical size of the dictionary for the English language is about 1 MB, and it can be downloaded once along with the application programs. If the compression algorithms are used over and over again, which is true in all practical applications, the amortized storage overhead is negligibly small. We also develop efficient data structures to expedite access to the dictionaries and propose memory management techniques using caching for use in the context of Internet technologies. Realizing that certain on-line algorithms might prefer not to use a pre-assigned dictionary, we have been developing new algorithms that obtain the transforms dynamically with no dictionary or with small dictionaries (7947 words and 10000 words), and studying the effect of dictionary size on compression performance. We call this family of algorithms M5zip.
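
As a simple illustration of this two-stage approach, the sketch below (a toy, with details assumed rather than taken from our released tools) replaces dictionary words with their codewords and then hands the transformed text to a standard backend compressor, here bzip2 via Python's bz2 module. The real transforms also handle capitalization, punctuation and words that are not in the dictionary.

```python
import bz2

def transform_then_compress(text, dictionary):
    """Toy two-stage pipeline: word-level dictionary transform, then a
    standard backend compressor (bzip2). Words missing from the dictionary
    pass through unchanged in this sketch."""
    transformed = " ".join(dictionary.get(word, word) for word in text.split())
    return bz2.compress(transformed.encode("utf-8"))

def decompress_then_invert(data, reverse_dictionary):
    """Inverse path: bzip2 decompression followed by the reverse transform."""
    text = bz2.decompress(data).decode("utf-8")
    return " ".join(reverse_dictionary.get(tok, tok) for tok in text.split())
```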

During this reporting period, we developed a new transform engine, StarNT, for a multi-corpora text compression system. StarNT achieves a compression ratio superior to that of almost all other recent efforts based on BWT and PPM. StarNT is a dictionary-based, fast, lossless text transform. The main idea is to recode each English word with a representation of no more than three symbols. This transform not only maintains most of the original context information at the word level, but also provides a kind of 'artificial' but strong context. The transform also exploits the distribution of word lengths and frequencies to reduce the size of the transformed text that is provided to the backend compressor. A ternary search tree is used to store the transform dictionary in the transform encoder; this data structure provides very fast transform encoding with low storage overhead. Another novel idea of StarNT is to treat the transformed codewords as offsets of words in the transform dictionary. Thus a time complexity of O(1) per word lookup is achieved in the transform decoder.
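
The offset-to-codeword mapping can be viewed as a base-52 numbering over the letters [a..zA..Z]. The sketch below illustrates the idea only; the exact assignment used by StarNT may differ. It maps a dictionary offset to a codeword of at most three letters and inverts the mapping with simple arithmetic, which is what gives the decoder its O(1) per-word lookup.

```python
LETTERS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"  # 52 symbols
BASE = len(LETTERS)

def offset_to_codeword(offset):
    """Map a dictionary offset to a codeword of 1-3 letters:
    offsets 0..51 get one letter, the next 52*52 get two, the rest three."""
    if offset < BASE:
        return LETTERS[offset]
    offset -= BASE
    if offset < BASE * BASE:
        return LETTERS[offset // BASE] + LETTERS[offset % BASE]
    offset -= BASE * BASE
    return (LETTERS[offset // (BASE * BASE)]
            + LETTERS[(offset // BASE) % BASE]
            + LETTERS[offset % BASE])

def codeword_to_offset(code):
    """Inverse mapping used by the transform decoder."""
    if len(code) == 1:
        return LETTERS.index(code)
    if len(code) == 2:
        return BASE + LETTERS.index(code[0]) * BASE + LETTERS.index(code[1])
    return (BASE + BASE * BASE
            + LETTERS.index(code[0]) * BASE * BASE
            + LETTERS.index(code[1]) * BASE
            + LETTERS.index(code[2]))

# Quick round-trip sanity check over the first few thousand offsets.
assert all(codeword_to_offset(offset_to_codeword(i)) == i for i in range(5000))
```

With at most three letters, such a scheme can address 52 + 52^2 + 52^3 = 143,364 dictionary entries, which is ample for the word dictionaries discussed here.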

The transform encoder and transform decoder share the same dictionary, which is prepared off-line according to the following rules: 1) The most frequently used words (312 words) are listed at the beginning of the dictionary in order of their frequencies. 2) The remaining words are sorted by length and frequency: words with longer lengths are stored after words with shorter lengths, and if two words have the same length, the word with the higher frequency of occurrence is listed after the word with the lower frequency. 3) To achieve better compression performance for the backend compressor, only the letters [a..zA..Z] are used to represent the codewords.
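
A minimal sketch of this preparation step is given below; the function name and interface are ours for illustration and are not the released dictionary-building tool. Here words is the word list and freq maps each word to its corpus frequency.

```python
def build_dictionary(words, freq, n_frequent=312):
    """Order dictionary words according to the rules above:
    1) the n_frequent most frequent words first, in order of frequency;
    2) the remaining words by increasing length, and, for equal lengths,
       the lower-frequency word before the higher-frequency one."""
    by_freq = sorted(words, key=lambda w: freq[w], reverse=True)
    head, rest = by_freq[:n_frequent], by_freq[n_frequent:]
    rest.sort(key=lambda w: (len(w), freq[w]))
    return head + rest
```

Combined with the offset-to-codeword mapping sketched earlier, this ordering assigns the shortest codewords to the words listed first.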

Another angle of study is to adapt dynamically to domain-specific corpora (viz. biological, physics, computer science, XML documents, HTML documents). We experimentally measure the performance of our proposed algorithms and compare them with all other algorithms using three corpora: the Calgary, Canterbury and Gutenberg corpora. Finally, we develop an information-theory-based explanation of the performance of our algorithms.

The major findings for this reporting period center on the StarNT transform and the StarZip compression system built around it, whose design is described above, and on the experimental results summarized below.

Experimental results show that the average compression time has improved by orders of magnitude compared to our previous dictionary-based transform LIPT, and for large files (viz. 400 Kbytes or more) the compression time is no worse than that of bzip2 and gzip and much faster than that of PPMD. Meanwhile, the overhead in the decompression phase is negligible. We draw the significant conclusion that bzip2 in conjunction with this transform is better than both gzip and PPMD in both time complexity and compression performance.

One of the key features of the StarZip compression system is the use of domain-specific dictionaries, together with tools to develop such dictionaries. Results from five corpora show that StarZip achieves an average improvement in compression performance (in terms of BPC, bits per character) of 13% over bzip2 -9, 19% over gzip -9, and 10% over PPMD. Further details about the StarNT transform and the experimental results can be found in the attached document.
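
For reference, BPC here is simply the size of the compressed output in bits divided by the number of characters in the original file; a small helper (ours, purely for illustration) makes the metric explicit. For example, compressing a 1,000,000-character file to 275,000 bytes corresponds to 2.2 BPC.

```python
def bits_per_character(original_size_bytes, compressed_size_bytes):
    """Bits per character (BPC), assuming one byte per character in the
    original text; lower is better."""
    return 8.0 * compressed_size_bytes / original_size_bytes
```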

 

Project Impact

 

Contributions within Discipline: We expect that our research will impact the future of information technology by enabling data delivery systems with efficient utilization of communication bandwidth and archival storage. We have developed new lossless text compression algorithms that improve the compression ratio over the best known existing compression algorithms, which might translate into a 75% reduction in text traffic on the Internet. We have also developed an online compression utility that allows a user to submit any text file and obtain compression statistics for all of the classical and new compression algorithms. The URL for this is: vlsi.cs.ucf.edu.
Contributions to Education and Human Resources: So far six Ph.D. students and three Masters students have participated and contributed in this research project, although not all of them received direct support from the grant. Dr. Robert Franceschini and Mr. Holger Kruse made contributions to the project before it was officially funded by NSF. A Masters student, Ms. Fauzia Awan, made significant contributions and successfully defended her thesis. Another Masters student, Mr. Raja Iqbal, worked on this project for a brief period of time and collaborated with Ms. Awan in her research. Mr. Nitin Motgi worked on the M5Zip project. Currently, four Ph.D. students (Nan Zhang, Ravi Vijaya Satya, Tao Tao and Weifeng Sun) are working on the project. Mr. Tao Tao, who finished his Masters thesis three years ago, has joined our research team. Other members of the M5 Research Group at the School of Electrical Engineering and Computer Science, Dr. Kunal Mukherjee and Mr. Piyush Jamkhandi, made critical comments and observations during the course of this work. The overall effect of these activities is to train graduate students in current research at the forefront of technology.

Contributions to Resources for Science and Technology: In the Spring 2000 semester we taught a new graduate-level course entitled 'CAP5937: Multimedia Compression on the Internet'. The course was taught again in Spring 2001 under a new number, CAP5015, with a new URL: http://www.cs.ucf.edu/courses/cap5015/. Fourteen students enrolled in the Spring 2000 semester, 11 in Spring 2001, and the course was offered again in Fall 2002 with 25 students. This topic has grown directly out of the research on data compression that we have been conducting over the last several years. Lecture topics have included both text and image compression, including topics from the research on the current NSF grant. The course is now being revised for its next offering in Fall 2003. The PI also delivered invited talks on research supported by this grant, and on lossless text compression in general, at universities in the U.S. (University of California at Santa Barbara, San Diego, Riverside and Santa Cruz, and Oregon State University) and abroad (Indian Institute of Technology, Kharagpur, and Indian Statistical Institute, Kolkata, during 2001). The PI also gave a demonstration of his work on data compression and the online compression utility website at the IDM Workshop, 2001, Ft. Worth, Texas (April 29-30), sponsored by NSF.


Professor Amar Mukherjee has been invited to give a number of colloquium talks on text compression at several universities in California, namely the University of California at Santa Cruz, Riverside, Santa Barbara, San Diego, and Davis. He was also invited to give talks at the IBM Almaden Research Center in San Jose, California, and at Oregon State University in Corvallis, Oregon. Professor Mukherjee was also invited to give research seminars on text compression by the Indian Statistical Institute and the Indian Institute of Technology, Kharagpur.

 

We have introduced a graduate-level course, CAP 5015, entitled "Multimedia Compression on the Internet" (http://www.cs.ucf.edu/courses/cap5015/), based on the research we have been conducting on data compression. The course material includes both text and image compression, including content from research supported by the current NSF grant.

 

Publications and Products

1.      F. Awan and Amar Mukherjee, "LIPT: A Lossless Text Transform to Improve Compression", Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC 2001), (2001), p. 452. Published.

2.      N. Motgi and Amar Mukherjee, "Network Conscious Text Compression System (NCTCSys)", Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC 2001), (2001), p. 440. Published.

3.      Fauzia Awan, Nan Zhang, Nitin Motgi, Raja Iqbal and Amar Mukherjee, "LIPT: A Reversible Lossless Text Transform to Improve Compression Performance", Proc. Data Compression Conference, (2001), p. 311. Published.

4.      M. Powell, Tim Bell, Amar Mukherjee and Don Adjeroh, "Searching BWT Compressed Text with the Boyer-Moore Algorithm and Binary Search", Proc. Data Compression Conference (Eds. J. A. Storer and M. Cohn), (2002), p. 112. Published.

5.      D. Adjeroh, A. Mukherjee, T. Bell, M. Powell and N. Zhang, "Pattern Matching in BWT-transformed Text", Proc. Data Compression Conference, (2002), p. 445. Published.

6.      Weifeng Sun, Nan Zhang and Amar Mukherjee, "A Dictionary-Based Multi-corpora Text Compression System", Proceedings of the Data Compression Conference, (2003), p. 448. Published.

7.      Weifeng Sun, Nan Zhang and Amar Mukherjee, "Dictionary-based Fast Text Transform for Text Compression", Proceedings of the International Conference on Information Technology: Coding and Computing, (2003), p. 176. Published.

Book(s) or other one-time publication(s):

Amar Mukherjee and Fauzia Awan, "Text Compression", in Khalid Sayood (Ed.), Lossless Compression Handbook, Academic Press/Elsevier Science, 2003, ISBN 0-12-620861-1.

 

Area Background

In the last decade, we have seen an unprecedented explosion of text information transfer through email, web browsing, and digital library and information retrieval systems. It is estimated that this growth will be 100% every year. Text data accounts for about 45% of the total Internet traffic, due to the downloading of web pages, the movement of email, news groups, forums, etc. With the continuously increasing use of the Internet, the efficient use of available resources, in particular hardware resources, has been a primary concern. One way of improving performance is to compress text data using better compression methods, without much degradation of encoding and decoding times, so as to ensure that as little data as possible is sent in response to the client's request. We are developing software to integrate new compression schemes into a network-conscious system that is capable of sensing the traffic on the server on which these algorithms are hosted and of making an appropriate decision on which method should be used to compress the information before transmitting it.
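
As a rough illustration of such a decision, the routine below trades compression strength against server load; the thresholds and the particular fallback methods are invented for this sketch and are not taken from our NCTCSys implementation.

```python
import bz2
import zlib

def choose_and_compress(payload, server_load):
    """Pick a compression method based on current server load (0.0-1.0):
    spend more CPU on a stronger compressor when the server is idle,
    fall back to cheaper methods (or none) when it is busy."""
    if server_load < 0.5:
        return "bzip2", bz2.compress(payload)
    if server_load < 0.9:
        return "deflate", zlib.compress(payload)
    return "identity", payload
```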

Early references that established this work are as follows:

 

Area References

1.      I.H. Witten, A. Moffat, and T.C. Bell, Managing Gigabytes, 2nd Edition, Morgan Kaufmann Publishers, 1999.

2.      D. Salomon, Data Compression, 2nd Edition, Springer Verlag, 2000.

3.      K. Sayood, Introduction to Data Compression, 2nd Edition, Morgan Kaufmann, 2000.

4.      Using Compression on Webservers IIS 5.0 http://www.microsoft.com/TechNet/iis/httpcomp.asp

5.      Compression for HTML Streams http://www.w3.org/Protocols/HTTP/Performance/Pipeline.html

 

Potential Related Projects

We developed compressed-domain pattern matching algorithms based on sorted contexts and laid the foundation for future research proposals in this field. The papers are concerned with searching for patterns in compressed text with no decompression or only partial decompression. The sorted context of the BWT gives rise to very efficient binary-search and q-gram based search strategies for performing exact and approximate pattern matching operations. A small ITR proposal to NSF based on this research foundation is currently under review.