ALGORITHMS TO IMPROVE THE EFFICIENCY OF DATA COMPRESSION AND CACHING ON WIDE-AREA NETWORKS

Amar Mukherjee (PI), Robert Franceschini(Co-PI)
School of Electrical Engineering and Computer Science
University of Central Florida
Orlando, FL 32816

Contact Information
Amar Mukherjee
School of Electrical Engineering and Computer Science
University of Central Florida
Orlando, FL 32816
Phone: 407-823-2763
Fax : 407-823-5419
Email: amar@cs.ucf.edu
Personal Home Page

WWW PAGE

Project Home Page

List of Supported Students and Staff

Raja Iqbal, Graduate Research Assistant
Nan Zhang, Graduate Research Assistant

Project Award Information

Keywords

data compression, lossless text compression, bandwidth-limited networks, archival storage

Project Summary

The goal of this research project is to develop new lossless text compression algorithms and software tools that incorporate compression into the MIME/HTML
standards. The approach consists of encoding the text to exploit the natural redundancy of the language via the use of a dictionary and then compressing the
encoded text with a pre-existing compression algorithm. The encoding scheme depends on the specific characteristics of the compression algorithm; the Bzip2
algorithm, based on the Burrows-Wheeler transform, is used as the main backend compressor. A basic understanding of the interaction between the
encoding schemes and the compression algorithms is being developed, and the performance of the algorithms is being measured with respect to both
compression and communication metrics. Infrastructure tools are being developed that use dynamic caching of dictionaries to embed compression into the
MIME/HTML standards. The research aims to influence the future of information technology by improving data delivery systems in settings where communication
bandwidth is at a premium and archival storage is increasingly costly. The new lossless text compression algorithms are expected to improve compression
ratio by 5 to 10% over the best known pre-existing algorithms, which, together with routine use of compression in existing standards, might translate into a
reduction of more than 50% of the text traffic on the Internet. The experimental research is linked to educational goals via rapid dissemination of results
through reports, conference and journal papers, doctoral dissertations and master's theses, and via the transfer of research knowledge into the graduate
curriculum. Software tools developed under this grant will be shared via a web site.
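
To make the approach concrete, the following is a minimal sketch of dictionary-based pre-processing in front of a bzip2 back end, using Python's standard bz2 module. The word list, the "*index" token format, and the function names are illustrative assumptions, not the project's actual encoding scheme.

    import bz2

    # Hypothetical shared dictionary: encoder and decoder must hold the
    # same word list in the same order.
    DICTIONARY = ["the", "of", "and", "to", "in", "is", "that", "for"]
    WORD_TO_INDEX = {w: i for i, w in enumerate(DICTIONARY)}

    def encode(text):
        """Replace dictionary words with short index tokens ("the" -> "*0")."""
        return " ".join("*%d" % WORD_TO_INDEX[w] if w in WORD_TO_INDEX else w
                        for w in text.split())

    def decode(encoded):
        """Invert encode() using the shared dictionary."""
        return " ".join(DICTIONARY[int(t[1:])] if t.startswith("*") else t
                        for t in encoded.split())

    text = "the performance of the algorithms is measured for that corpus"
    assert decode(encode(text)) == text

    # The encoded text is then handed to the backend compressor.  (On a
    # string this short the bzip2 header dominates; gains show up only
    # on real corpora.)
    print(len(bz2.compress(text.encode())),
          len(bz2.compress(encode(text).encode())))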

Goals, Objectives, and Targeted Activities

First Year

  1. Development of new encoding schemes and compression algorithms, and measurement of their performance against existing algorithms, including the Bzip2 and PPM* algorithms (a measurement sketch follows this list). In making these measurements, we plan to separate the benefits of using a pre-existing algorithm from the benefits specific to our proposed new algorithm. We expect on the order of a 5 to 10% improvement in compression ratio over the Bzip2 and PPM algorithms, which are the best algorithms known at present.
  2. Begin theoretical studies toward a basic understanding of the proposed new encoding scheme for Bzip2 and its possible derivatives, and extend the algorithm, where possible, to the contexts of more than one character offered by the Burrows-Wheeler transform. We expect to discover the relationship of our specific encoding scheme to the established mathematical foundation of the Burrows-Wheeler transform in terms of contextual dependence.
  3. Work with graduate students to write papers for possible presentation at conferences and publication in journals, conduct regular research group discussions, and develop a new course on the impact of compression algorithms on multimedia technology.
  4. Write and submit the annual progress report.
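
As noted in item 1, the measurements compare a preprocessed pipeline against plain Bzip2. The sketch below shows the style of such a measurement, reporting bits per character; the preprocess argument is a placeholder for whichever encoding scheme is under study, and the corpus file name is only an example.

    import bz2

    def bits_per_char(original: bytes, compressed: bytes) -> float:
        """Standard compression metric: output bits per input character."""
        return 8.0 * len(compressed) / len(original)

    def compare(corpus: bytes, preprocess):
        baseline = bits_per_char(corpus, bz2.compress(corpus))
        piped = bits_per_char(corpus, bz2.compress(preprocess(corpus)))
        improvement = 100.0 * (baseline - piped) / baseline
        return baseline, piped, improvement

    # With the identity "preprocessor" both numbers coincide; a real
    # encoding scheme would be plugged in here instead.
    corpus = open("book1", "rb").read()   # e.g. a Calgary corpus file
    print(compare(corpus, lambda data: data))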


Second Year

  1. Continue to work on developing a basic understanding of the proposed new compression algorithms. At this stage, we expect to have written papers proposing a general theory of dictionary-based compression and of encoding schemes for the Bzip2 method that include more context information than the BWT algorithm provides.
  2. Infrastructure development: develop tools to include compression in MIME/HTML, to automate dynamic caching, and to interact with web browsers, e-mail programs, and authoring tools (a server-side sketch follows this list). Inclusion of compression in existing standards may reduce text traffic by over 50%, which will have a significant impact on performance. The performance metrics will include not only the raw compression ratio but also communication time (viz., download and upload times, proxy server time, etc.) and bandwidth utilization (especially in the context of low-bandwidth lines, such as those between Internet Service Providers and their clients). We expect to complete 50% of these tools during Year 2.
  3. Continue to work with graduate students to write papers for presentation at conferences and publication in journals, and conduct research group discussions on multimedia compression technology.
  4. Write and submit the annual progress report.
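
As referenced in item 2, one way such tools could plug into existing standards is HTTP content negotiation. The sketch below serves bzip2-compressed text when a client advertises support for it. Note that mainstream browsers negotiate only gzip and deflate, so advertising "bzip2" as a content coding is an assumption made for illustration.

    import bz2
    from http.server import BaseHTTPRequestHandler, HTTPServer

    PAGE = ("<html><body>" + "highly redundant text " * 200 +
            "</body></html>").encode()

    class CompressingHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Standard HTTP content negotiation: honor the client's
            # advertised codings, fall back to the identity coding.
            accept = self.headers.get("Accept-Encoding", "")
            if "bzip2" in accept:
                body, coding = bz2.compress(PAGE), "bzip2"
            else:
                body, coding = PAGE, "identity"
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Encoding", coding)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("localhost", 8080), CompressingHandler).serve_forever()
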
Indication of Success

We have nearly accomplished the first goal of the first year. The results are posted on this web page (Project Results Page). We have discovered two new modeling schemes for pre-processing the input text and obtained uniformly better results on every text corpus we tried. The improvement beats the PPM* and BWT algorithms in their native forms and shows 3% to 30% better compression ratios over all of the compression techniques tested (viz., Huffman, arithmetic, LZW, DMC, PPM, PPM*, BWT).

We have started work on the second objective, which is to understand mathematically and theoretically why the modeling schemes developed under
our research project perform better than all of the other algorithms.
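
Since this theoretical work centers on the contextual structure exposed by the Burrows-Wheeler transform, a naive version of the transform is sketched below for reference (real implementations use suffix arrays rather than explicit rotations). Sorting the rotations places symbols that precede similar contexts next to each other, which is what the backend compressor exploits.

    def bwt(s: str, sentinel: str = "$") -> str:
        """Naive forward Burrows-Wheeler transform (illustration only).

        The sentinel must not occur in the input; it marks the end of
        the string so that the transform is invertible.
        """
        s += sentinel
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(r[-1] for r in rotations)

    # Characters preceding similar contexts end up adjacent, producing
    # runs that are easy to compress:
    print(bwt("banana"))   # -> "annb$aa"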

Project Impact

We expect that our research will impact the future of information technology by improving data delivery systems in settings where communication bandwidth
is at a premium and archival storage is increasingly costly. We expect to develop new lossless text compression algorithms with a compression ratio 5
to 10% better than the best known pre-existing algorithms, which, together with routine use of compression in existing standards, might translate into a
reduction of more than 50% of the text traffic on the Internet. We expect to develop software tools to include compression in the MIME/HTML standards, to
automate dynamic caching, and to interact with web browsers, e-mail programs, and authoring tools.

We have started (in the Spring 2000 semester) a new graduate-level course entitled "Multimedia Compression on the Internet" (Course Web Page), with 14 students enrolled. This topic has grown directly out of the research we have been conducting on data compression over the last couple of years. Lecture topics have included both text and image compression, including topics from the research under the current NSF grant.

Project References

This work was started several years ago; two early papers that established this work are the following:

  1. R. Franceschini and A. Mukherjee. "Data Compression Using Encrypted Text", Proceedings of the Third Forum on Research and Technology, Advances in Digital Libraries (ADL96), May 13-15, 1996, pp. 130-138.
  2. H. Kruse and A. Mukherjee. "Data Compression Using Text Encryption", Proceedings of the Data Compression Conference, IEEE Computer Society Press, 1997, p. 447.
A conference paper and a journal paper based on the current work are in preparation.

Area Background

In the last decade of the 20th century, we have experienced an unprecedented increase in the amount of digital data transmitted across public networks,
in particular across the Internet. Recent standards such as HTTP, HTML, and common image formats, together with the World Wide Web, have provided free
access to a vast amount of information available from public servers.

With the continuously increasing use of the Internet, the efficient use of available resources, in particular hardware resources, has become a primary
concern. One way to improve performance is to compress and cache data: ensure that data is not sent more often than necessary, that as little data as possible is sent in response to each request, and that all data transmitted across the network is compressed during transmission. Together, these
measures increase the efficiency with which the existing infrastructure is utilized.
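
The caching half of this idea can be illustrated with standard HTTP/1.1 machinery: a cached copy is revalidated with a conditional request, nothing is retransmitted when the resource is unchanged, and any transfer that does occur is requested in compressed form. The sketch below uses Python's standard library; the in-memory cache is illustrative.

    import gzip
    import urllib.error
    import urllib.request

    cache = {}  # url -> (etag, decompressed body)

    def fetch(url: str) -> bytes:
        req = urllib.request.Request(url)
        req.add_header("Accept-Encoding", "gzip")       # compressed transfer
        if url in cache:
            req.add_header("If-None-Match", cache[url][0])  # conditional GET
        try:
            resp = urllib.request.urlopen(req)
        except urllib.error.HTTPError as err:
            if err.code == 304:         # Not Modified: nothing retransmitted
                return cache[url][1]
            raise
        body = resp.read()
        if resp.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
        etag = resp.headers.get("ETag")
        if etag:
            cache[url] = (etag, body)
        return body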

This project is concerned with integrating new data compression and caching techniques to improve the efficiency of the bandwidth-limited Internet.
We propose new algorithms that preprocess text in order to improve the compression ratio of textual documents, in particular online documents such as pages on the World Wide Web. We are also developing similar techniques for images that do not change over time. We describe how our algorithms can be used efficiently to compress documents consisting of large texts and static images for transmission across public data networks such as the Internet, using existing standards such as HTTP and HTML. The same compression algorithms can also be used to make more effective use of archival storage.

Our approach integrates the two basic ideas of compression and caching into a single mechanism: data is automatically cached whenever possible,
avoiding retransmission altogether, and when data must be transmitted, it is compressed efficiently enough to surpass the compression algorithms
in wide use today.
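
One way the two mechanisms might combine is to treat the dictionary itself as a cacheable object: it is transferred at most once per version, and every subsequent document travels only in encoded and compressed form. The following sketch invents a minimal message layout for illustration; it is not the project's actual protocol.

    import bz2

    DICT_V1 = ["the", "of", "compression", "network"]   # shared word list

    def server_send(text: str):
        """Server side: dictionary-encode, then bzip2-compress."""
        index = {w: i for i, w in enumerate(DICT_V1)}
        tokens = ["*%d" % index[w] if w in index else w for w in text.split()]
        return "v1", bz2.compress(" ".join(tokens).encode())

    dictionary_cache = {}   # client side: version tag -> word list

    def client_receive(tag: str, payload: bytes) -> str:
        """Client side: fetch the dictionary at most once, decode locally."""
        if tag not in dictionary_cache:
            dictionary_cache[tag] = DICT_V1   # stands in for a one-time download
        words = dictionary_cache[tag]
        return " ".join(words[int(t[1:])] if t.startswith("*") else t
                        for t in bz2.decompress(payload).decode().split())

    print(client_receive(*server_send("the compression of the network")))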

Area References

  1. I.H. Witten, A. Moffat, and T.C. Bell. Managing Gigabytes, 2nd ed., Morgan Kaufmann, 1999. (Managing Gigabytes Web Site)
  2. K. Sayood. Introduction to Data Compression, 2nd ed., Morgan Kaufmann, 2000. (Introduction to Data Compression Web Site)


Potential Related Projects

Another area of active interest within our research group at UCF is image and video compression using wavelets.

