Algorithms
to Improve the Efficiency of Data Compression and Caching on Wide-Area Networks
Amar Mukherjee (Principal
Investigator)
Department
of Computer Science
Amar
Mukherjee
School
of Electrical Engineering and Computer Science,
University
of Central Florida,
Orlando,
FL-32816.
Phone:
(407) 823-2763
Fax:
(407) 823-5419
Email:
amar@cs.ucf.edu
http://vlsi.cs.ucf.edu/director.html
http://vlsi.cs.ucf.edu/nsf.html
Nan
Zhang, Graduate Research Assistant
Fauzia
Salim Awan, Graduate Research Assistant
Nitin
Jeevan Motgi, Graduate Research Assistant
Tao
Tao, Graduate Research Assistant
· Ïh
Award
number: IIS-9977336
· Ïh
Duration:
Period of performance of entire project, 10/01/1999-9/30/2002
· Ïh
Current
Reporting Period: 10/01/2001-4/15/2002
· Ïh
Title:
Algorithms To Improve the Efficiency of Data Compression and Caching on
Wide-Area Networks.
Data
compression, text transformation. ÏÈ
dictionary , ÏÈ lossless text
compression, bandwidth, Star encoding, LIPT, DBT, bzip2, PPMD, ÏÈ GZIP, compressed domain pattern matching.
The objective of this research project is to develop new lossless text compression algorithms and software tools to incorporate compression for text transmissions with efficient utilization of communication bandwidth and for archival storage. The approach consists of exploiting the natural redundancy of language by encoding text into an intermediate form that increases the context for compression. The encoding scheme uses dictionaries to substitute words in text with transformed words. Theoretical understanding of the interaction of encoding schemes and compression algorithms like bzip2 and PPM is being developed. Algorithm performance is measured in terms of compression metrics such as compression ratio, encoding and decoding times and transmission metrics such as available bandwidth, traffic congestion and server load. Dictionary management protocol for managing dictionaries used by our compression algorithms is also being developed. The experimental research is linked to educational goals via rapid dissemination of results via reports, conference and journal papers, doctoral dissertation and master thesis and transferring the research knowledge into the graduate curriculum. Software tools developed under this grant will be shared via a website.
The major purpose of this project is to develop lossless text compression algorithms that can be used in data delivery systems for efficient utilization of communication bandwidth as well as archival storage. The specific objectives are to develop new text compression algorithms along with basic understanding of the interaction of encoding schemes and compression algorithms, measurement of performance of the algorithms taking into account both compression and communication metrics and software tools to incorporate compression in text transmission over the Internet and for archival storage.
ÏÈFirst
Year: During this period, we
developed several transformations for pre-processing the text to make it more
compressible by existing algorithms. We proposed three transformations called
Star Encoding where a word is replaced by ‘*’ characters and at most two
characters of the original word, and three variations of Star encoding called
Fixed Context Length Preserving Transformation (LPT), Fixed Context Reverse LPT
(RLPT) and shortened-context LPT (SCLPT) All of these transforms improve the
compression performance and uniformly beat almost all of the best of the
available compression algorithms over an extensive text corpus. Along with
compression ratios, we also made measurements on performance of these
algorithms in terms of encoding and decoding time and storage overhead.
Second
Year : We
developed a new text transformation called LIPT (Length Index Preserving
Transformation). In LIPT, the length
of the input word and the offset of the words in the dictionary are denoted
with letters. LIPT achieves some compression at the preprocessing stage as well
and retains enough context and redundancy for the compression algorithms to
give better results. During this
period we also developed infrastructure and tools to integrate the new text
compression into Web servers, Mail Servers and News servers. Corresponding
clients for specific applications were created as part of tool development. All
of this resulted in making bandwidth utilization more efficient and reducing
the time to transfer text by about 60% on average. We wrote several papers for
presentation in conferences. We conducted group discussions and wrote annual
progress reports.
Third
Year: ÏÈSeveral extensions of
the LIPT transform called NIT and LIT and a dissertation has been completed
incorporating these results. We also developed a new text transformation called
DBT (Dynamic Block Transformation) based on LZW that dynamically output the
offset of a word in a dictionary whose initial size is only 5 to 7 thousand
most frequently used words (shared between the sender and receiver). The
dictionary grows dynamically. If a word is not present in the dictionary it is passed without
transformation to the output and an entry of the word is made at the end of the
dictionary, so that future occurrence of the same word will be transformed. If
a word is encountered in the text, it moves to the first address of the
dictionary.The transformation consists of generating an address or offset of
the word in the dictionary expressed in a variable length text string in
base-26 (since there are 26 lower case symbols in English alphabet) number
system. The scheme also handles special characters, capitals, punctuation marks
and end of line marker in a way that minimizes the use of special symbols and
enhances context for compressibility. We incorporate this idea into a program
called M5zip which also uses distance coding (DC) and LZ algorithm to encode
the transformed address sequence.
We also worked on compressed domain pattern matching
problem based on sorted context and laid the foundation of a future research
proposal in this field. The work was carried on with collaboration with an
international team. Two papers based on our work have been presented in an
international conference.
We
have discovered a new modeling scheme LIPT (Length Index Preserving Transform)
for pre-processing the input text. This scheme is more efficient in providing
faster and better compression than earlier schemes of Star encoding, LPT, RLPT
and SCLPT. This scheme uniformly obtains better result in all text corpuses
that we tried (reduction in file size varies between 1 to 20% using new
schemes). The average reduction in filesize achieved by LIPT over the corpus is
9.47%. LIPT+bzip2 outperforms original bzip2 by 5.3%, LIPT+PPMD over PPMD by
4.5% and LIPT+GZIP over gzip by 6.7%. We also compare our method with
Word-based Huffman; our method achieves average bpc of 2.169 over the corpus as
compared to 2.506 achieved by using Word-Huffman, an improvement of 13.45%.
Transmission time improvement for the transfer of corpus is 1.97% with
LIPT+gzip2 over original gzip2, 5.91% with LIPT+bzip2 over bzip2. Transmission
using LIPT+bzip2 is 42.90% faster than simple gzip used as current standard for
compression.Our results with DBT show that the overall average BPC for
the Canterbury corpus using bzip2 is 2.36 bpc, and using bzip2 with DBT is 2.21
bpc, a 6.35% improvement. M5Zip (DC and DBT) achieves bpc of 2.09, an 11.4%
improvement in compression over bzip2. Standard gzip without DBT gives 2.69 bpc
while with DBT gives 2.46 bpc, an 8.5% improvement in compression. Results are
comparable with LIPT. The improvement comes with storage overhead whose
amortized cost is negligible. This overhead is incurred in sharing a common
dictionary between the two parties who are exchanging compressed files.
Dictionary size is drastically reduced to around 43KB in DBT from ~1MB in
LIPT (95.70% decrease in dictionary size). Dr. Mark Nelson, the Editor of Dr.
Dobb’s journal is writing a feature ÏÈ
article about Star Encoding in an upcoming ( May 2002) issue of the
magazine. Star encoding was the first in a series of transformation that we
discovered for data compression. The publication of an unsolicited feature
article in a popular magazine by a well known author and a researcher is an
indication of acceptance of our work by the data compression community. ÏÈ Our work on compressed domain pattern
matching using sorted context has been reported in two publications in the Data
Compression Conference ( Snowbird, Utah, April 2-5, 2002).
This
project will have impact on data delivery systems such as Web servers, Mail
servers, and News servers where transferring text data is primary concern. The
algorithms will also be useful in storing massive amounts of archival text data
where 10 to 20% improvement in compression ratio will translate into thousands
of gigabytes of saved storage. A major portion of text data can be compressed
and transmitted resulting in efficient utilization of bandwidth of the
Internet. Currently, two ÏÈ Ph. D. student
(Nan Zhang, Tao Tao) and three ÏÈ Masters
Students (Nitin Motgi, Vikas Tambda and Jaebok Park) are working on the
project. Prof. Mukherjee has been teaching a graduate level course CAP 5015 - Multimedia
Compression on the Internet (http://www.cs.ucf.edu/courses/cap5015/). The course material includes
both text and image compression, including content from research supported by
current NSF grant. The course is now incorporated into the regular graduate
courses in the Computer Science curriculum.
1.
E.
Franceschini and A. Mukherjee. “Data Compression Using Encrypted Text”
Proceedings of the Third Forum on
Research and Technology, Advances in Digital Libraries, ADL96, May 13-15
1996, PP 130-138.
2.
H.
Kruse and A. Mukherjee. “Data Compression Using Text Encryption”, Proceedings
of the Data Compression Conference,
1997, IEEE Computer Society Press, pp. 447.
3.
H.
Kruse and A. Mukherjee. Preprocessing Text to improve Compression Ratios.
Proceedings of Data Compression
Conference, 1998, IEEE Computer Society Press 1997, pp. 556.
We have submitted a journal paper and seven
conference papers of which three have been accepted as of January 2001. Copies
of these papers will be made available via our project website and also by hard
copy.
1.
R.
Franceschini, H. Kruse, N. Zhang, R. Iqbal and A. Mukherjee, “Lossless,
Reversible Transformations that Improve Text Compression Ratios”, submitted to
IEEE Transactions on Multimedia Systems (June 2000).
2.
F.
Awan, and A. Mukherjee, “ LIPT: A Lossless Text Transform to Improve
Compression", Proceedings of International
Conference on Information and Theory: Coding and Computing, IEEE Computer
Society, Las Vegas Nevada, April 2001.
3.
N.
Motgi and A. Mukherjee, “Network Conscious Text Compression Systems (NCTCSys)”,
Proceedings of International Conference
on Information and Theory: Coding and Computing, IEEE Computer Society, Las
Vegas Nevada, April, 2001.
4.
F.
Awan, Nan Zhang N. Motgi, R. Iqbal and A. Mukherjee, “LIPT: A Reversible
Lossless Text Transform to Improve Compression Performance”, Proceedings of Data Compression Conference,
Snowbird, Utah, March, 2001.
5.
F.
Awan, Nan Zhang, N. Motgi and A. Mukherjee, “Data Compression: A key for
Digital Archival Storage and Transmission” submitted for Joint Conference on Digital Libraries (JCDL) sponsored by Association for Computing Machinery, Washington
D.C., June 24-28, 2001.
6.
F. Awan, R. Iqbal and A.
Mukherjee, “Length Index Preserving Transform: A Lossless Reversible Text
Preprocessor for Bzip2 and PPM”, submitted to IEEE International Symposium on Information Theory, Washington, D.C.
2001.
7.
Amar
Mukherjee and Fauzia Awan, “Text Compresion”, ( A chapter in a book entitled “
Data Compression Handbook” edited by Khalid Sayood, Academic Press, publication
date July, 2002).
8.
M.
Powell, Tim Bell, Amar Mukherjee and Don Adjeroh, ÏÈ “ Searching BWT Compressed Text with the Boyer-Moore Algorithm
and Binary Search”, Proc. Data Compression Conference, Snowbird, Utah, April
2-4, 2002, pp.112-121.
9.
Don
Adjeroh, Amar Mukherjee, Tim Bell , Matt Powell and Nan Zhang, Pattern,
“Matching in BWT Transformed ÏÈ Text”,
Proc. Data Compression Conference, Snowbird, Utah, April 2-4, 2002, p. 445 (
Poster paper).
10.
Fauzia
S. Awan, “Lossless Reversible Text Transforms”, M.S. thesis, University of
Central Florida, Orlando, Florida. Summer, 2001.
In the last decade, we have seen an unprecendent explosion of text information transfer through Emails, Web Browsing, and digital library and information retrieval systems. It is estimated that this growth will be 100% every year. In all of this text data competes for 45% of the total Internet traffic due to downloading of web pages, movements of emails, news groups, forums etc. With continuous increasing use of the Internet the efficient use of available resources, in particular hardware resources, has been a primary concern. One way of improving performance is by compressing text data faster and better using better compression methods and to ensure that as little data as possible is sent in response to the client’s request. We propose to integrate new compression schemes into a network conscious system which is capable of sensing the traffic on the current server on which these algorithms are hosted, and make an appropriate decision on what method should be used to compress the information before transmitting it.
ÏÈ
1.
I.H.
Witten, A. Moffat, and T.C. Bell. Managing Gigabytes. 2nd Edition,
Morgan Kaufmann Publishers, 1999.
2.
D.
Salomon,”Data Compression”, Springer Verlag, 2nd edition, 2000.
3.
K.
Sayood. Introduction to Data Compression. 2nd Ed. Morgan Kaufmann,
2000.
4.
Using
Compression on Webservers IIS 5.0 http://www.microsoft.com/TechNet/iis/httpcomp.asp
5.
Compression
for HTML Streams http://www.w3.org/Protocols/HTTP/Performance/Pipeline.html
A research project on hardware implementation of the
BWT compression algorithm on FPGA has been completed in collaboration with a
research team in Germany (Technical University of Darmstadt). The work is being
continued to develop a complete hardware ÏÈ
architecture for the bzip2 compression algorithm using the backend MTF
(Move-to-front) encoding and Huffman coding. A research proposal is under
consideration for possible submission to funding agencies.