Amar Mukherjee, Principal Investigator
School of Electrical Engineering and Computer ScienceUniversity of
Central Florida
Orlando, FL 32816
Contact Information
Contact Name of PI:
Amar Mukherjee
School of Electrical Engineering and Computer Science
University of Central Florida
Orlando, FL 32816
Phone: (407) 823-2763
Fax: (407) 823-5419
Email: amar@cs.ucf.edu
WWW PAGE
http://vlsi.cs.ucf.edu/datacomp/nfs/report/nfsreport.html
List of Supported Students
Nan Zhang, Graduate Research Assistant
Raja Iqbal, Graduate Research Assistant
Nitin Motgi, Graduate Research Assistant
Research Collaborators
Dr. Robert Franceschini
Mr. Holger Kruse
Project Award Information
Keywords
Data Compression, Lossless Compression, Text Compression, Wide-area
network, Caching.
Project Summary
The goal of this research project is to develop new lossless text compression
algorithms and software tools to incorporate compression in MIME/HTML standards.
The approach consists of encoding the text to exploit the natural redundancy
of a language via the use of a dictionary and then compressing it using
a pre-existing compression algorithm. The encoding scheme depends on the
specific characteristics of the compression algorithm and the Bzip2 algorithm
based on Burrows-Wheeler transform is used as the main backend compression
algorithm. A basic understanding of the interaction of the encoding schemes
and the compression algorithms is being developed. The performance of the
algorithms is being measured taking into account both compression and communication
metrics. Infrastructure tools are being developed using dynamic caching
of dictionaries to embed compression into MIME/HTML standards. The impact
of the research on the future of information technology is to develop data
delivery systems where communication bandwidth is at a premium and archival
storage is an exponentially costly endeavor. It is expected that the new
lossless text compression algorithms will have 3 to 30% improved compression
ratio over the best known pre-existing compression algorithms which might
translate into a reduction of more than 70% of the text traffic on the
Internet. The experimental research is linked to educational goals via
rapid dissemination of results via reports, conference and journal papers,
doctoral dissertation and master’s thesis, and transferring the research
knowledge into the graduate curriculum. Software tools developed under
this grant will be shared via a web site.
Goals and Objectives
The major goal of the project is to develop data delivery systems where
communication bandwidth is at a premium and archival storage is an exponentially
costly endeavor. Specific objectives for this period were:
Method of Approach
The basic philosophy of our compression algorithm is to transform the
text into some intermediate form, which can be compressed with better efficiency.
The transformation is designed to exploit the natural redundancy of the
language. We have developed a class of such transformations each giving
better compression performance over the previous one and all of them giving
better compression over most of the current and classical compression algorithms
(viz. Bzip2 (based on Burrows –Wheeler Transform), the class of PPM (Partial
Predicate Match) algorithms (such as PPMA, PPMC, PPMD, PPMZ and PPM*),
Gzip (based on LZW algorithms), Arithmetic and Huffman algorithms). We
also measured the execution times needed to produce the pre-processing
and its impact on the total execution time is negligible. The algorithm
has a fixed amount of storage overhead in the form of a word dictionary
for the particular corpus of interest and must be shared by all the users.
Typical size of dictionary for the English language is about 1 MB and can
be downloaded using caching and memory management techniques that are being
developed for use in the context of the Internet technologies. If the compression
algorithms are going to be used over and over again, which is true in all
practical applications, the amortized storage overhead is negligibly small
and is no more than the backend compression algorithm used after the transformation.
Our recommendation is to use the Bzip2 algorithm for backend compression
along with our proposed transformation. The combined effect is to produce
the best average compression ratio of 2.00 BPC (bits per character) over
the Calgary and Canterbury corpus with no appreciable degradation over
the execution time or storage overhead. We also have measured the impact
of using these algorithms on the transmission time and bandwidth over the
Internet for text file transfer using both http and ftp protocols. On the
average we established a 500% improvement of transmission time and 70%
reduction of bandwidth requirements for streamed text data.
The basic idea underlying the transformations that we invented is that
one can replace letters in a word by a special placeholder (*) and retain
at most two characters of the original word. Given an encoding, the
original word can be retrieved from a dictionary that contains a one-to-one
mapping between encoded words and original words. The encoding produces
an abundance of * characters in the transformed text making it the
most frequently occurring character. The transformed text can be compressed
better when all the available compression algorithms are applied except
in one case: the Bzip2 algorithm. For Bzip2, the run-length
encoding step at the beginning destroys the benefit of *-encoding. For
Bzip2, we propose three new transformations called Fixed Context Length
Preserving Transformation (LPT), Fixed Context Reverse LPT (RLPT) and shortened-context
LPT (SCLPT) all of which improve the compression performance and uniformly
beat the best of the available compression algorithms over an extensive
text corpus.
Targeted Activities, Summary of Results and Indication
of Success
Our activities have been directed to achieve results in all three areas
of goals and objectives stated earlier. The major results of our achievements
can be summarized as follows.
1. Performance Measurements of
the Compression Algorithms
We have implemented
the four lossless reversible text compression algorithms (*-encoded, LPT,
RLPT and SCLPT)
as discussed
above and measured their compression performance using a combination of
the Canterbury and Calgary
corpus (See
Table 1) to do our testing and compared them with the existing algorithms.
The results are summarized as
follows.
1. When
compared to the standard compression algorithms, the *-encoded compression
algorithms produce a
compression improvement of as high as 33% over Huffman and arithmetic algorithms,
about 10% over Unix
compress and 9% to 19% over Gzip algorithms and average 1.7% over Bzip2
and 0.3% over PPMD algorithms.
The last two algorithms are known to perform the best compression so far
in the literature. The improvement over
Bzip2 algorithms is not always guaranteed because of the effect that we
explained earlier. Fig. 1 illustrates typical
performance results.

Figure 1. Comparison of BPC on plrabn12.txt
(Paradise Lost) with original file and star converted file using
different compression algorithms.
2. The three new algorithms
that we proposed (LPT, RLPT and SCLPT) produce further improvement uniformly
over
the corpus. LPT has an average improvement of 4.4% on Bzip2 and 1.5% over
PPMD; RLPT has an average
improvement of 4.9% on Bzip2 and 3.4% over PPMD. The SCLPT has an average
improvement of 7.1% on Bzip2
and 3.8% over PPMD. The compression ratios are given in terms of average
BPC (bits per character) over our test
corpus. The results of comparison with Bzip2 and PPMD are shown only. As
can be seen from this table, our results
indicate the best average BPC of 2.147 which beats the best results available
in the literature. See Table 2, Fig.2 and
Fig.3.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Table 2. Average BPC over the corpus with Original, Star-encoded, LPT,
RLPT, and SCLPT using
Bzip2 and PPMD.
2. Execution Time Measurements
In making average timing measurements, we chose to compare with the
Bzip2 algorithm, which so far has given one of the best compression ratios
with lowest execution time. We also compare with the family of PPM algorithms,
which gives the best compression rations but are very slow.
Note the above times are only for encoding and one can afford to
spend more time off line encoding files, particularly if the difference
in execution time becomes negligibly small with increasing file size which
is true in our case. Our initial measurements on decoding times show no
significant differences.
When we compared the average execution times with PPMD, the results are as follows:
Figure 2. Compression BPC over the corpus with different encoding
scheme using Bzip2.
Figure 3. Compression BPC over the corpus with different encoding
schemes using PPMD.
Original:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Star-transform:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
LPT
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
RPT
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
SCLPT
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Table 3. Statistics of compressions in different context order for alice.txt
with PPMD
2. Explanation of Observed Compression Performance
Recent research [A3][A4] shows that both the BWT and PPM algorithms
explore the representation of context in the text either implicitly or
explicitly. The longer and more deterministic the context is, i.e., higher
order context, the higher is the probability estimation of the next predicted
character leading to better compression ratio. All our transforms aim to
create such contexts. Table 3 gives the statistics of the context for the
sample text “alice.txt” using PPMD, as well as our four transforms along
with PPMD. The first column is the order of the context. The second is
the number of input bytes encoded in that order. The last row is the summation
of all the bytes compressed at different context levels, namely, the file
size. The third column is the cumulative percentage of column two. The
fourth is the final number of bits used to encode all the input in that
order. The fifth column shows the cumulative frequency of column four.
The sixth is the corresponding bytes for bits statistics. The seventh is
the compression ration of that order and the last column shows the BPC
in that order.
Table 3 shows that length preserved transforms (*, LPT and RLPT) result
in a higher percentage for high order contexts than the original PPMD algorithm.
The advantage is accumulated in the different orders to gain an overall
improvement. Although SCLPT has a smaller value of high order context compressions,
it is compensated by compression on the deterministic contexts beforehand
that could be in a higher order with long word length. We are continuing
to research the precise relationship between the statistics of different
order context and the overall compression ratio.
3. Compressed Data Delivery via the Internet
Due to the increased growth in the traffic (refer to Table 4
from the NSF Website) there is a need to compress data transferred over
the network. If not, growth would lead to congestion, drowning the Internet
with the amount of data in transition. Web browsing, FTP transfers
and electronic mail generate most of the traffic on internet, which transfer
basic text code as payload (viz., HTML, plain/text part in mails) with
additional payloads of different formats (like attachments, inline images,
word processing documents, etc.). One of the objectives of our project
is to minimize the part of the basic payload that is plain text.
Reduction in text data transfer over the Internet will help in using the
bandwidth for other types of data (like video and audio) transferred on
the Internet. In our research, we incorporated new lossless text compression
algorithms into file transfers that are made over some of the standard
network protocols e.g., FTP, HTTP, POP3 and SMTP. These algorithms will
be implemented as a part of the standard protocols defined at Session-Application
layer over the transport layer in the OSI Model. The compressed file is
transferred over the TCP/IP Network (official protocol for the Internet).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Table 4. Current and Projected transfer over Internet in Gigabits/sec
(The data transfer include normal text, image, video and audio)
We are in the process of testing the above method for different compression/decompression methods, which have performed 3% to 30% better than standard compression algorithms (viz. Huffman, arithmetic, LZW, DMC, PPM*, BWT) under different network conditions and server loads. We will be evaluating several performance metrics like payload transform time (includes transformation and compression), payload transfer time and payload reverse transform time (includes decompression and reverse transformation). During the transfers there are different other parameters that will be evaluated, such as server performance, bandwidth use over high and low bandwidths. These tests will reflect the overall performance in the transfer of files over the Internet. We will be also looking into designing a protocol for managing a dictionary on the clients. This management should incorporate some of the features like dictionary update and propagation, different dictionaries for different languages and should also have versioning capabilities.
As for the initial stages in this approach we have procured all the material needed and have built the infrastructure that is necessary to start. We are developing a real-time compression test facility, which will be capable of testing encoding/compression algorithms for the above-mentioned parameters and for different standard protocols like FTP, HTTP, POP3 and SMTP. This facility will log different parameters mentioned above plus some miscellaneous information, which can be later used to evaluate the performance of the compression algorithms on text transfers over Internet.
The basic test facility has been developed to perform Internet transfers using FTP. The results obtained from this are very promising. The transfers were initiated by normal FTP clients from people on the same network as well as from different networks. These networks are at different distances from the location where the server runs. The compression method is employed on the server and we used the Bzip2 algorithm for compression as a starting point without any of the transforms RPT, LPT or SCLPT. Results of the test runs are shown in Table 5.
As we can see in Table 5, the size of the files transferred is in the
range of 0.3KB to 500KB. Our method always shows improvement in the transfer
even if we include time involved in compressing the data. In one example,
the size of the uncompressed data transferred is 3420.72 Kbytes and time
taken to transfer that amount of data is 458.057 sec. When the same data
is compressed and transferred, total data transferred is 743.097 Kbytes
and time taken is 50.17sec. The time taken to compress 3420.72 Kbytes of
data is 5.971sec (server was assumed to slightly loaded). Hence, 2677.63Kbytes
of data transfer and (458.057 – 50.17 – 5.971) sec = 401.916sec was saved.
If we used any of the transforms mentioned earlier we will spend some amount
of time in applying the transform but will gain in reducing the transmission
time since data is much more compressed
|
|
|
|
|
(Transformation Time, Compression Time) |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Table 5: Payload transformation and transmission using Bzip2 compression algorithm at the server. Three stations were used: LAB located at UCF, FSC stands for Florida SolarEnergy Center, and IND is a place located in India. N= not compressed, C=Compressed.
Project Impact
We expect that our research will impact the future status of information
technology by developing data delivery systems where communication bandwidths
is at a premium and archival storage is an exponentially costly endeavor.
We have developed new lossless text compression algorithms that have 3
to 30% improved compression ratio over the best known pre-existing compression
algorithms which might translate into a reduction of more than 70% text
traffic on the Internet. We are developing software tools to include
compression in standard Internet protocols, and we are planning to automate
dynamic caching interaction with web browsers, e-mail programs and authoring
tools.
We have taught (in the spring 2000 semester) a new course entitled
"Multimedia Compression on the Internet": http://www.cs.ucf.edu/courses/cap5937-01/.
This is a graduate level course with 14 students enrolled in the Spring
2000 semester. This particular topic has grown directly out of the
research that we have been conducting for the last couple of years on data
compression. Lecture topics have included both text and image compression,
including topics from the research on the current NSF grant.
So far three Ph.D. students and five Masters students have participated
and contributed in this research project, but not all of them received
direct support from the grant. Dr. Robert Franceschini and Mr. Holger Kruse
made significant contributions in the project before it was officially
funded by NSF. Currently, one Ph. D. student (Nan Zhang ) and two Masters
Students (Nitin Motgi and Raja Iqbal) are working on the project. Another
Master student (Fauzia Awan) did a term project in the graduate course
“Multimedia Compression on the Internet” directly related to this research
project. Other members of the M5 Research Group at the School of Electrical
Engineering and Computer Science, Dr. Kunal Mukherjee, Tao Tao, and Piyush
Jamkhandi, made critical comments and observation during the course of
this work. The overall effect of these activities is to train graduate
students with the current research on the forefront of technology.
Project References
This work was started several years ago; two early papers that established
this work are the following:
1. R. Franceschini and A. Mukherjee. "Data Compression
Using Encrypted Text", Proceedings of the Third Forum on Research and Technology,
Advances in
Digital Libraries, ADL96, May 13-15
1996, pp 130-138.
2. H. Kruse and A. Mukherjee. "Data Compression Using
Text Encryption", Proceedings of the Data Compression Conference,
1997, IEEE Computer Society
Press, pp. 447.
We have submitted a journal version of a paper [1] and two other papers
[2,3] are under preparation for submissions to conferences in the near
future. Copies of these papers will be made available via our project website
and also by hard copy
1. R. Franceschini, H. Kruse, N. Zhang, R. Iqbal and A. Mukherjee,
“Lossless, Reversible Transformations that Improve Text Compression Ratios”,
submitted to
IEEE Transactions on Multimedia Systems (June 2000).
2. F. Awan, R. Iqbal and A. Mukherjee, “ A New Text Preprocessing Algorithm
for Bzip2 and PPM*” (in preparation).
3. N. Motgi and A. Mukherjee, “ High Speed Text Data Transmission over
Internet Using Compression Algorithm” (in preparation)
Area Background
In the last decade of the 20th century we have experienced an unprecedented
increase in the amount of digital data transmitted across public networks,
in particular across the Internet. Recent standards such as the World Wide
Web, HTTP, HTML and image standards have provided free access to a vast
amount of information available from public servers. With the continuously
increasing use of the Internet the efficient use of available resources,
in particular hardware resources, has become a primary concern. One way
of improving performance is by compressing and caching data, i.e. by ensuring
that data is not sent more often than necessary, that as little data is
sent in response to each request as possible, and that all data transmitted
across networks is compressed during transmission, to increase the efficiency
by which the existing infrastructure is utilized. This proposal is
concerned with integrating new data compression and caching techniques
for improving the efficiency of bandwidth-limited Internet. We propose
new algorithms to preprocess text in order to improve the compression ratio
of textual documents, in particular online documents such as web pages
on the World Wide Web. We also develop similar techniques for images
that do not change with time. We describe how our algorithm can be efficiently
used to aid in the compression of documents consisting of large texts and
static images, for transmission across public data networks such as the
Internet, using existing standards such as HTTP and HTML. The compression
algorithms also can be used to utilize archival storage effectively.
Our approach integrates the two basic ideas of compression and caching
into a single mechanism, i.e., data is automatically cached whenever possible,
avoiding retransmission altogether, and if data has to be transmitted,
it is compressed in a very efficient way, which surpasses the performance
of compression algorithms that are widely used at this time.
Area References
A1. I.H. Witten, A. Moffat, and T.C. Bell. Managing Gigabytes.
2nd Ed, Morgan Kaufmann, 1999. Managing Gigabytes Web Site
A2. K. Sayood. Introduction to Data Compression. 2nd Ed,
Morgan Kaufmann, 2000. Introduction to Data Compression Web Site
A3. J. Cleary, W. Teahan, Unbounded Length Contexts for PPM, The Computer
Journal, Vol. 36, No. 5, 1993. (Also see Proc. Data Compression Conference,
Snowbird, Utah, 1997)
A4. Michelle Effros, PPM Performance with BWT Complexity: A New Method
for Lossless Data Compression, Proc. Data Compression Conference, Snowbird,
Utah, March, 2000
Potential Related Projects
There is an ongoing research project under the supervision of the Principal
Investigator on Wavelet Based Image Compression and Transmission supported
by Intel Corporation. A research proposal on this subject is under preparation
for submission to NSF in the near future.