Algorithms to Improve the Efficiency of Data                                     
Compression and Caching on Wide Area Networks
 

Amar Mukherjee, Principal Investigator
School of Electrical Engineering and Computer ScienceUniversity of Central Florida
Orlando, FL 32816

Contact Information
Contact Name of PI:
Amar Mukherjee
School of Electrical Engineering and Computer Science
University of Central Florida
Orlando, FL 32816
Phone:  (407) 823-2763
Fax:  (407) 823-5419
Email: amar@cs.ucf.edu

WWW PAGE
http://vlsi.cs.ucf.edu/datacomp/nfs/report/nfsreport.html

List of Supported Students
Nan Zhang,  Graduate Research Assistant
Raja Iqbal,   Graduate Research Assistant
Nitin Motgi, Graduate Research Assistant

Research Collaborators
Dr. Robert Franceschini
Mr. Holger Kruse

Project Award Information


Keywords
Data Compression, Lossless Compression, Text Compression, Wide-area network, Caching.
 

Project Summary
The goal of this research project is to develop new lossless text compression algorithms and software tools to incorporate compression in MIME/HTML standards. The approach consists of encoding the text to exploit the natural redundancy of a language via the use of a dictionary and then compressing it using a pre-existing compression algorithm. The encoding scheme depends on the specific characteristics of the compression algorithm and the Bzip2 algorithm based on Burrows-Wheeler transform is used as the main backend compression algorithm. A basic understanding of the interaction of the encoding schemes and the compression algorithms is being developed. The performance of the algorithms is being measured taking into account both compression and communication metrics. Infrastructure tools are being developed using dynamic caching of dictionaries to embed compression into MIME/HTML standards. The impact of the research on the future of information technology is to develop data delivery systems where communication bandwidth is at a premium and archival storage is an exponentially costly endeavor. It is expected that the new lossless text compression algorithms will have 3 to 30% improved compression ratio over the best known pre-existing compression algorithms which might translate into a reduction of more than 70% of the text traffic on the Internet. The experimental research is linked to educational goals via rapid dissemination of results via reports, conference and journal papers, doctoral dissertation and master’s thesis, and transferring the research knowledge into the graduate curriculum. Software tools developed under this grant will be shared via a web site.
 

Goals and Objectives
The major goal of the project is to develop data delivery systems where communication bandwidth is at a premium and archival storage is an exponentially costly endeavor. Specific objectives for this period were:


Method of Approach
The basic philosophy of our compression algorithm is to transform the text into some intermediate form, which can be compressed with better efficiency. The transformation is designed to exploit the natural redundancy of the language. We have developed a class of such transformations each giving better compression performance over the previous one and all of them giving better compression over most of the current and classical compression algorithms (viz. Bzip2 (based on Burrows –Wheeler Transform), the class of PPM (Partial Predicate Match) algorithms (such as PPMA, PPMC, PPMD, PPMZ and PPM*), Gzip (based on LZW algorithms), Arithmetic and Huffman algorithms). We also measured the execution times needed to produce the pre-processing and its impact on the total execution time is negligible.  The algorithm has a fixed amount of storage overhead in the form of a word dictionary for the particular corpus of interest and must be shared by all the users. Typical size of dictionary for the English language is about 1 MB and can be downloaded using caching and memory management techniques that are being developed for use in the context of the Internet technologies. If the compression algorithms are going to be used over and over again, which is true in all practical applications, the amortized storage overhead is negligibly small and is no more than the backend compression algorithm used after the transformation. Our recommendation is to use the Bzip2 algorithm for backend compression along with our proposed transformation. The combined effect is to produce the best average compression ratio of 2.00 BPC (bits per character) over the Calgary and Canterbury corpus with no appreciable degradation over the execution time or storage overhead. We also have measured the impact of using these algorithms on the transmission time and bandwidth over the Internet for text file transfer using both http and ftp protocols. On the average we established a 500% improvement of transmission time and 70% reduction of bandwidth requirements for streamed text data.
The basic idea underlying the transformations that we invented is that one can replace letters in a word by a special placeholder (*) and retain at most two characters of the original word.  Given an encoding, the original word can be retrieved from a dictionary that contains a one-to-one mapping between encoded words and original words.  The encoding produces an abundance of  * characters in the transformed text making it the most frequently occurring character. The transformed text can be compressed better when all the available compression algorithms are applied except in one case:  the Bzip2 algorithm.  For Bzip2, the run-length encoding step at the beginning destroys the benefit of *-encoding. For Bzip2, we propose three new transformations called Fixed Context Length Preserving Transformation (LPT), Fixed Context Reverse LPT (RLPT) and shortened-context LPT (SCLPT) all of which improve the compression performance and uniformly beat the best of the available compression algorithms over an extensive text corpus.

 Targeted Activities, Summary of Results and Indication of Success
Our activities have been directed to achieve results in all three areas of goals and objectives stated earlier. The major results of our achievements can be summarized as follows.
    1. Performance Measurements of  the Compression Algorithms
          We have implemented the four lossless reversible text compression algorithms (*-encoded, LPT, RLPT and SCLPT)
          as discussed above and measured their compression performance using a combination of the Canterbury and Calgary
          corpus (See Table 1) to do our testing and compared them with the existing algorithms. The results are summarized as
          follows.

          1. When compared to the standard compression algorithms, the *-encoded compression algorithms produce a
              compression improvement of as high as 33% over Huffman and arithmetic algorithms, about 10% over Unix
              compress and 9% to 19% over Gzip algorithms and average 1.7% over Bzip2 and 0.3% over PPMD algorithms.
              The last two algorithms are known to perform the best compression so far in the literature. The improvement over
              Bzip2 algorithms is not always guaranteed because of the effect that we explained earlier. Fig. 1 illustrates typical
              performance results.
 
 

     Figure 1. Comparison of BPC on plrabn12.txt (Paradise Lost) with original file and star converted file using
                     different compression algorithms.

        2. The three new algorithms that we proposed (LPT, RLPT and SCLPT) produce further improvement uniformly over
            the corpus. LPT has an average improvement of 4.4% on Bzip2 and 1.5% over PPMD; RLPT has an average
            improvement of 4.9% on Bzip2 and 3.4% over PPMD. The SCLPT has an average improvement of 7.1% on Bzip2
            and 3.8% over PPMD. The compression ratios are given in terms of average BPC (bits per character) over our test
            corpus. The results of comparison with Bzip2 and PPMD are shown only. As can be seen from this table, our results
            indicate the best average BPC of 2.147 which beats the best results available in the literature. See Table 2, Fig.2 and
            Fig.3.
 

Filename
Filesize
1musk10.txt 
1344739
alice29.txt 
152087
anne11.txt 
586960
asyoulik.txt 
125179
bib 
111261
book1 
768771
book2 
610856
crowd13
777028
dracula 
863326
franken 
427990
ivanhoe 
1135308
lcet10.txt 
426754
mobydick 
987597
news 
377109
paper1 
53161
paper2 
82199
paper3 
46526
paper4 
13286
paper5 
11954
paper6 
38105
plrabn12.txt 
481861
twocity 
760697
world95.txt
 2736128
 
                                                        Table 1. Files in the test corpus and their sizes.
 
 
 
Bzip2
PPMD
Original
2.411
2.229
*-encoded
2.377
2.223
LPT
 2.311
 2.195
RLPT
 2.300
 2.155
SCLPT
 2.251
 2.147
 

                           Table 2. Average BPC over the corpus with Original, Star-encoded, LPT, RLPT, and SCLPT using
                                          Bzip2 and PPMD.
 

2. Execution Time Measurements
In making average timing measurements, we chose to compare with the Bzip2 algorithm, which so far has given one of the best compression ratios with lowest execution time. We also compare with the family of PPM algorithms, which gives the best compression rations but are very slow.


Note the above times are only for encoding and one can afford to spend more time off line encoding files, particularly if the difference in execution time becomes negligibly small with increasing file size which is true in our case. Our initial measurements on decoding times show no significant differences.

When we compared the average execution times with PPMD, the results are as follows:

3. Memory Usage Estimation:
For memory usage of the programs, the *-transform, LPT and RLPT need to load two dictionaries with 55K bytes each, requiring about 110K bytes and for SCLPT, it takes about 90K bytes because of a smaller dictionary size. It is claimed that Bzip2 uses 400K + (8 ? Blocksize) for compression. We used the –9 option, i.e., 900K of block size for the test. So, we need about 7600K. For PPM, it is programmed as about 5100K + file size. So the *-transform family takes insignificant overhead compared to Bzip2 and PPM in memory usage. It should be pointed out that all the above programs have not yet been well optimized. There is potential for less time and smaller memory usage.
 
 




Figure 2. Compression BPC over the corpus with different encoding scheme using Bzip2.
 
 


 Figure 3. Compression BPC over the corpus with different encoding schemes using PPMD.



                      Original:
 

Order
Count
%
Bits
%
Bytes
Ratio
BPC
5
114279
0.751
226297
0.682
28287
4.040
1.980
4
18820
0.875
47015
0.823 
5876
3.202
2.498
3
12306
0.956
35077
0.929
4384
2.807
2.850
2
5395
0.992
18118
0.984
2264
2.382
3.358
1
1213
1.000
4972
0.999
621
1.952
4.099
0
75
1.000
465
1.000
58
1.290
6.200
 
152088 
331944
         

                     Star-transform:
 

Order
Count
%
Bits
%
Bytes
Ratio
BPC
5
134377
0.860
280203
0.857
35025
3.837
2.085
4
10349 
0.926 
22823
 0.927
 2852 
3.628 
2.205
3
6346 
0.967 
12911 
0.966 
1613 
3.932 
2.035
2
3577 
0.989 
6791 
0.987 
848 
4.214 
1.899
1
1579 
0.999 
3789 
0.999 
473 
3.334 
2.400
0
79 
1.000 
409 
1.000 
51 
1.545 
5.177
 
156307
326926
         

                         LPT
 

Order
Count
%
Bits
%
Bytes
Ratio
BPC
5
123537 
0.790 
250873 
0.769 
31359
 3.939 
2.031
4
13897 
0.879 
29749 
0.860 
3718 
3.737 
2.141
3
10323 
0.945 
21804 
0.927 
2725 
3.788
 2.112
2
6548 
0.987 
15861 
0.976 
1982 
3.303 
2.422
1
1923 
0.999 
7459 
0.998 
932 
2.062 
3.879
0
79 
1.000 
513 
1.000 
64 
1.232
6.494
 
156307 
326259
         

                        RPT
 

Order
Count
%
Bits
%
Bytes
Ratio
BPC
5
124117 
0.794 
247458 
0.767 
30932 
4.013 
1.994
4
12880 
0.876 
26738 
0.850 
3342 
3.854 
2.076
3
10114 
0.941 
20159 
0.912 
2519 
4.014 
1.993
2
7199 
0.987 
20312 
0.975 
2539 
2.835 
2.822
1
1918 
0.999 
7549 
0.999 
943 
2.033 
3.936
0
79 
1.000
 474 
1.000 
59 
1.333 
6.000
 
156307 
322690
         

                       SCLPT
 

Order
Count
%
Bits
%
Bytes
Ratio
BPC
5
113141 
0.777 
235115 
0.735 
29389 
3.850 
2.078
4
15400 
0.883 
42847 
0.869 
5355 
2.875 
2.782
3
8363 
0.940 
17225 
0.922 
2153 
3.884 
2.060
2
6649 
0.986 
17140 
0.976 
2142 
3.103 
2.578
1
1943 
0.999 0
7248 
0.999 
906 
2.145 
3.73
0
79 
1.000 
458 
1.000 
57 
1.380 
5.797
 
145575 
320033
         

                             Table 3. Statistics of compressions in different context order for alice.txt with PPMD
 

2. Explanation of Observed Compression Performance
Recent research [A3][A4] shows that both the BWT and PPM algorithms explore the representation of context in the text either implicitly or explicitly. The longer and more deterministic the context is, i.e., higher order context, the higher is the probability estimation of the next predicted character leading to better compression ratio. All our transforms aim to create such contexts. Table 3 gives the statistics of the context for the sample text “alice.txt” using PPMD, as well as our four transforms along with PPMD. The first column is the order of the context. The second is the number of input bytes encoded in that order. The last row is the summation of all the bytes compressed at different context levels, namely, the file size. The third column is the cumulative percentage of column two. The fourth is the final number of bits used to encode all the input in that order. The fifth column shows the cumulative frequency of column four. The sixth is the corresponding bytes for bits statistics. The seventh is the compression ration of that order and the last column shows the BPC in that order.
Table 3 shows that length preserved transforms (*, LPT and RLPT) result in a higher percentage for high order contexts than the original PPMD algorithm. The advantage is accumulated in the different orders to gain an overall improvement. Although SCLPT has a smaller value of high order context compressions, it is compensated by compression on the deterministic contexts beforehand that could be in a higher order with long word length. We are continuing to research the precise relationship between the statistics of different order context and the overall compression ratio.
 

3. Compressed Data Delivery via the Internet
 Due to the increased growth in the traffic (refer to Table 4 from the NSF Website) there is a need to compress data transferred over the network. If not, growth would lead to congestion, drowning the Internet with the amount of data in transition.  Web browsing, FTP transfers and electronic mail generate most of the traffic on internet, which transfer basic text code as payload (viz., HTML, plain/text part in mails) with additional payloads of different formats (like attachments, inline images, word processing documents, etc.). One of the objectives of our project is to minimize the part of the basic payload that is plain text.  Reduction in text data transfer over the Internet will help in using the bandwidth for other types of data (like video and audio) transferred on the Internet. In our research, we incorporated new lossless text compression algorithms into file transfers that are made over some of the standard network protocols e.g., FTP, HTTP, POP3 and SMTP. These algorithms will be implemented as a part of the standard protocols defined at Session-Application layer over the transport layer in the OSI Model. The compressed file is transferred over the TCP/IP Network (official protocol for the Internet).
 

Year
Data
1998
588
1999
1,572
2000
4,451
2001
11,328
2003
27,645

                                          Table 4. Current and Projected transfer over Internet in Gigabits/sec
                                             (The data transfer include normal text, image, video and audio)
 

We are in the process of testing the above method for different compression/decompression methods, which have performed 3% to 30% better than standard compression algorithms (viz. Huffman, arithmetic, LZW, DMC, PPM*, BWT) under different network conditions and server loads. We will be evaluating several performance metrics like payload transform time (includes transformation and compression), payload transfer time and payload reverse transform time (includes decompression and reverse transformation). During the transfers there are different other parameters that will be evaluated, such as server performance, bandwidth use over high and low bandwidths.  These tests will reflect the overall performance in the transfer of files over the Internet.  We will be also looking into designing a protocol for managing a dictionary on the clients. This management should incorporate some of the features like dictionary update and propagation, different dictionaries for different languages and should also have versioning capabilities.

As for the initial stages in this approach we have procured all the material needed and have built the infrastructure that is necessary to start. We are developing a real-time compression test facility, which will be capable of testing encoding/compression algorithms for the above-mentioned parameters and for different standard protocols like FTP, HTTP, POP3 and SMTP. This facility will log different parameters mentioned above plus some miscellaneous information, which can be later used to evaluate the performance of the compression algorithms on text transfers over Internet.

The basic test facility has been developed to perform Internet transfers using FTP. The results obtained from this are very promising. The transfers were initiated by normal FTP clients from people on the same network as well as from different networks. These networks are at different distances from the location where the server runs. The compression method is employed on the server and we used the Bzip2 algorithm for compression as a starting point without any of the transforms RPT, LPT or SCLPT.  Results of the test runs are shown in Table 5.

As we can see in Table 5, the size of the files transferred is in the range of 0.3KB to 500KB. Our method always shows improvement in the transfer even if we include time involved in compressing the data. In one example, the size of the uncompressed data transferred is 3420.72 Kbytes and time taken to transfer that amount of data is 458.057 sec. When the same data is compressed and transferred, total data transferred is 743.097 Kbytes and time taken is 50.17sec. The time taken to compress 3420.72 Kbytes of data is 5.971sec (server was assumed to slightly loaded). Hence, 2677.63Kbytes of data transfer and (458.057 – 50.17 – 5.971) sec = 401.916sec was saved.  If we used any of the transforms mentioned earlier we will spend some amount of time in applying the transform but will gain in reducing the transmission time since data is much more compressed
 
FileName
Station Name
Size in KBytes
Type (N or C)
Payload Transformation Time
(Transformation Time, Compression Time)
Transfer Time in sec
testfile1.txt 
 LAB 
481.862 
0.000, 0.000 
35.456
testfile1.txt 
LAB 
145.076 
0.000, 1.448
 4.079
testfile1.txt 
LAB 
145.076 
0.000, 0.000 
4.349
testfile1.txt 
LAB
481.862 
0.000, 0.000 
36.185
Server.c 
LAB 
2.061 
0.000, 0.043 
0.371
Server.c 
LAB 
4.547 
0.000, 0.000 
1.114
Ftpcomp.c 
LAB 
11.097 
0.000, 0.000 
0.877
Ftpcomp.c 
LAB 
3.719 
0.000, 0.069 
0.184
Logfile.c 
LAB 
0.377 
0.000, 0.010 
0.571
Logfile.c 
LAB 
0.655
 N 
0.000, 0.000 
1.255
testfile1.txt 
ENG 
145.076 
0.000, 1.348 
10.990
testfile1.txt 
ENG 
481.862 
0.000, 0.000 
70.028
server.c 
ENG 
4.547 
0.000, 0.000 
1.798
server.c 
ENG 
2.061 
0.000, 0.046 
0.588
ftpcomp. 
ENG 
11.147 
0.000, 0.000 
4.156
ftpcomp.c 
ENG 
3.719 
0.000, 0.072 
0.532
server.c 
FSC 
2.061
 C 
0.000, 0.042 
0.481
server.c 
FSC 
4.547 
0.000, 0.000
0.775
ftpcomp.c 
FSC 
3.719 
0.000, 0.069 
0.734
ftpcomp.c 
FSC 
11.147 
0.000, 0.000 
0.910
testfile1.txt 
FSC 
 481.862 
0.000, 0.000 
16.916
testfile1.txt 
FSC 
145.076 
0.000, 1.368 
4.859
testfile2.txt 
FSC 
963.723 
0.000, 0.000 
145.164
testfile1.txt 
IND 
145.076 
0.000, 1.456 
22.432
testfile1.txt 
IND 
481.862 
0.000, 0.000 
143.423

Table 5: Payload transformation and transmission using Bzip2 compression algorithm at the server. Three stations were used: LAB located at UCF, FSC stands for Florida SolarEnergy Center, and IND is a place located in India. N= not compressed, C=Compressed.

Project Impact
We expect that our research will impact the future status of information technology by developing data delivery systems where communication bandwidths is at a premium and archival storage is an exponentially costly endeavor. We have developed new lossless text compression algorithms that have 3 to 30% improved compression ratio over the best known pre-existing compression algorithms which might translate into a reduction of more than 70% text traffic on the Internet.  We are developing software tools to include compression in standard Internet protocols, and we are planning to automate dynamic caching interaction with web browsers, e-mail programs and authoring tools.
We have taught (in the spring 2000 semester) a new course entitled "Multimedia Compression on the Internet": http://www.cs.ucf.edu/courses/cap5937-01/. This is a graduate level course with 14 students enrolled in the Spring 2000 semester.  This particular topic has grown directly out of the research that we have been conducting for the last couple of years on data compression.  Lecture topics have included both text and image compression, including topics from the research on the current NSF grant.
So far three Ph.D. students and five Masters students have participated and contributed in this research project, but not all of them received direct support from the grant. Dr. Robert Franceschini and Mr. Holger Kruse made significant contributions in the project before it was officially funded by NSF. Currently, one Ph. D. student (Nan Zhang ) and two Masters Students (Nitin Motgi and Raja Iqbal) are working on the project. Another Master student (Fauzia Awan) did a term project in the graduate course “Multimedia Compression on the Internet” directly related to this research project. Other members of the M5 Research Group at the School of Electrical Engineering and Computer Science, Dr. Kunal Mukherjee, Tao Tao, and Piyush Jamkhandi, made critical comments and observation during the course of this work.  The overall effect of these activities is to train graduate students with the current research on the forefront of technology.
 

Project References
This work was started several years ago; two early papers that established this work are the following:
  1. R. Franceschini and A. Mukherjee.  "Data Compression Using Encrypted Text", Proceedings of the Third Forum on Research and Technology, Advances in
      Digital Libraries, ADL96, May 13-15 1996, pp 130-138.
  2. H. Kruse and A. Mukherjee.  "Data Compression Using Text Encryption", Proceedings of the   Data Compression Conference, 1997, IEEE Computer Society
      Press, pp. 447.

We have submitted a journal version of a paper [1] and two other papers [2,3] are under preparation for submissions to conferences in the near future. Copies of these papers will be made available via our project website and also by hard copy
1. R. Franceschini, H. Kruse, N. Zhang, R. Iqbal and A. Mukherjee, “Lossless, Reversible Transformations that Improve Text Compression Ratios”, submitted to
    IEEE Transactions on Multimedia Systems (June 2000).
2. F. Awan, R. Iqbal and A. Mukherjee, “ A New Text Preprocessing Algorithm for Bzip2 and PPM*” (in preparation).
3. N. Motgi and A. Mukherjee, “ High Speed Text Data Transmission over Internet Using Compression Algorithm” (in preparation)

Area Background
In the last decade of the 20th century we have experienced an unprecedented increase in the amount of digital data transmitted across public networks, in particular across the Internet. Recent standards such as the World Wide Web, HTTP, HTML and image standards have provided free access to a vast amount of information available from public servers. With the continuously increasing use of the Internet the efficient use of available resources, in particular hardware resources, has become a primary concern. One way of improving performance is by compressing and caching data, i.e. by ensuring that data is not sent more often than necessary, that as little data is sent in response to each request as possible, and that all data transmitted across networks is compressed during transmission, to increase the efficiency by which the existing infrastructure is utilized.  This proposal is concerned with integrating new data compression and caching techniques for improving the efficiency of bandwidth-limited Internet.  We propose new algorithms to preprocess text in order to improve the compression ratio of textual documents, in particular online documents such as web pages on the World Wide Web.  We also develop similar techniques for images that do not change with time. We describe how our algorithm can be efficiently used to aid in the compression of documents consisting of large texts and static images, for transmission across public data networks such as the Internet, using existing standards such as HTTP and HTML. The compression algorithms also can be used to utilize archival storage effectively.  Our approach integrates the two basic ideas of compression and caching into a single mechanism, i.e., data is automatically cached whenever possible, avoiding retransmission altogether, and if data has to be transmitted, it is compressed in a very efficient way, which surpasses the performance of compression algorithms that are widely used at this time.
 

Area References
A1. I.H. Witten, A. Moffat, and T.C. Bell.  Managing Gigabytes.  2nd Ed, Morgan Kaufmann, 1999.   Managing Gigabytes Web Site
A2. K. Sayood.  Introduction to Data Compression.  2nd Ed, Morgan Kaufmann, 2000.  Introduction to Data Compression Web Site
A3. J. Cleary, W. Teahan, Unbounded Length Contexts for PPM, The Computer Journal, Vol. 36, No. 5, 1993. (Also see Proc. Data Compression Conference,
      Snowbird, Utah, 1997)
A4. Michelle Effros, PPM Performance with BWT Complexity: A New Method for Lossless Data Compression, Proc. Data Compression Conference, Snowbird,
      Utah, March, 2000

Potential Related Projects
There is an ongoing research project under the supervision of the Principal Investigator on Wavelet Based Image Compression and Transmission supported by Intel Corporation. A research proposal on this subject is under preparation for submission to NSF in the near future.