Collaborative:
Compressed Domain Search for Text and Images by Sorted Contexts
Project Award Number IIS-0207819
– University of Central Florida
IIS-0228370 – West Virginia University.
Amar Mukherjee
(Principal Investigator)
Department of Computer Science
School of Electrical Engineering and Computer Science,
University of Central Florida,
Orlando, FL-32816.
Phone: (407) 823-2763
Fax: (407) 823-5419
Email: amar@cs.ucf.edu
http://vlsi.cs.ucf.edu/director.html
Prof. Donald A. Adjeroh (Co-Principal Investigator)
Lane Department of Computer Science and Engineering
West Virginia University
Morgantown, WV 26506-6109304-293 - 0405 Ext. 2567Email: adjeroh@csee.wvu.eduhttp://www.csee.wvu.edu/~adjeroh/
Head, Department of Computer Science
University of Canterbury
Christchurch
New Zealand
Phone: +64 3 364-2352
Fax: +64 3 364-2569
Email: tim@cosc.canterbury.ac.nz
http://www.cosc.canterbury.ac.nz/~tim/
http://vlsi.cs.ucf.edu/nsfCollaborativeCompressedIDM2003x2.html
Nan Zhang, Graduate Research Assistant
Weifeng Sun, Graduate Research Assistant
Ravi Vijaya Satya, Graduate Research Assistant
Tao Tao, Graduate Research Assistant
Yong Zhang, Graduate Research Assistant
Jalumuri Nandakishore, Graduate Research Assistant
Matt Powell, Research Assistant
Andrew Firth, Research Assistant
· Duration: Period of performance of entire project, 6/01/2002-/30/2005
· Current Reporting Period: 6/1/2002-5/30/2003
Data compression, lossless text compression, compressed domain pattern matching, Burrows-Wheeler transform
With the dramatic increase in the amount of multimedia data, efficiency considerations make it important to consider ways to keep the data in the compressed form for as much as possible, even when it is searched. Our goal is to developed methodologies for searching directly on compressed text and images, when the compression is based on the family of compression algorithms that depend on sorted context. We also plan to develop search-aware data compression schemes for text and images based on sorted context that will support compressed-domain search directly on the compressed data. For text, we will explore the applicability of the classical linear exact pattern matching algorithms and binary search algorithms for text compressed with the Burrows-Wheeler transform. We will develop similar compression schemes for images based on sorted image contexts.
Targeted Activities and Findings
We developed compressed pattern matching algorithms that achieve the complexity bounds of linear exact pattern matching algorithms for text compressed with the sorted context approach and also algorithms based on binary and q-gram search of sorted context generated by the Burrows-Wheeler transform. We studied the feasibility and utility of introducing error prediction (as used in context-based lossless image compression) at one or more stages of the compression process. We also started preliminary work on compressed domain approximate pattern matching.
We develop two techniques for searching BWT transformed text using Boyer-Moore algorithm and binary search. The first technique applies the Boyer-Moore algorithm on characters that match while they are decompressed and skips the part of the compressed file that cannot possibly lead to any match. The second technique is based on the observation that the BWT transform contains a sorted list of all substrings of the original string, which can be exploited for rapid searching using a variant of binary search. Both techniques are faster than decompress-then-search approach for small number of queries, and binary search is even faster for large number of queries. The attached file gives further details of our approach.
The sorted context of the BWT transformed text also forms the basis of a pattern search algorithm which uses the q-grams of the pattern against the sorted q-grams of the text. The decoder only has limited information about the sorted context, but fast q-gram (context) generation and matching algorithms have been developed to exploit this with the help of auxiliary index arrays built in linear time. The algorithm (we call it the QGRAM algorithm) first computes the index arrays for the correspondence between F, the first column of the sorted cyclic matrix in BWT and T. All exact matches are grouped in consecutive rows of the sorted matrix, which makes binary searches on F and/or q-grams of the matrix very fast. We also developed a new algorithm, called QGREP, which improved on the sublinear search time of the binary search and QGRAM algorithms.
We have started working on ways to extend the algorithms for approximate pattern matching, especially the k-mismatch problem, and the k-approximate matching problem.
For
image compression, we have been studying algorithms for adaptive scanning-path
for BWT-based lossless image compression. The methods use image statistics to
predict the activity in the image. Based on this, and the nature of transformed
output from the BWT, the algorithms determine the scanning path to use for the
given part of the image. This provides adaptability in the scanning path
without the time consuming problem of explicit edge detection or image
segmentation.
The work will lead to new and efficient algorithms for searching multimedia data in compressed form and development of search-aware compression algorithms. Several Ph.D. and Masters students have participated and contributed in this research project, but not all of them received direct support from the grant. Individual Co-PIs meet with graduate students at their respective universities on a regular basis to discuss research problems. The students acquire the necessary skills to search literature and carry on an in-depth study and research in a field. The students are also asked to make presentations on their work. This gives the students experience of teaching graduate level courses and seminar presentations. The overall effect of these activities is to train graduate students with the current research on the forefront of technology. Each one of them acquired valuable experience in undertaking significant programming tasks.
The compressed pattern matching problem is to perform pattern matching directly on compressed or partially compressed data. The motivation for this work includes potential reduction of delay in response time compared to decompress-then-search approach, and the possible elimination of time to compress data back. Another motivation is the reduction of storage usage and speed up in processing smaller amount of data. The search-aware algorithms are concerned with introducing additional structure in existing or new data compression algorithms to make the data more easily searchable and retrievable at a later time. This is a relatively new field of research and few references relevant to the topic are given below.
1. M. Amir, G. Benson and M. Farach, “Let Sleeping Files Lie: Pattern Matching in Z compressed Files”, Journal of Computer and System Sciences, Vol. 52, pp.299-306,1996.
2. T. Bell , D. Adjeroh and A. Mukherjee, “Pattern Matching in Compressed Text and Images”, Technical Report, Department of Computer Science, University of Canterbury, New Zealand, 2001.
3. U. Manber, “ A Text compression scheme that allows fast searching directly in compressed file”, ACM Transactions on Information Systems, Vol.52, N0.1, pp.124-136, 1997.
4. I.H. Witten, A. Moffat, and T.C. Bell, Managing Gigabytes, 2nd Edition, Morgan Kaufmann Publishers, 1999.
5. Alistair Moffat and Andrew Turpin, “Compression and Coding Algorithms”, Kluwer Academic Publishers, 2002
6. D. Salomon, Data Compression, 2nd Edition, Springer Verlag, 2000.
7. K. Sayood, Introduction to Data Compression, 2nd Edition, Morgan Kaufmann, 2000.
8. Using Compression on Webservers IIS 5.0 http://www.microsoft.com/TechNet/iis/httpcomp.asp
9. Compression for HTML Streams http://www.w3.org/Protocols/HTTP/Performance/Pipeline.html
A highly relevant and related research grant proposal by the same investigators and collaborators entitled “ITR Collaborative Research: Compressed Search and Retrieval for Very Large Text and Image Repositories” (Proposal Number 0312724) has been recommended for an award by the CISE Directorate of NSF as of 7/25/03.