Scalable indexing for large-scale distributed storage systems
Overview
An object storage system stores objects (files) identified by keys. An object contains raw, uninterpreted data, and the key specifies its location (address). The Scality object storage system, known as The Ring, is highly distributed, uses replication and erasure coding for fault-tolerance, supports hierarchical archival storage, and is designed for high throughput and low latency. Its design leverages the separation of data files from metadata (the key-to-location database): data is essentially immutable, while adding, removing and updating objects is done by modifying the meta-data. Thanks to this design, The Ring is able to store petabytes of data, updated at a very high rate.
In order to support fast search, the system requires an indexing sub-system that tracks the inverse mapping from content (patterns of data or tags) to key. Indexing a large, distributed, heavily-updated storage system is a difficult problem.
Research topic
The research problem to be solved in this PhD is to design a flexible distributed indexing and search system that scales to petabytes, and that remains up-to-date incrementally as the store is updated.
Some specific requirements include the following:
- The index itself is to be split into metadata and data, stored separately, preferably in the primary metadata/data storage areas.
- Index computation shall leverage the in-disk computation capabilities of modern programmable IP drives [ 22 ].
- Similarly, query resolution should execute partly on the metadata ring and partly on the data drives themselves.
- The logic will therefore be heavily decentralized and distributed.
- The indexing subsystem shall track updates (adding, removing or modifying files) incrementally as they occur.
- The memory requirements for the index shall remain small (in the area of 0.1x to 3x data size, depending on use case).
The PhD student will design and implement a decentralised indexing and querying system that solves the above problem efficiently.
State of the art
Formally, the problem of index search is related to subset pattern matching, and more specifically to the distributed pattern matching problem (DPM) [ 1 ].
Existing centralized algorithms are not suitable to the more peer-to-peer environment of an object storage system [ 2 ].
Many distributed indexing algorithms exist in the literature, e.g., inverted indexes [ 5, 6, 7, 15 ], hypercubes [ 8 ] and unit vector matching [ 9 ]. These are often designed for Distributed Hash Tables (DHTs), and focus on efficient DHT routing, for instance using Bloom filters [ 10, 11 ] or erasure coding with super-peers [ 1, 21 ]. Conversely, algorithms based on skip-lists do not require a DHT, but instead use multi-level indexing mechanisms based on hierarchies of lists [ 12, 13, 14 ].
Alternatively, there exists a variety of file systems optimized for indexing [ 16, 17, 18, 19 ]. In contrast, the desired approach should not depend on any specific file system and should integrate into the object store at a low level.
Advisors
The research is advised jointly by Marc Shapiro and Vianney Rancurel (Dir. of Research of Scality), in cooperation with Pr. Umut Acar.
It is funded by a "bourse CIFRE" (joint industry-academia research).
References
- [ 1 ] R. Ahmed. "Efficient and Flexible earch in Large Scale Distributed Systems". PhD Thesis, University of Waterloo, 2007.
- [ 2 ] R. Cole and R. Harihan. Tree pattern matching and subset matching in randomized o(n log 3 m) time. In Proceedings of ACM Symposium on Theory of Computing (STOC), pages 66–75, 1997.
- [ 3 ] A. Amir, E. Porat, and M. Lewenstein. Approximate subset matching with don’t cares. In Proceedings of Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 305–306, 2001.
- [ 4 ] R. Cole, R. Hariharan, and P. Indyk. Tree pattern matching and subset matching in deterministic o(n log3 n)-time. In Proceedings of Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 245–254, Philadelphia, PA, USA, 1999.
- [ 5 ] M. Harren, J. M. Hellerstein, R. Huebsch, B. T. Loo, S. Shenker, and I. Stoica. Complex queries in DHT-based Peer-to-Peer networks. In Proceedings of International Workshop on Peer-to-Peer Systems (IPTPS), pages 242–259, 2002.
- [ 6 ] I. Clarke, O. Sandberg, B. Wiley, and T. W. Hong. Freenet: A distributed anonymous information storage and retrieval system. Lecture Notes in Computer Science (LNCS), 2009:46–66, 2001.
- [ 7 ] L. Liu, K. D. Ryu, and K. Lee. Supporting efficient keyword-based file search in Peer-to-Peer file sharing systems. In Proceedings of GLOBECOM, 2004.
- [ 8 ] Y. Joung, L. Yang, and C. Fang. Keyword search in DHT-based Peer-to-Peer networks. IEEE Journal on Selected Areas in Communications (JSAC), 25(1):46–61, January 2007.
- [ 9 ] C. Tang, Z. Xu, and M. Mahalingam. pSearch: information retrieval in structured overlays. ACM SIGCOMM Computer Communication Review, 33(1):89–94, 2003.
- [ 10 ] M. Li, W. Lee, and A. Sivasubramaniam. Neighborhood signatures for searching P2P networks. In Proceedings of Seventh International Database Engineering and Applications Symposium (IDEAS), pages 149–159, 2003.
- [ 11 ] S. Rhea and J. Kubiatowicz. Probabilistic location and routing. In Proceedings of IEEE INFOCOM, 2002.
- [ 12 ] N. Harvey, M. B. Jones, S. Saroiu, M. Theimer, and A. Wolman. SkipNet: A scalable overlay network with practical locality properties. In Proceedings of the USENIX Symposium on Internet Technologies and Systems (USITS), March 2003.
- [ 13 ] J. Aspnes and G. Shah. Skip graphs. In Proceedings of Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 384–393, 2003.
- [ 14 ] S. Shi, G. Yang, D. Wang, J. Yu, S. Qu, and M. Chen. Making Peer-to-Peer keyword searching feasible using multi-level partitioning. In Proceedings of International Workshop on Peer-to-Peer Systems (IPTPS), pages 151–161. Springer, 2004.
- [ 15 ] N. Lester, A. Moffat, and J. Zobel. Fast on-line index construction by geometric partitioning. In Proceedings of the 2005 International Conference on Information and Knowledge Management Systems (CIKM ’05), pages 776–783, November 2005.
- [ 16 ] D. Giampalo. Practical File System Design with the Be File System. Morgan Kaufmann, 1st edition, 1999.
- [ 17 ] Kiran-Kumar Muniswamy-Reddy, David A. Holland, Uri Braun, and Margo Seltzer. Provenance-aware storage systems. In Proceedings of the 2006 USENIX Annual Technical Conference, pages 43–56, Boston, MA, 2006.
- [ 18 ] Udi Manber and Sun Wu. GLIMPSE: A tool to search through entire file systems. In Proceedings of the Winter 1994 USENIX Technical Conference, 1994.
- [ 19 ] Yu Hua, Hong Jiang, Yifeng Zhu, Dan Feng, and Lei Tian. SmartStore: A new metadata organization paradigm with metadata semantic-awareness for next-generation file systems. In Proceedings of SC09, Portland, OR, November 2009.
- [ 20 ] Andrew W. Leung. Organizing, indexing and searching large-scale file systems. Technical Report UCSC-SSRC-09-09. University of California, Santa Cruz. December 2009.
- [ 21 ] R. Ahmed and R. Boutaba. "Plexus: a scalable peer-to-peer protocol enabling efficient subset search." Networking, IEEE/ACM Transactions on 17.1 (2009): 130-143.
- [ 22 ] https://developers.seagate.com/display/KV/Kinetic+Open+Storage+Documentation+Wiki
Marc.Shapiro =at= acm.org
Last modified: Mon Sep 24 15:32:21 CEST 2018