Scalable High-dimensional Index Design for Code Search Systems
- Mu-Woong Lee
- This research addresses the problem of supporting scalable code similarity search over large-scale software repositories. Commercial code search engines are available, but they treat software as text and often fail to find semantically related code. Meanwhile, existing tools for semantic code clone search take a “post-mortem” approach, detecting clones only after code development is completed, and therefore cannot return results instantly. The goal of this research is to combine the strengths of these two lines of work. To achieve this goal, an indexing structure on vector abstractions of code is proposed. Because these vector abstractions are naturally high-dimensional, the index uses dimension reduction techniques to handle them efficiently. The search system is then integrated into real-world development sessions. With this integration, every code segment a developer writes is posed as a query to the software code corpus, so the developer can instantly reference relevant code segments while writing code, which enhances productivity. This integration scenario creates the need for efficient similarity search with the following requirements. First, a developer session translates into a sequence of evolving queries that must be supported efficiently. Second, the quality of the results must be controlled; for example, dealing with licenses requires that there be no false negatives. To satisfy these requirements, a workload-aware striping framework for high-dimensional evolving queries is proposed. This framework can be used to boost most existing high-dimensional indexes. In addition, to further enhance the scalability of code search systems, a workload-balancing distributed indexing structure is proposed. Existing efforts in distributed indexing have aimed to localize each query to data residing at a small number of nodes (i.e., locality-preserving indexing) in order to minimize communication cost. However, because workloads often correlate with data locality, such indexing often generates hotspots. Hence, workload balancing is proposed as an optimization goal, and a distributed index that evenly distributes the workload is presented.
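The requirement that dimension reduction must not introduce false negatives can be illustrated with a generic filter-and-refine range search. This is a minimal sketch under stated assumptions, not the index proposed in this work: the abstract does not name the reduction technique, so the sketch assumes the simplest one (keeping the first `k` coordinates), and every name in it (`reduce_dims`, `FilterRefineIndex`, `brute_force`) is hypothetical. The key property is that dropping coordinates can only shrink a Euclidean distance, so any vector within the query radius in the full space is also within the radius in the reduced space and survives the filter step.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def reduce_dims(vec, k):
    """Assumed reduction (not from the source): keep the first k coordinates.

    Distances in the reduced space lower-bound distances in the full space.
    """
    return vec[:k]


class FilterRefineIndex:
    """Hypothetical two-stage range search over code vectors."""

    def __init__(self, vectors, k):
        self.k = k
        self.vectors = vectors
        self.reduced = [reduce_dims(v, k) for v in vectors]

    def candidates(self, query, radius):
        # Filter: cheap test in the reduced space. It may admit false
        # positives, but it cannot discard a true result.
        q_red = reduce_dims(query, self.k)
        return [i for i, r in enumerate(self.reduced)
                if euclidean(q_red, r) <= radius]

    def range_query(self, query, radius):
        # Refine: exact test in the full space removes the false positives.
        return [i for i in self.candidates(query, radius)
                if euclidean(query, self.vectors[i]) <= radius]


def brute_force(vectors, query, radius):
    """Reference answer: exact scan of every full-dimensional vector."""
    return [i for i, v in enumerate(vectors) if euclidean(query, v) <= radius]


def demo(seed=0, n=200, dims=16, k=4, radius=1.0):
    """Compare the two-stage search with an exact scan on random vectors."""
    rng = random.Random(seed)
    vectors = [[rng.random() for _ in range(dims)] for _ in range(n)]
    query = [rng.random() for _ in range(dims)]
    index = FilterRefineIndex(vectors, k)
    return index.range_query(query, radius) == brute_force(vectors, query, radius)
```

In this sketch the filter step scans all reduced vectors linearly; a real system would presumably index the reduced vectors with a spatial structure, and the same lower-bounding argument would keep the result set free of false negatives.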