
IJCTE 2022 Vol.14(1): 15-19 ISSN: 1793-8201
DOI: 10.7763/IJCTE.2022.V14.1305

Research on Runtime Query Optimization Technology of Spark SQL

Yong Zhao and Rong Chen

Abstract—Spark SQL describes data analysis tasks in SQL and optimizes them using query optimization theory, which effectively improves execution efficiency. However, query optimization in Spark SQL currently has two shortcomings. First, it requires the operator to collect statistics explicitly by issuing statistics collection commands. Second, because the collected statistics are not accurate enough, the optimization effect can be poor. To address these problems, this paper proposes an algorithm that collects statistics at runtime and optimizes the query adaptively. The algorithm uses Bloom Filter Pruning (BFP) to prune data that does not satisfy the join conditions before a join operation is executed. To estimate the cardinality of intermediate join relations more accurately, the algorithm uses an AMS Sketch and a Bloom Filter. Finally, the algorithm generates a join plan based on the connectivity of the join graph. Experiments show that the BFP algorithm can prune the input of a join by up to 12% without considering the join order. The join plan generation algorithm produces the optimal plan in 14 out of 18 queries without pre-collected statistics and saves up to 31% of execution time, while the time spent collecting statistics is no more than 5% of the total execution time.
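The Bloom Filter Pruning idea mentioned in the abstract can be illustrated with a small, self-contained Python sketch. This is not the authors' implementation and does not use Spark. The class `BloomFilter`, the helpers `bloom_prune` and `hash_join`, the filter size, and the toy `orders` and `customers` relations are all assumptions made for illustration. The sketch builds a Bloom filter over the join keys of one relation and drops rows of the other relation whose key cannot match. A Bloom filter has no false negatives, so the join result is unchanged.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over join keys (illustrative, hypothetical helper)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 digest with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key!r}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # True for every added key; may also be True for a few keys never added.
        return all(self.bits[pos] for pos in self._positions(key))


def bloom_prune(rows, key_of, bloom):
    """Keep only rows whose join key may appear on the other side of the join."""
    return [row for row in rows if bloom.might_contain(key_of(row))]


def hash_join(left, right, left_key, right_key):
    """Plain in-memory hash join, used here to check that pruning is safe."""
    index = {}
    for r in right:
        index.setdefault(right_key(r), []).append(r)
    return [(l, r) for l in left for r in index.get(left_key(l), [])]


# Toy relations (assumed data): orders(order_id, customer_id), customers(customer_id, name).
orders = [(i, i % 50) for i in range(200)]
customers = [(c, f"name{c}") for c in range(0, 50, 5)]

# Build the filter on the smaller side, then prune the larger side before joining.
bloom = BloomFilter()
for customer in customers:
    bloom.add(customer[0])

pruned_orders = bloom_prune(orders, lambda o: o[1], bloom)
joined_full = hash_join(orders, customers, lambda o: o[1], lambda c: c[0])
joined_pruned = hash_join(pruned_orders, customers, lambda o: o[1], lambda c: c[0])
```

In this toy setting, `joined_pruned` equals `joined_full` while the join reads fewer `orders` rows. How much input is pruned in practice depends on the data and on the filter's false positive rate, which this sketch does not attempt to tune.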

Index Terms—Query optimization, Spark SQL, Bloom filter, sketch.

The authors are with the University of Electronic Science and Technology of China, Chengdu, 611731, China (e-mail: yongzhao1@qq.com).


Cite: Yong Zhao and Rong Chen, "Research on Runtime Query Optimization Technology of Spark SQL," International Journal of Computer Theory and Engineering, vol. 14, no. 1, pp. 15-19, 2022.

Copyright © 2022 by the authors. This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).

