Real-Time Analytics Optimization Using Apache Spark Structured Streaming: A Lambda Architecture-based Scala Framework
Keywords:
real-time analytics, Apache Spark, Structured Streaming, Lambda architecture, Big Data
Abstract
In the modern enterprise, real-time big data analytics plays a crucial role, and this ecosystem requires scalable, fault-tolerant, low-latency processing frameworks. This research introduces a Scala-based implementation of the Lambda architecture on Apache Spark, designed specifically to optimize real-time analytics by integrating Spark Structured Streaming with batch processing.