Part V: Spark Streaming Programming Guide

Part IV of the Spark Streaming Programming Guide (1) discussed the execution mechanism of Spark Streaming, transformations and output operations, and Spark Streaming data sources and sinks. This article continues from that content and mainly covers the following: stateful computation, time-based window operations, persist ...
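The window semantics mentioned in this teaser (a window length and a sliding interval) can be illustrated without Spark at all. The sketch below is a pure-Python analogy, not Spark Streaming code; the event data and parameter names are illustrative.

```python
# Pure-Python illustration of time-based window semantics: each window
# covers [t - window_length, t) and windows advance by slide_interval.
from collections import Counter

def windowed_counts(events, window_length, slide_interval, end_time):
    """Count words in each window [t - window_length, t) for
    t = slide_interval, 2*slide_interval, ... up to end_time."""
    results = {}
    t = slide_interval
    while t <= end_time:
        in_window = [w for (ts, w) in events if t - window_length <= ts < t]
        results[t] = Counter(in_window)
        t += slide_interval
    return results

events = [(1, "a"), (2, "b"), (3, "a"), (7, "b")]
# A 4-second window sliding every 2 seconds, evaluated up to t = 8.
counts = windowed_counts(events, window_length=4, slide_interval=2, end_time=8)
```

In Spark Streaming the equivalent idea is expressed with operations such as `reduceByKeyAndWindow`, where the window length and slide interval must both be multiples of the batch interval.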

Posted by predhtz on Tue, 24 May 2022 22:24:57 +0300

Spark Learning -- Spark SQL

Spark SQL is a module that Spark uses to process structured data. It provides a programming abstraction called DataFrame and acts as a distributed SQL query engine. In Hive, Hive SQL is converted into MapReduce jobs and then submitted to the cluster for execution, which greatly simplifies the complexity of writing MapReduce programs, bec ...

Posted by SoaringUSAEagle on Mon, 23 May 2022 18:39:54 +0300

Hudi Spark SQL summary

Preface: This article summarizes the use of Hudi Spark SQL, taking Hudi version 0.9.0 as an example; some changes in the latest version will also be mentioned briefly. Hudi has supported Spark SQL since version 0.9.0. The support was contributed by pengzhiwei of Alibaba. Pengzhiwei is no longer responsible for Hudi, but is instead in the c ...

Posted by TheBentinel.com on Thu, 19 May 2022 00:31:54 +0300

Programming practice of Spark Streaming+Kafka based on Python

Explanation: There are many articles explaining the principles of Spark Streaming, so they will not be repeated here. This article mainly introduces the programming model, coding practice, and some optimization notes, using Kafka as the data source. Spark Streaming: http://spark.apache.org/docs/1.6.0/streaming-programming-guide.html streaming-kafka ...

Posted by Adam_28 on Mon, 16 May 2022 21:59:57 +0300

An unavoidable topic in data development - data skew

Foreword: Data skew is the most common problem in data development, and it is also a question that is almost always asked in interviews. So why does data become skewed? When does data skew occur? And how can it be solved? What data skew is: the essence of data skew is an uneven data distribution; some tasks process a large amount of data, which leads to a longe ...
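One standard remedy for the skew this teaser describes is "salting": spread a hot key across N salted keys, aggregate the salted keys first (which parallelizes evenly), then strip the salt and merge. The sketch below is a pure-Python analogy of that two-stage aggregation, not Spark code; the key names and salt count are illustrative.

```python
# Two-stage aggregation with a random salt: stage 1 counts salted keys
# (e.g. "hot#3"), stage 2 strips the salt and merges partial counts.
import random
from collections import defaultdict

def salted_two_stage_count(records, num_salts=4, seed=0):
    rng = random.Random(seed)
    # Stage 1: attach a random salt so one hot key becomes num_salts keys.
    stage1 = defaultdict(int)
    for key in records:
        salted = f"{key}#{rng.randrange(num_salts)}"
        stage1[salted] += 1
    # Stage 2: remove the salt and merge the partial counts per real key.
    stage2 = defaultdict(int)
    for salted, count in stage1.items():
        stage2[salted.rsplit("#", 1)[0]] += count
    return dict(stage2)

records = ["hot"] * 10 + ["cold"] * 2
counts = salted_two_stage_count(records)
```

In Spark the same idea is applied by concatenating a random prefix onto the key before the first `reduceByKey`/`groupBy`, then removing it before the second aggregation; the final counts are identical to the unsalted result, but no single task has to process the whole hot key.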

Posted by Paris! on Fri, 13 May 2022 21:01:37 +0300

Comparison of Spark on k8s and the spark-on-k8s operator

For current Kubernetes-based Spark applications, there are mainly two ways to run them: Spark on k8s, which Spark supports natively, and the k8s-based operator, spark-on-k8s-operator. The former is the k8s client implementation introduced by the Spark community to support k8s as a resource management framework; the latter is an operator developed by the k8s ...
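For the first approach (Spark's native k8s support), submission goes through plain `spark-submit` with a `k8s://` master URL. The command below is an illustrative configuration sketch only: the API server URL, namespace, and container image are placeholders, not values from the article.

```shell
# Illustrative spark-submit for native Spark on k8s (cluster mode).
# Master URL, namespace, and image below are placeholders.
spark-submit \
  --master k8s://https://kubernetes.example.com:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.namespace=spark-jobs \
  --conf spark.kubernetes.container.image=example/spark:3.2.1 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.2.1.jar
```

With the operator approach, by contrast, the job is described declaratively as a `SparkApplication` custom resource and the operator handles submission and lifecycle management.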

Posted by cyball on Wed, 11 May 2022 04:22:32 +0300

Spark core components, runtime architecture, and RDD creation

Spark core components: Before explaining the Spark architecture, let's first look at several core components of Spark and understand their functions. 1. Application: a Spark application; a user program built on Spark, including the Driver code and the code running in the Executor on each node of the cluster. It consists of one or more Jobs. As show ...

Posted by Someone789 on Sun, 08 May 2022 23:43:36 +0300

Analysis of five JOIN strategies of Spark

The JOIN operation is a very common data processing operation. As a unified big data processing engine, Spark provides rich support for JOIN scenarios. This article introduces the five JOIN strategies provided by Spark, hoping it will be helpful. It mainly covers the following: factors affecting JOIN operations; the five strategies implemented ...
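Two of Spark's five strategies (Broadcast Hash Join and Shuffle Hash Join) share one core idea: build a hash table on the smaller ("build") side and stream the larger ("probe") side through it. The sketch below is a pure-Python illustration of that idea, not Spark internals; the tables and column names are made up.

```python
# Hash join: build a hash table on the smaller relation, then probe it
# with each row of the larger relation (equi-join on one key).
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    table = defaultdict(list)
    for row in build_rows:            # build phase: smaller relation
        table[row[build_key]].append(row)
    out = []
    for row in probe_rows:            # probe phase: larger relation
        for match in table[row[probe_key]]:
            out.append({**match, **row})
    return out

dims = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
facts = [{"id": 1, "v": 10}, {"id": 1, "v": 20}, {"id": 3, "v": 30}]
joined = hash_join(dims, facts, build_key="id", probe_key="id")
```

The difference between the two Spark strategies is where the build happens: a broadcast hash join ships the whole small table to every executor, while a shuffle hash join first shuffles both sides by key so each task builds a hash table for only its own partition.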

Posted by Helminthophobe on Sat, 07 May 2022 15:14:37 +0300

[Frontiers of Trading Technology] Experience Sharing of Real-time Computing System Construction

Abstract: Real-time computing technology has been applied to various fields such as advertising, e-commerce, games, and entertainment. For example, e-commerce websites analyze user attributes in real time, and push relevant products to customers based on the analysis results; online games analyze player data in real time, and then analyze game ...

Posted by alvinho on Fri, 06 May 2022 03:10:34 +0300

Secondary development of Spark: enabling JDBC reads of Hive data from a remote tenant cluster and implementing Hive2Hive data integration with this cluster's Hive [Java]

Background: Developers who only write SQL may only know that after pyspark constructs the SparkSession object (with enableHiveSupport enabled, of course), you write a SQL statement: spark.sql("write a SQL string here"); and Spark will then carry out the various operations of selecting and querying data and inserting or overwriting data into the result table according to that SQL. ...
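The Hive2Hive flow the teaser hints at (read a remote cluster's Hive over JDBC, write into the local cluster's Hive) can be sketched as follows. This is a hypothetical sketch, not the article's code: the JDBC URL, table names, and credentials are placeholders, and the driver class assumes HiveServer2's standard JDBC driver.

```python
# Hypothetical Hive2Hive sketch. Only the option-building helper runs here;
# the Spark calls that would consume it are shown in the comment below.

def jdbc_read_options(url, table, user, password,
                      driver="org.apache.hive.jdbc.HiveDriver"):
    """Build the option map for spark.read.format('jdbc')."""
    return {"url": url, "dbtable": table, "user": user,
            "password": password, "driver": driver}

opts = jdbc_read_options("jdbc:hive2://hive2.example.com:10000/db",
                         "db.src_table", "etl_user", "secret")

# With a Hive-enabled SparkSession, the integration would look like:
#   spark = (SparkSession.builder.appName("hive2hive")
#            .enableHiveSupport().getOrCreate())
#   df = spark.read.format("jdbc").options(**opts).load()
#   df.write.mode("overwrite").saveAsTable("local_db.dst_table")
```

The point of the JDBC path is that the remote tenant cluster's metastore never needs to be visible to this cluster; only its HiveServer2 endpoint does.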

Posted by Fearless_Fish on Wed, 04 May 2022 16:16:50 +0300