
Nested JSON data processing using Apache Spark with Coding

This article walks through nested JSON data processing using Apache Spark, with the code needed at each step; read to the end for the full example.



Presentation Transcript


  1. Nested JSON Data processing using Apache Spark

  2. Instructions for use Let us read a public JSON dataset available on the internet, extract the required fields from the nested data, and analyze the dataset to get some insights. I'm using the Baby Names public dataset available on the internet for this demo. What are we performing in this demo? ╺ Read data from the URL using the Scala API ╺ Convert the data into a dataframe ╺ Extract the required fields from the nested JSON dataset ╺ Analyze the data by writing queries ╺ Visualize the processed data

  3. After reading the dataset from the URL, we use the jsonString val created above and create a dataframe using the Spark API. We need to import spark.implicits._ to convert a Sequence of Strings to a Dataset, and then we create a dataframe out of it.
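That read-and-convert step might look like the sketch below. The dataset URL is an assumption (the slides only say "a public Baby Names dataset"), so substitute whichever public JSON endpoint you are using:

```scala
import scala.io.Source
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("NestedJsonDemo").getOrCreate()
import spark.implicits._ // needed to convert Seq[String] to a Dataset

// Hypothetical URL for a Baby Names public dataset; replace with your own source
val url = "https://example.org/baby-names/rows.json"
val jsonString = Source.fromURL(url).mkString

// Wrap the whole document in a one-element Dataset[String] and let Spark parse it
val jsonDF = spark.read.json(Seq(jsonString).toDS)
```

Reading the entire document into one string is fine for a demo-sized dataset; for large files you would point spark.read.json at a path instead.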

  4. Now let us see the schema of the JSON using the printSchema method:

  5. jsonDF.printSchema gives:
     |-- data: array (nullable = true)
     |    |-- element: array (containsNull = true)
     |    |    |-- element: string (containsNull = true)
     It also contains metadata about the data; let's not worry about that for now, but you can have a look at it when you run this on your machine. The metadata mainly holds column information, which I have extracted below so you have a better understanding of the data we will work on.

  6. We have the below fields within the data array that we are going to analyze: ╺ sid ╺ id ╺ position ╺ created_at ╺ created_meta ╺ updated_at ╺ updated_meta ╺ meta ╺ year ╺ first_name ╺ county ╺ sex ╺ count

  7. But how can we extract these data fields from the JSON? Let's select the data column from the jsonDF dataframe we created. It looks something like this:

  8. Now we have to extract the fields within this data. To do this, let us first create a temporary view of this dataframe and use the explode function to extract the year, name, county, and sex fields. To use the explode method, we must first import the Spark SQL functions.
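A sketch of that extraction step. The array index positions below are assumptions based on the column order listed earlier (sid, id, position, created_at, created_meta, updated_at, updated_meta, meta, year, first_name, county, sex, count); verify them against the metadata when you run this yourself:

```scala
import org.apache.spark.sql.functions.{col, explode}

// Each element of `data` is itself an array of strings, one entry per column
val exploded = jsonDF.select(explode(col("data")).as("row"))

// Index positions assumed from the column order in the dataset metadata
val insightData = exploded.select(
  col("row").getItem(8).as("year"),
  col("row").getItem(9).as("name"),
  col("row").getItem(10).as("county"),
  col("row").getItem(11).as("sex"),
  col("row").getItem(12).as("count")
)
```

Because the inner array elements are plain strings, every extracted column comes out as a string; cast year and count to numeric types if you need arithmetic on them.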

  9. Now let us see the schema of insightData.

  10. Let me show you the contents of the insightData dataframe using the display method available in Databricks.
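Outside Databricks, where display is not available, show gives a comparable plain-text preview:

```scala
// display(insightData) is Databricks-only; show works in any Spark shell
insightData.show(5, truncate = false)
insightData.printSchema()
```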

  11. Now let us write a query to find the most popular first letter for baby names in each year.
      insightData.select("year", "name").createOrReplaceTempView("yearname")
      val dis = spark.sql("""
        select year, firstLetter, count, ranks
        from (
          select year, firstLetter, count,
                 rank() over (partition by year order by count desc) as ranks
          from (
            select year, left(name, 1) as firstLetter, count(1) as count
            from yearname
            group by year, firstLetter
          ) Y
        ) Z
        where ranks = 1
        order by year desc
      """)
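For readers who prefer the DataFrame API, the same ranking can be sketched with a window function; this is equivalent logic, not the original slide's code:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, rank, substring}

// Count names per (year, first letter)
val byLetter = insightData
  .withColumn("firstLetter", substring(col("name"), 1, 1))
  .groupBy("year", "firstLetter")
  .agg(count("*").as("cnt"))

// Rank letters within each year by frequency and keep only the top one
val w = Window.partitionBy("year").orderBy(col("cnt").desc)
val dis = byLetter
  .withColumn("ranks", rank().over(w))
  .filter(col("ranks") === 1)
  .orderBy(col("year").desc)
```

rank() (rather than row_number()) keeps ties, matching the SQL version above.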

  12. Now let's visualize this data using the graphs available in Databricks.

  13. (Chart slide: the processed data visualized in Databricks.)

  14. Apache Spark Integration Services With 15+ years in data analytics technology services, Aegis Softwares Canada's experts offer a wide range of Apache Spark implementation, integration, and development solutions, along with 24/7 support.

  15. AEGIS SOFTWARE OFFSHORE SOFTWARE DEVELOPMENT COMPANY INDIA (Head Office) 319, 3rd Floor, Golden Plaza, Tagore Road, Rajkot – 360001 Gujarat, India CANADA (Branch Office) 2 Robert Speck Parkway, Suite 750, Mississauga, Ontario L4Z1H8, Canada. info@aegissoftwares.com www.aegissoftwares.com

  16. Thank you
