Comparison of services in Amazon Web Services for Big Data Processing

Big data processing involves processing large volumes of multi-dimensional data, which can be structured or unstructured. Depending on the complexity of the data and of the transformations applied to it, users need to make sure that their infrastructure has enough computational capacity to handle the transformations the organization or business requires. The volume of big data is growing exponentially. Small businesses may not be able to afford the infrastructure needed to process it, because setting it up means spending money upfront, before the business starts to make money. It is becoming an even bigger issue for large businesses: the hardware in their on-premises data centres runs out of computational capacity, and the infrastructure has to be renewed every 2-4 years, depending on the computational complexity of their workloads. To manage this problem, a business can use a cloud service provider; three leading providers are Google Cloud Platform, Amazon Web Services and Azure.

The services a cloud provider offers and their cost are principal factors when choosing a provider. This study compares AWS services for big data processing (EKS, EMR, EMR on EC2, EMR on EKS and EMR Serverless) in terms of time to provision infrastructure, execution time, cost of execution and maintainability of infrastructure. The comparison is made by executing applications on these services. A stock analysis application was implemented in two ways: as an Apache Spark application and as a Presto query. The two implementations were first validated to verify that they produce the same results. The Presto query was then run on Amazon Athena and the Spark application on the rest of the services. During execution, data was collected on the time to provision infrastructure, the execution time and the complexity of creating the infrastructure using shell scripts.
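The measurement procedure described above could be sketched roughly as follows. This is a minimal illustration, not the harness used in the study: the names `RunMetrics`, `measure` and `same_results` are hypothetical, and the `provision` and `execute` callables stand in for whatever shell scripts or API calls create the infrastructure and run the job.

```python
import time
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Hashable, Iterable


@dataclass
class RunMetrics:
    """Timings collected for one run on one service (hypothetical record)."""
    service: str
    provision_seconds: float
    execution_seconds: float


def measure(
    service: str,
    provision: Callable[[], None],
    execute: Callable[[], None],
    clock: Callable[[], float] = time.perf_counter,
) -> RunMetrics:
    """Time the provisioning step and the execution step separately."""
    start = clock()
    provision()          # assumed to block until the infrastructure is ready
    provisioned = clock()
    execute()            # assumed to block until the job has finished
    finished = clock()
    return RunMetrics(service, provisioned - start, finished - provisioned)


def same_results(rows_a: Iterable[Hashable], rows_b: Iterable[Hashable]) -> bool:
    """Check that two implementations return the same rows, ignoring order."""
    return Counter(rows_a) == Counter(rows_b)
```

Passing the clock as a parameter keeps the timing logic testable without real infrastructure. Comparing rows as multisets reflects the assumption that the Spark and Presto implementations may return the same rows in a different order.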

The results indicate that creating and maintaining infrastructure has a cost, and that serverless infrastructure can offer advantages to the user. Of the serverless services compared, Amazon Athena and Amazon EMR Serverless, Athena appears to be cheaper and faster. Among user-created and user-maintained infrastructure, EMR on EC2 was the most efficient and cheapest service on this stock analysis dataset (with the user account limited to a maximum of 64 vCPUs). The study also concludes that serverless infrastructure is easier to maintain.