A Big Data Processing Pipeline for Cloud Environments
Supervisor:
Miguel Rodrigues
Submitted on: Saturday 14th February 2015
This era is characterized by the collection of massive amounts of information in domains ranging from science and technology to commerce, national defense and telecommunications. Indeed, giants such as Google, Facebook, Yahoo!, Microsoft and others hold data on the order of exabytes, and social media platforms (e.g., Facebook, YouTube, Twitter) have grown beyond anyone’s wildest imagination, connecting hundreds of millions of users.
The magnitude of this data deluge can also be illustrated by noting that, every minute, more than 100 hours of video are uploaded to YouTube, more than 200,000 photos are uploaded to Facebook, more than 100,000 tweets are posted on Twitter, and 2 million Google search requests are made.
It is widely acknowledged that the effective use of such amounts of Big Data has the potential to transform not only science – with expectations surging for new scientific discoveries pinned on emerging massive collections of biological, chemical, physical, and social data – but also economies, by delivering a new wave of productivity growth and consumer surplus enabled by more adaptive, personalized and robust technology. The McKinsey Global Institute, for example, points to a potential annual value of $300 billion to US healthcare and €250 billion to Europe’s public sector.
However, with worldwide data volumes projected to grow substantially over the coming years, from 1.2 ZB in 2010 to 8.5 ZB in 2015 and an estimated 44 ZB in 2020, one is confronted with two major, inter-related information bottleneck challenges:
1. It is often argued that today’s infrastructure – without radical changes to the way one collects, analyzes and processes information – will not be able to keep up with the current pace of data growth in view of fundamental limitations in transmission, storage and computation capability (cf. Kryder’s Law, Moore’s Law);
2. It is also argued that big data may just be too big to enable the extraction of meaningful, reliable and actionable information to support decision-making (within a useful time span).
The goal of this research project is to develop a big data processing pipeline that leverages recent advances in big data inference algorithms, with a view to integrating it into cloud-type environments.
The main elements of this research programme include:
1. Familiarization with state-of-the-art algorithms for big data inference, including data reconstruction and classification algorithms;
2. Implementation, testing and benchmarking of the big data inference algorithms using high-level programming languages (see the sketch after this list);
3. Implementation of a big data processing pipeline for cloud environments.
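To give a concrete flavour of items 1 and 2, the sketch below implements and times a simple data reconstruction algorithm in a high-level language. It is a minimal illustration only: Python/NumPy, the iterative soft-thresholding algorithm, and all problem sizes and parameter values are assumptions made for this example and are not prescribed by the project.

import time
import numpy as np

def ista(A, y, lam, n_iter=200):
    """Iterative soft-thresholding for sparse reconstruction:
    minimise 0.5 * ||A x - y||^2 + lam * ||x||_1.
    Illustrative stand-in for a data reconstruction algorithm; not the
    specific algorithm to be studied in this project."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - A.T @ (A @ x - y) / L      # gradient step on the smooth term
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return x

# Toy benchmark: recover a sparse vector from fewer measurements than unknowns.
rng = np.random.default_rng(0)
n, m, k = 400, 150, 10                     # signal length, measurements, sparsity
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ x_true

t0 = time.perf_counter()
x_hat = ista(A, y, lam=0.01)
elapsed = time.perf_counter() - t0
print(f"relative error: {np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true):.3f}")
print(f"runtime: {elapsed:.3f} s")

In the project itself, the same pattern (implement an inference algorithm, test it on synthetic data, record runtime and reconstruction or classification error) would be applied to the state-of-the-art algorithms under study and then scaled up within the cloud pipeline of item 3.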
This project covers new research in the emerging field of Big Data, so high-level publications are anticipated as a possible outcome of the research. The project is therefore well placed to pave the way to a successful career in industry or academia.
Additional Information:
This research work requires students with good programming skills. Potential students are encouraged to contact the proponent to discuss further details about the project.