Big data security analytics

Viewing the Web as a "techno-social system for the interaction between human and technological networks," Karan Patel (2013) has classified the Web into the following five generations:
• Web 1.0 as a Web of information or cognition
• Web 2.0 as a Web of communication
• Web 3.0 as a Web of co-operation
• Web 4.0 as a Web of integration
• Web 5.0 as a Web of decentralized smart communicators
This classification implies that, as the Web evolves, the types of data on the Web become increasingly diverse and sophisticated. Although Patel compared all known generations of the Web in impressive depth, that research has one weakness: it did not identify or analyze the performance bottlenecks in accessing large-scale data in the various generations of the Web. Therefore, to bridge this gap, you are asked to conduct a comparison study on this subject. You will create a short research report comparing the characteristics of the performance bottlenecks that occur when accessing large-scale Web data in each generation of the Web.
Study the reading assignments first, and then write a comparison study report focusing on the following 4 aspects:
• Identify the main sources generating large-scale Web data in each generation of the Web.
• Identify the typical "places" where the performance bottlenecks in accessing large-scale data occur for each generation of the Web.
• Analyze the root causes of these performance bottlenecks.
• Propose your high-level strategies to remove these performance bottlenecks.
Present the justification or rationale for your point of view, and use examples to illustrate it. Find at least 2 references from the Library and the Internet based on your research interests; you need at least 5 references in total for this report, including the 3 references listed in the reading assignments.

In Web 3.0, the Resource Description Framework (RDF) is used for the conceptual description or modeling of information implemented in Web resources because of its ability to represent data in a machine-readable format, and it has been widely applied in knowledge management applications of Web 3.0. An RDF data set consists of a set of RDF triples in the format (subject, predicate, object). To globally and uniquely identify Web resources, RDF uses Uniform Resource Identifiers (URIs) (Brickley & Guha, 2014). This triple format can be translated into a directed labeled graph. Querying RDF triples is technically equivalent to performing a large number of join operations, and traditional relational query processing cannot handle such a large number of "star joins" efficiently. Therefore, many research works have introduced MapReduce into their solutions, applying parallel computing to improve RDF query performance. However, because this subject involves many aspects, these works have taken many different approaches to introducing MapReduce into their solutions (Sakr & Gaber, 2014).
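To make the join-heavy nature of RDF querying concrete, below is a minimal, single-machine Python sketch that mimics the map, shuffle, and reduce phases of a MapReduce star join. The triples, prefixes, and star pattern are invented for illustration; the sketch is not drawn from any particular published solution.

```python
from collections import defaultdict

# Illustrative RDF triples in (subject, predicate, object) form; the ex: and
# foaf: prefixes abbreviate full URIs (hypothetical data, for demonstration).
triples = [
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "ex:worksAt", "ex:acme"),
    ("ex:bob", "foaf:name", '"Bob"'),
    ("ex:bob", "ex:worksAt", "ex:acme"),
]

# A "star" pattern: every triple pattern shares the same subject variable,
# e.g. SELECT ?s WHERE { ?s foaf:name ?n . ?s ex:worksAt ?c . }
star_predicates = {"foaf:name", "ex:worksAt"}

def map_phase(records):
    # Emit (subject, (predicate, object)) so that all triples sharing a
    # subject are routed to the same reducer.
    for s, p, o in records:
        if p in star_predicates:
            yield s, (p, o)

def shuffle(pairs):
    # Group intermediate pairs by key, as the MapReduce runtime would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # A subject satisfies the star join when every predicate of the
    # pattern appears in its group.
    for subject, po_pairs in groups.items():
        if {p for p, _ in po_pairs} == star_predicates:
            yield subject, sorted(po_pairs)

for subject, bindings in reduce_phase(shuffle(map_phase(triples))):
    print(subject, bindings)
```

On a real Hadoop cluster, each star pattern typically costs at least one MapReduce job, so queries containing several joins become chains of jobs with intermediate results materialized between them; reducing this per-job overhead is one of the main points on which the surveyed solutions differ.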
In this assignment, you are asked to conduct a survey of existing research on applying MapReduce to improve RDF data query processing performance. You need to create a short research report comparing at least 2 solutions that take distinct approaches to introducing MapReduce into their RDF data query processing.
Create a survey report with a focus on the following four aspects:
• Identify the main focus of each solution.
• Identify the main technical changes made to the MapReduce framework by each solution.
• Specify the rationale for these technical changes.
• Analyze the pros and cons of each solution.

The increasing popularity of cloud computing and the market competition among cloud computing service providers create an urgent need for an easy and effective performance analysis method suited to a large Infrastructure as a Service (IaaS) cloud computing environment. Such an environment may consist of millions of physical and virtual machines, and service providers must provision and configure their computation resources efficiently to achieve optimal utilization.
Theoretically, the following three approaches can be used to conduct performance analysis on any type of target system:
• Experiment-based performance analysis
• Discrete event simulation-based performance analysis
• Stochastic model-based performance analysis
Because of the very large scale of a cloud computing environment, neither experiment-based performance analysis nor discrete event simulation is cost-effective, given the computing resources and the time required to run the experiments. Therefore, from a cost point of view, stochastic model-based performance analysis becomes the natural choice.
However, the large scale of cloud computing also poses the challenge of building a practical and scalable stochastic model. If a conventional model-building principle is applied, such as capturing as many details of the cloud computing environment as possible in a one-level monolithic model, the resulting state space of the Markov model becomes very large; for example, a monolithic model tracking n machines with k states each has up to k^n states. This may make the generation and solution of such a model prohibitively difficult.
Sakr and Gaber (2014) have proposed a relatively simple yet scalable stochastic model for analyzing the performance of an IaaS cloud computing environment, based on the interactions among several submodels, such that the overall solution is composed by iterating over the individual submodel solutions.
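To illustrate the submodel-interaction idea at a very high level, here is a deliberately simplified Python sketch: two hypothetical submodels exchange quantities (an acceptance probability and a utilization) and are re-solved in turn until the exchanged values reach a fixed point. The submodels, closed forms, and parameter values are all invented for illustration and are far simpler than the actual three-pool model discussed in Sakr and Gaber (2014).

```python
# A minimal sketch of fixed-point iteration over interacting submodels.
# The two submodels, their closed forms, and all parameter values are
# hypothetical; the actual model in Sakr and Gaber (2014) is richer.

ARRIVAL_RATE = 80.0    # job arrivals per hour (assumed value)
SERVICE_RATE = 100.0   # jobs per hour the pool serves when fully available

def provisioning_submodel(utilization):
    # Toy closed form: acceptance probability falls as the pool saturates.
    return max(0.0, 1.0 - utilization ** 2)

def pool_submodel(acceptance_prob):
    # Utilization produced by the admitted share of the arrival stream.
    return min(1.0, ARRIVAL_RATE * acceptance_prob / SERVICE_RATE)

def solve(tol=1e-9, max_iter=1000):
    # Re-solve the two small submodels in turn, feeding each one's output
    # into the other, until the exchanged quantities stop changing.
    # No monolithic state space is ever constructed.
    utilization = 0.5  # arbitrary starting guess
    for _ in range(max_iter):
        acceptance = provisioning_submodel(utilization)
        new_utilization = pool_submodel(acceptance)
        if abs(new_utilization - utilization) < tol:
            return acceptance, new_utilization
        utilization = new_utilization
    raise RuntimeError("submodel iteration did not converge")

acceptance, utilization = solve()
print(f"acceptance={acceptance:.4f}, utilization={utilization:.4f}")
```

The point of the sketch is that each iteration solves only small submodels, so the overall cost grows with the number of submodels rather than with the product of their state spaces.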
Create a study report focusing on the following aspects:
• Identify the main concept of the stochastic model proposed by Sakr and Gaber.
• Discuss why the three-pool cloud architecture-based stochastic model proposed by Sakr and Gaber can be scalable and practical for performance analysis of an IaaS cloud computing environment.
• Discuss the main limitation of the three-pool cloud architecture model and how to remove the limitation.
• Discuss the potential applications that can be developed based on this type of performance analysis approach.

Stream processing engines share the following characteristics:
• Data source: Data are generated from multiple sources and input asynchronously to multiple servers for processing.
• Data processing: Jobs from stream processing applications run continuously from job submission until cancellation.
• Query on the fly: Queries in streaming applications are continuous; they run against the input data on the fly and provide prompt results.
• Architecture: Most stream processing engines use a centralized architecture, whereas some recent stream processing engines use a distributed architecture.
• Scalability: Various components, such as a box processor, load shedder, priority scheduler, and local optimizer, are used to adjust the processing load to achieve performance optimization.
Due to these special characteristics of stream processing, MapReduce and its open-source implementation Hadoop, which are designed mainly for batch processing tasks, are not adequate for supporting real-time stream processing. Therefore, many development efforts have produced stream processing engines whose components meet these special requirements on a parallel and scalable computation architecture. Many of them are presented in Sakr and Gaber (2014).
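To illustrate the "query on the fly" characteristic listed above (a continuous query that emits results as data arrives, rather than a batch job that scans a finished dataset), here is a minimal, framework-free Python sketch of a tumbling-window count. The event stream, window size, and event types are invented for illustration; real engines add the distributed execution, fault tolerance, and load-shedding machinery discussed in Sakr and Gaber (2014).

```python
from collections import Counter

WINDOW_SECONDS = 10  # tumbling window size (assumed for illustration)

def tumbling_window_counts(events):
    """Continuously emit per-window event-type counts from an unbounded stream.

    `events` is an iterator of (timestamp_seconds, event_type) pairs in
    arrival order; results are emitted as soon as each window closes,
    without ever materializing the full dataset (unlike a batch job).
    """
    window_start, counts = None, Counter()
    for ts, event_type in events:
        if window_start is None:
            window_start = ts - (ts % WINDOW_SECONDS)
        while ts >= window_start + WINDOW_SECONDS:
            yield window_start, dict(counts)  # window closed: emit promptly
            window_start += WINDOW_SECONDS
            counts = Counter()
        counts[event_type] += 1

# Example: a short, illustrative event stream.
stream = [(1, "login"), (4, "click"), (9, "click"), (12, "login"), (25, "click")]
for start, counts in tumbling_window_counts(stream):
    print(f"window [{start}, {start + WINDOW_SECONDS}): {counts}")
```

Note that results for a window are emitted as soon as the first event beyond it arrives; a batch job, by contrast, could not produce any output until the entire input had been collected.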
Study the reading assignments first, and write a comparison study report focusing on the following aspects:
• Identify 2 different stream processing systems, and specify the main technical features of each system.
• Compare them based on the following 3 criteria:
o System architecture
o Performance optimization capability
o Scalability
• Identify the performance bottlenecks in processing large-scale streaming datasets for each system.
• Analyze the root causes of the performance bottlenecks for each system.
• Propose your high-level strategies to remove these performance bottlenecks.

Identify several existing big data security analytics tools, and create a comparison report focusing on the following aspects:
• Identify the main functions of each tool, such as anomaly detection, event correlation, and real-time analytics capabilities.
• Assess whether each tool is suited to, or already used in, a cloud computing environment.
• Identify the targeted applications or security problems that each tool is best suited to solve (e.g., Web applications, financial applications, insider threat analysis, full-spectrum fraud detection, and Internet-scale botnet discovery).
• Briefly describe the key design considerations for scalability of each tool.
• Discuss the pros and cons of each tool.