21 March 2012
Ex Boccherini - Piazza S. Ponziano 6 (Conference Room)
The seminar will introduce the problem of large-scale Web mining using Data Intensive Scalable Computing (DISC) systems. Web mining aims to extract useful information and models from data on the Web, the largest repository ever created. DISC systems are an emerging technology for processing huge datasets in parallel on large computer clusters. Challenges arise from both research themes. The Web is heterogeneous: data lives in various formats that are best modeled in different ways. Effectively extracting information requires careful design of algorithms for specific categories of data. The Web is huge, but DISC systems offer a platform for building scalable solutions. However, they provide restricted computing primitives for the sake of performance. Efficiently harnessing the parallelism offered by DISC systems involves rethinking traditional algorithms. In the seminar I will show how to tackle some classical and new problems in the context of massive-scale Web mining, and how to design efficient MapReduce and streaming algorithms to solve these problems on DISC systems.