9–11 Oct 2018
Lisbon
Europe/Lisbon timezone

Deep Learning for Predicting the Popularity of Datasets

10 Oct 2018, 16:45
15m
Lisbon

ISCTE, University of Lisbon
Presentation
Area 3. Computing and Virtual Research Environments
Computing Services Part II

Speaker

Mrs Nina Zimmermann (Univ. of Applied Sciences (HTW) Berlin)

Description

Accessing datasets stored on tape drives is comparatively time-consuming. Therefore, a certain fraction of all datasets is usually kept on a cache storage built of hard disks. Caching algorithms are used to identify popular datasets and to move them in advance from tape to the cache storage. In general, there is a considerable gap between the effectiveness of traditional caching algorithms and that of the optimal (or Belady) caching algorithm, which evicts the item whose next access lies furthest in the future and therefore requires knowledge of future accesses. It seems unlikely that this gap can be reduced significantly by tuning traditional caching algorithms. The aim of our project is to explore whether popular datasets can be identified more accurately by applying deep learning methods.

Training a neural network is time-consuming, in particular if the training sets are large. The ATLAS experiment at the Large Hadron Collider (LHC) records every access to its datasets in log files. Many parameters are saved, such as the name of the file, the name of the dataset the file belongs to, the tool used for accessing the file, and the access time. In total, log data on the order of 0.5 TB are stored per month. Applying deep learning techniques to datasets of this size requires a scalable infrastructure. Several approaches have been proposed to speed up the training of neural networks, for example the use of specialized processors such as GPUs or TPUs.

We designed a cluster of containers for running neural networks in parallel. The cluster makes it possible to investigate different distributed deep learning strategies, e.g. data parallelism and model parallelism. To distribute files across the nodes of the cluster and to train neural networks in parallel, the big data analytics frameworks Apache Flink and Apache Spark are used. The talk gives an overview of the current status of our project. The machine learning workflow running on the cluster system is presented. First results obtained by applying a Convolutional Neural Network to a small subset of ATLAS log data are shown. The speedup of different parallelization strategies is evaluated. An outlook on ongoing work will be given.
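
The gap between a traditional caching algorithm and the optimal (Belady) algorithm can be illustrated with a small sketch. The code below is not taken from the project; it compares LRU, chosen here as one representative traditional algorithm, with Belady's rule on a made-up access trace. The trace, the dataset names, the cache capacity and the function names `lru_hits` and `belady_hits` are all assumptions for illustration.

```python
from collections import OrderedDict


def lru_hits(trace, capacity):
    """Count cache hits when the least recently used dataset is evicted."""
    cache = OrderedDict()
    hits = 0
    for name in trace:
        if name in cache:
            hits += 1
            cache.move_to_end(name)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[name] = True
    return hits


def belady_hits(trace, capacity):
    """Count cache hits when the dataset needed furthest in the future is evicted.

    This needs the complete trace in advance, which is why it only serves
    as an upper bound in practice.
    """
    # next_use[i] = position of the next access to trace[i] after i (inf if none)
    next_use = [0] * len(trace)
    seen_later = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = seen_later.get(trace[i], float("inf"))
        seen_later[trace[i]] = i

    cache = {}  # dataset name -> position of its next access
    hits = 0
    for i, name in enumerate(trace):
        if name in cache:
            hits += 1
        elif len(cache) >= capacity:
            victim = max(cache, key=cache.get)  # furthest next access
            del cache[victim]
        cache[name] = next_use[i]
    return hits


# Hypothetical trace: three datasets accessed cyclically, cache holds two.
trace = ["A", "B", "C"] * 4
print("LRU hits:   ", lru_hits(trace, 2))
print("Belady hits:", belady_hits(trace, 2))
```

A cyclic trace that is slightly larger than the cache is a known worst case for LRU, so the difference is deliberately exaggerated here. A learned popularity predictor would aim to approximate the "next access" information that Belady's rule assumes is available.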

Summary

There are many "traditional" caching algorithms for determining the popularity of datasets. Our project explores whether popular datasets can be identified more accurately by applying deep learning methods. We have developed a cluster of containers for comparing different distributed deep learning strategies. The talk gives an overview of the current status of our project and presents first results.
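
As a rough illustration of one of the distributed strategies mentioned, data parallelism, the sketch below splits a batch into equal shards, computes a gradient per shard, and averages the results before updating the model. It is a toy single-process example with a one-parameter linear model, not the project's Flink or Spark workflow; the data, the learning rate and the function names `gradient` and `data_parallel_step` are assumptions.

```python
def gradient(w, shard):
    """Gradient of the mean squared error of the model y = w * x on one shard."""
    return sum(2 * x * (w * x - y) for x, y in shard) / len(shard)


def data_parallel_step(w, data, n_workers, lr):
    """One training step: each worker handles one shard, gradients are averaged."""
    if len(data) % n_workers != 0:
        raise ValueError("this sketch assumes equally sized shards")
    size = len(data) // n_workers
    shards = [data[k * size:(k + 1) * size] for k in range(n_workers)]
    # On a real cluster each gradient would be computed on a separate node.
    grads = [gradient(w, shard) for shard in shards]
    return w - lr * sum(grads) / n_workers


# Hypothetical training data generated from y = 3 * x.
data = [(float(x), 3.0 * x) for x in range(1, 9)]

w = 0.0
for _ in range(50):
    w = data_parallel_step(w, data, n_workers=4, lr=0.01)
print("learned weight:", w)
```

With equally sized shards, the averaged gradient equals the gradient over the full batch, so the number of workers changes where the work is done but not the result of the update. Model parallelism, by contrast, would split the network itself across nodes and is not shown here.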

Type of abstract: Presentation

Primary author

Mrs Nina Zimmermann (Univ. of Applied Sciences (HTW) Berlin)

Co-authors

Mr Daniel Nagel (Univ. of Applied Sciences (HTW) Berlin)
Mr Florian Thom (Univ. of Applied Sciences (HTW) Berlin)
Mr Hannes Fuchs (Univ. of Applied Sciences (HTW) Berlin)
Prof. Hermann Hessling (Univ. of Applied Sciences (HTW) Berlin)
Mr Marco Strutz (Univ. of Applied Sciences (HTW) Berlin)
Mr Maximilian Menzel (Univ. of Applied Sciences (HTW) Berlin)
Mr Tobias Wochinger (Univ. of Applied Sciences (HTW) Berlin)
