Contact Details

My office is in Building 202, Room 114, Bar-Ilan University

ExplainED: Explanations for EDA Notebooks

Daniel Deutch, Amir Gilad, Tova Milo, Amit Somech
Demo Paper VLDB, 2020


Exploratory Data Analysis (EDA) is an essential yet highly demanding task. To get a head start before exploring a new dataset, data scientists often prefer to view existing EDA notebooks - illustrative exploratory sessions that were created by fellow data scientists who examined the same dataset and shared their notebooks via online platforms. Unfortunately, creating an illustrative, well-documented notebook is cumbersome and time-consuming; therefore, users sometimes share their notebooks without explaining their exploratory steps and their results. Such notebooks are difficult to follow and to understand. To address this, we present ExplainED, a system that automatically attaches explanations to views in EDA notebooks. ExplainED analyzes each view in order to detect which of its elements are particularly interesting, and produces a corresponding textual explanation. The explanations are generated by first evaluating the interestingness of the given view using several measures capturing different interestingness facets, then computing the Shapley values of the elements in the view, w.r.t. the interestingness measure yielding the highest score. These Shapley values are then used to guide the generation of the textual explanation. We demonstrate the usefulness of the explanations generated by ExplainED on real-life, undocumented EDA notebooks.
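
To illustrate the idea of attributing a view's interestingness to its elements, here is a minimal sketch that computes exact Shapley values by averaging each element's marginal contribution over all orderings. It is not the ExplainED implementation: the example view, the `spread` measure (range of the values), and all function names are hypothetical, chosen only to keep the example small.

```python
from itertools import permutations
from math import factorial


def shapley_values(elements, value):
    """Exact Shapley values: average marginal contribution over all orderings.

    `value` maps a frozenset of elements to a number. The cost is factorial
    in the number of elements, so this is only usable for tiny views.
    """
    totals = {e: 0.0 for e in elements}
    for order in permutations(elements):
        seen = []
        prev = value(frozenset())
        for e in order:
            seen.append(e)
            cur = value(frozenset(seen))
            totals[e] += cur - prev  # marginal contribution of e in this ordering
            prev = cur
    n_orders = factorial(len(elements))
    return {e: t / n_orders for e, t in totals.items()}


# Hypothetical view: the result of a group-by count (made-up numbers).
view = {"US": 120, "UK": 15, "FR": 12, "DE": 13}


def spread(rows):
    """Toy interestingness measure (an assumption): range of the values shown."""
    vals = [view[r] for r in rows]
    return max(vals) - min(vals) if vals else 0.0


phi = shapley_values(list(view), spread)
top = max(phi, key=phi.get)
print(f"Element contributing most to the view's interestingness: {top}")
```

In this toy view the outlier row receives the largest share, which is the kind of signal a textual explanation could then be built around.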

Automatically Generating Data Exploration Sessions Using Deep Reinforcement Learning

Ori Bar El, Tova Milo, Amit Somech
Conference Paper SIGMOD, 2020


Exploratory Data Analysis (EDA) is an essential yet highly demanding task. To get a head start before exploring a new dataset, data scientists often prefer to view existing EDA notebooks - illustrative, curated exploratory sessions, on the same dataset, that were created by fellow data scientists who shared them online. Unfortunately, such notebooks are not always available (e.g., if the dataset is new or confidential). To address this, we present ATENA, a system that takes an input dataset and auto-generates a compelling exploratory session, presented in an EDA notebook. We shape EDA into a control problem, and devise a novel Deep Reinforcement Learning (DRL) architecture to effectively optimize the notebook generation. Though ATENA uses a limited set of EDA operations, our experiments show that it generates useful EDA notebooks, allowing users to gain actual insights.

Automating Exploratory Data Analysis via Machine Learning: An Overview

Tova Milo, Amit Somech
Conference Tutorial SIGMOD, 2020


Exploratory Data Analysis (EDA) is an important initial step for any knowledge discovery process, in which data scientists interactively explore unfamiliar datasets by issuing a sequence of analysis operations (e.g. filter, aggregation, and visualization). Since EDA has long been known as a difficult task, requiring profound analytical skills, experience, and domain knowledge, a plethora of systems have been devised over the last decade in order to facilitate EDA. In particular, advancements in machine learning research have created exciting opportunities, not only for better facilitating EDA, but also for fully automating the process. In this tutorial, we review recent lines of work for automating EDA, starting from recommender systems for suggesting a single exploratory action, going through kNN-based classifiers and active-learning methods for predicting users' interestingness preferences, and finally reaching full automation of EDA using state-of-the-art methods such as deep reinforcement learning and sequence-to-sequence models. We conclude the tutorial with a discussion on the main challenges and open questions to be dealt with in order to ultimately reduce the manual effort required for EDA.

Incremental Based Top-k Similarity Search Framework for Interactive-Data-Analysis Sessions

Oded Elbaz, Tova Milo, Amit Somech
Conference Paper EDBT, 2020


Interactive Data Analysis (IDA) is a core knowledge-discovery process, in which data scientists explore datasets by issuing a sequence of data analysis actions (e.g. filter, aggregation, visualization), referred to as a session. Since IDA is a challenging task, special recommendation systems were devised in previous work, aimed to assist users in choosing the next analysis action to perform at each point in the session. Such systems often record previous IDA sessions and utilize them to generate next-action recommendations. To do so, a compound, dedicated session-similarity measure is employed to find the top-k sessions most similar to the session of the current user. Clearly, the efficiency of the top-k similarity search is critical to retain interactive response times. However, optimizing this search is challenging due to the non-metric nature of the session similarity measure.

To address this problem we exploit a key property of IDA, which is that the user session progresses incrementally, with the top-k similarity search performed, by the recommender system, at each step. We devise efficient top-k algorithms that harness the incremental nature of the problem to speed up the similarity search, employing a novel, effective filter-and-refine method. Our experiments demonstrate the efficiency of our solution, obtaining a running-time speedup of over 180X compared to a sequential similarity search.
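
The general filter-and-refine pattern mentioned above can be sketched as follows. This is a generic illustration only, not the incremental algorithms of the paper: the session similarity used here (longest common subsequence, normalized by the longer session) and its cheap upper bound (multiset overlap) are assumptions made for the example, and all names are hypothetical.

```python
import heapq
from collections import Counter


def lcs_len(a, b):
    """Length of the longest common subsequence (quadratic dynamic program)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def exact_sim(a, b):
    """Hypothetical 'expensive' session similarity in [0, 1]."""
    return lcs_len(a, b) / max(len(a), len(b), 1)


def upper_bound(a, b):
    """Cheap bound: an LCS can never be longer than the multiset overlap."""
    overlap = sum((Counter(a) & Counter(b)).values())
    return overlap / max(len(a), len(b), 1)


def top_k(query, sessions, k):
    """Return the k best (similarity, index) pairs and the number of exact computations."""
    # Filter step: order candidates by the cheap upper bound, best first.
    ranked = sorted(((upper_bound(query, s), i) for i, s in enumerate(sessions)),
                    reverse=True)
    heap = []  # min-heap holding the current top-k exact scores
    refined = 0
    for ub, i in ranked:
        # No remaining candidate can beat the current k-th best score.
        if len(heap) == k and ub <= heap[0][0]:
            break
        # Refine step: pay for the exact similarity only when needed.
        sim = exact_sim(query, sessions[i])
        refined += 1
        if len(heap) < k:
            heapq.heappush(heap, (sim, i))
        elif sim > heap[0][0]:
            heapq.heapreplace(heap, (sim, i))
    return sorted(heap, reverse=True), refined


query = ["filter", "group", "agg", "viz"]
sessions = [
    ["filter", "group", "agg", "viz"],
    ["viz", "agg", "group", "filter"],
    ["filter", "group", "sort"],
    ["sort", "join"],
    ["join"],
]
best, refined = top_k(query, sessions, k=2)
print(best, f"exact computations: {refined} of {len(sessions)}")
```

The sketch recomputes everything for each query; exploiting the fact that the user's session grows by one action at a time, as described above, is beyond its scope.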

Towards Autonomous, Hands-Free Data Exploration

Ori Bar El, Tova Milo, Amit Somech
Conference Paper CIDR, 2020


Exploratory Data Analysis (EDA) is an important yet difficult task, currently performed by expert users, as it requires deep understanding of the data domain as well as profound analytical skills. In this work we make the case for the Hands-Free EDA (HFE) paradigm, in which the exploratory process is automatically conducted, requiring little or no human input, as in watching a “video” presenting selected highlights of the dataset. To that end, we suggest an end-to-end visionary system architecture, coupled with a prototype implementation. Our preliminary experimental results demonstrate that HFE is achievable, and lead the way for improvement and optimization research.

ATENA: An Autonomous System for Data Exploration Based on Deep Reinforcement Learning

Ori Bar El, Tova Milo, Amit Somech
Demo Paper CIKM, 2019


Exploratory Data Analysis (EDA) is an important yet challenging task that requires profound analytical skills and familiarity with the data domain. While Deep Reinforcement Learning (DRL) is nowadays used to solve AI challenges previously considered to be intractable, to our knowledge such solutions have not yet been applied to EDA.

In this work we present ATENA, an autonomous system capable of exploring a given dataset by executing a meaningful sequence of EDA operations. ATENA uses a novel DRL architecture, and learns to perform EDA operations by independently interacting with the dataset, without any training data or human assistance. We demonstrate ATENA in the context of cyber security log analysis, where the audience is invited to partake in a data exploration challenge: explore real-life network logs, assisted by ATENA, in order to reveal underlying security attacks hidden in the data.

Declarative User Selection with Soft Constraints

Yael Amsterdamer, Tova Milo, Amit Somech, Brit Youngmann
Conference Paper CIKM, 2019


In applications with large userbases such as crowdsourcing, social networks or recommender systems, selecting users is a common and challenging task. Different applications require different policies for selecting users, and implementing such policies is application-specific and laborious.

To this end, we introduce a novel declarative framework that abstracts common components of the user selection problem, while allowing for domain-specific tuning. The framework is based on an ontology view of user profiles, with respect to which we define a query language for policy specification. Our language extends SPARQL with means for capturing soft constraints which are essential for worker selection. At the core of our query engine is then a novel efficient algorithm for handling these constraints. Our experimental study on real-life data indicates the effectiveness and flexibility of our approach, showing in particular that it outperforms existing task-specific solutions in prominent user selection tasks.

3 Lessons Learned from Implementing a Deep Reinforcement Learning Framework for Data Exploration

Ori Bar El, Tova Milo, Amit Somech
Workshop Paper AIDB@VLDB, 2019


We examine the opportunities and the challenges that stem from implementing a Deep Reinforcement Learning (DRL) framework for Exploratory Data Analysis (EDA). We have dedicated considerable effort to the design and development of a DRL system that can autonomously explore a given dataset, by performing an entire sequence of analysis operations that highlight interesting aspects of the data.

In this work, we describe our system design and development process, particularly delving into the major challenges we encountered and eventually overcame. We focus on three important lessons we learned, one for each principal component of the system: (1) designing a DRL environment for EDA, comprising a machine-readable encoding for analysis operations and result-sets, (2) formulating a reward mechanism for exploratory sessions, then further tuning it to elicit a desired output, and (3) designing an efficient neural network architecture, capable of effectively choosing between hundreds of thousands of distinct analysis operations. We believe that the lessons we learned may be useful for members of the database community making their first steps in applying DRL techniques to their problem domains.

Predicting "What is Interesting" by Mining Interactive-Data-Analysis Session Logs

Tova Milo, Chai Ozeri, Amit Somech
Conference Paper EDBT, 2019


Assessing the interestingness of data analysis actions has been the subject of extensive previous work, and a multitude of interestingness measures have been devised, each capturing a different facet of the broad concept. While such measures are a core component in many analysis platforms (e.g., for ranking association rules, recommending visualizations, and query formulation), choosing the most adequate measure for a specific analysis task or an application domain is known to be a difficult task.

In this work we focus on the choice of interestingness measures particularly for Interactive Data Analysis (IDA), where users examine datasets by performing sessions of analysis actions. Our goal is to determine the most suitable interestingness measure that adequately captures the user’s current interest at each step of an interactive analysis session. We propose a novel solution that is based on the mining of IDA session logs. First, we perform an offline analysis of the logs, and identify unique characteristics of interestingness in IDA sessions. We then define a classification problem and build a predictive model that can select the best measure for a given state of a user session. Our experimental evaluation, performed over real-life session logs, demonstrates the sensibility and adequacy of our approach.

Boosting SimRank with Semantics

Tova Milo, Amit Somech, Brit Youngmann
Conference Paper EDBT, 2019


The problem of estimating the similarity of a pair of nodes in an information network draws extensive interest in numerous fields, e.g., social networks and recommender systems. In this work we revisit SimRank, a popular and well-studied similarity measure for information networks that quantifies the similarity of two nodes based on the similarity of their neighbors. SimRank’s popularity stems from its simple, declarative definition and its efficient, scalable computation. However, despite its wide adoption, it has been observed that for many applications SimRank may yield inaccurate similarity estimations, because it focuses on the network structure and ignores the semantics conveyed in the node/edge labels. Therefore, the question that we ask is: can SimRank be enriched with semantics while preserving its advantages?

We answer the question positively and present SemSim, a modular variant of SimRank that allows injecting into the computation any semantic similarity measure that satisfies three natural conditions. The probabilistic framework that we develop for SemSim is anchored in a careful modification of SimRank’s underlying random surfer model. It employs Importance Sampling along with a novel pruning technique, based on unique properties of SemSim. Our framework yields execution times essentially on par with the (semantic-less) SimRank, while maintaining a negligible error rate, and facilitates direct adaptation of existing SimRank optimizations. Our experiments demonstrate the robustness of SemSim, even compared to task-dedicated measures.
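
For readers unfamiliar with the baseline, the following is a minimal sketch of plain SimRank computed by naive fixed-point iteration: two distinct nodes are similar to the extent that their in-neighbors are similar, and every node has similarity 1 with itself. It does not include the semantic extension, sampling, or pruning of SemSim; the function name, the toy graph, and the default decay factor of 0.8 are assumptions made for the example.

```python
def simrank(in_neighbors, decay=0.8, iterations=10):
    """Naive iterative SimRank.

    `in_neighbors` maps every node to the list of its in-neighbors;
    each in-neighbor must itself appear as a key.
    """
    nodes = list(in_neighbors)
    # Start from the identity: a node is fully similar only to itself.
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    # A node without in-neighbors is similar to no other node.
                    new[a][b] = 0.0
                    continue
                # Average similarity over all pairs of in-neighbors, then decay.
                total = sum(sim[i][j] for i in ia for j in ib)
                new[a][b] = decay * total / (len(ia) * len(ib))
        sim = new
    return sim


# Toy graph (hypothetical): u points to both a and b; c is isolated.
graph = {"u": [], "a": ["u"], "b": ["u"], "c": []}
scores = simrank(graph)
print(scores["a"]["b"], scores["a"]["c"])
```

Because the computation looks only at edges, two nodes with identical neighborhoods but unrelated labels get the same score as two closely related ones, which is the limitation described above.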

SimMeme: A Search Engine for Internet Memes

Tova Milo, Amit Somech, Brit Youngmann
Conference Paper ICDE, 2019


As more and more social network users interact through Internet Memes, an emerging popular type of captioned images, there is a growing need for users to quickly retrieve the right Meme for a given situation. As opposed to conventional image search, visually similar Memes may reflect different concepts. Intent is sometimes captured by user annotations (e.g., tags), but these are often incomplete and ambiguous. Thus, a deeper analysis of the relations among Memes is required for an accurate, custom search.

To address this problem, we present SimMeme, a Meme-dedicated search engine. SimMeme uses a generic graph-based data model that aligns various types of information about the Memes with a semantic ontology. A novel similarity measure that effectively considers all incorporated data is employed and serves as the foundation of our system. Our experimental results, obtained using common evaluation metrics and crowd feedback over a large repository of real-life annotated Memes, show that in the task of Meme retrieval, SimMeme outperforms state-of-the-art solutions for image retrieval.

Deep Reinforcement-Learning Framework for Exploratory Data Analysis

Tova Milo, Amit Somech
Workshop Paper AIDM@SIGMOD, 2018


Deep Reinforcement Learning (DRL) is unanimously considered a breakthrough technology, used in solving a growing number of AI challenges previously considered to be intractable. In this work, we aim to set the ground for employing DRL techniques in the context of Exploratory Data Analysis (EDA), an important yet challenging task that is critical in many application domains. We suggest an end-to-end framework architecture, coupled with an initial implementation of each component. The goal of this short paper is to encourage the exploration of DRL models and techniques for facilitating a full-fledged, autonomous solution for EDA.

Next-step Suggestions for Modern Interactive Data Analysis Platforms

Tova Milo, Amit Somech
Conference Paper KDD, 2018


Modern Interactive Data Analysis (IDA) platforms, such as Kibana, Splunk, and Tableau, are gradually replacing traditional OLAP/SQL tools, as they allow for easy-to-use data exploration, visualization, and mining, even for users lacking SQL and programming skills. Nevertheless, data analysis is still a difficult task, especially for non-expert users. To that end we present REACT, a recommender system designed for modern IDA platforms. In these platforms, analysis sessions interweave high-level actions of multiple types and operate over diverse datasets. REACT identifies and generalizes relevant (previous) sessions to generate personalized next-action suggestions to the user.

We model the user’s analysis context using a generic tree-based model, where the edges represent the user’s recent actions, and the nodes represent their result “screens”. A dedicated context-similarity metric is employed for efficient indexing and retrieval of relevant candidate next-actions. These are then generalized to abstract actions that convey common fragments, then adapted to the specific user context. To prove the utility of REACT we performed an extensive online and offline experimental evaluation over real-world analysis logs from the cyber security domain, which we also publish to serve as a benchmark dataset for future work.

December: A Declarative Tool for Crowd Member Selection

Yael Amsterdamer, Tova Milo, Amit Somech, Brit Youngmann
Demo Paper VLDB, 2016


Adequate crowd selection is an important factor in the success of crowdsourcing platforms, increasing the quality and relevance of crowd answers and their performance in different tasks. The optimal crowd selection can greatly vary depending on properties of the crowd and of the task. To this end, we present December, a declarative platform with novel capabilities for flexible crowd selection. December supports the personalized selection of crowd members via a dedicated query language, Member-QL. This language enables specifying and combining common crowd selection criteria such as properties of a crowd member’s profile and history, similarity between profiles in specific aspects and relevance of the member to a given task. This holistic, customizable approach differs from previous work that has mostly focused on dedicated algorithms for crowd selection in specific settings. To allow efficient query execution, we implement novel algorithms in December based on our generic, semantically aware definitions of crowd member similarity and expertise. We demonstrate the effectiveness of December and Member-QL by using the VLDB community as crowd members and allowing conference participants to choose from among these members for different purposes and in different contexts.

REACT: Context-Sensitive Recommendations for Data Analysis

Tova Milo, Amit Somech
Demo Paper SIGMOD, 2016


Data analysis may be a difficult task, especially for non-expert users, as it requires deep understanding of the investigated domain and the particular context. In this demo we present REACT, a system that hooks to the analysis UI and provides the users with personalized recommendations of analysis actions. By matching the current user session to previous sessions of analysts working with the same or other data sets, REACT is able to identify the potentially best next analysis actions in the given user context. Unlike previous work that mainly focused on individual components of the analysis work, REACT provides a holistic approach that captures a wider range of analysis action types by utilizing novel notions of similarity in terms of the individual actions, the analyzed data and the entire analysis workflow.

We demonstrate the functionality of REACT, as well as its effectiveness, through a digital forensics scenario where users are challenged to detect cyber attacks in real-life data obtained from honeypot servers.

Managing General and Individual Knowledge in Crowd Mining Applications

Yael Amsterdamer, Susan B. Davidson, Anna Kukliansky, Tova Milo, Slava Novgorodov, Amit Somech
Conference Paper CIDR, 2015


Crowd mining frameworks combine general knowledge, which can refer to an ontology or information in a database, with individual knowledge obtained from the crowd, which captures habits and preferences. To account for such mixed knowledge, along with user interaction and optimization issues, such frameworks must employ a complex process of reasoning, automatic crowd task generation and result analysis. In this paper, we describe a generic architecture for crowd mining applications. This architecture allows us to examine and compare the components of existing crowdsourcing systems and point out extensions required by crowd mining. It also highlights new research challenges and potential reuse of existing techniques/components. We exemplify this for the OASSIS project and for other prominent crowdsourcing frameworks.

OASSIS: Query Driven Crowd Mining

Yael Amsterdamer, Susan B. Davidson, Tova Milo, Slava Novgorodov, Amit Somech
Conference Paper SIGMOD, 2014


Crowd data sourcing is increasingly used to gather information from the crowd and to obtain recommendations. In this paper, we explore a novel approach that broadens crowd data sourcing by enabling users to pose general questions, to mine the crowd for potentially relevant data, and to receive concise, relevant answers that represent frequent, significant data patterns. Our approach is based on (1) a simple generic model that captures both ontological knowledge as well as the individual history or habits of crowd members from which frequent patterns are mined; (2) a query language in which users can declaratively specify their information needs and the data patterns of interest; (3) an efficient query evaluation algorithm, which enables mining semantically concise answers while minimizing the number of questions posed to the crowd; and (4) an implementation of these ideas that mines the crowd through an interactive user interface. Experimental results with both real-life crowd and synthetic data demonstrate the feasibility and effectiveness of the approach.

Ontology Assisted Crowd Mining

Yael Amsterdamer, Susan B. Davidson, Tova Milo, Slava Novgorodov, Amit Somech
Demo Paper VLDB, 2014


We present OASSIS (for Ontology ASSISted crowd mining), a prototype system which allows users to declaratively specify their information needs, and mines the crowd for answers. The answers that the system computes are concise and relevant, and represent frequent, significant data patterns. The system is based on (1) a generic model that captures both ontological knowledge, as well as the individual knowledge of crowd members from which frequent patterns are mined; (2) a query language in which users can specify their information needs and types of data patterns they seek; and (3) an efficient query evaluation algorithm, for mining semantically concise answers while minimizing the number of questions posed to the crowd.

Currently Teaching

  • 2021 - Now

    Data Science Workshop

    The purpose of the workshop is to provide students with practical experience in data science. During the workshop, students will carry out a large-scale project in which they will select a dataset and a prediction task, then perform a complete data science process that includes: defining the prediction problem and evaluation metrics, cleaning the data, selecting and creating features, selecting the right model, and performing parameter tuning. Upon completing a basic prediction model, the students will perform a model quality analysis, then continuously improve the model to obtain better results.

  • Soon (2021)

    Tabular Data Science

    This course provides an in-depth review of the data scientific pipeline from a data-centric perspective. Focusing on tabular data, we will study several key tasks in the pipeline that facilitate insights and knowledge extraction: from data cleaning, visualization, and pattern mining, to interpretability and explainability of predictive models.
    For each topic, we will begin with a high-level overview, then study one or two representative algorithms/methods, and conclude with a case-study example of how these methods are used for extracting insights. Finally, towards the end of the course, we will discuss whether and how the data scientific pipeline can be automated.

  • Soon (2021)

    Advanced Seminar in Automation and ML in Tabular Data Analytics

    Together with the recent, meteoric increase in the amount of tabular data, the need for efficient and rapid analysis of this data has increased as well. More and more organizations, led by "Big Tech" companies, recognize the vast potential in extracting insights and conclusions from their data. In this seminar, we will focus on a young but rapidly evolving field of research, in which scientists, from both academia and industry, develop automated solutions that simplify and facilitate data analysis, science, and mining.

Past Courses

  • 2016 - 2020

    Workshop in Data Science (@Tel Aviv University)

    The workshop focuses on knowledge extraction from raw data, using statistical tools and machine learning algorithms. Participating students are required to design and implement such a system and present their results in class.

  • 2016 - 2019

    Database Systems (@Tel Aviv University)

    The purpose of this course is to provide an introduction to the design and use of database systems. We begin by covering the relational model and the SQL language, then study methods for database design, covering the entity relationship model. Finally, we touch on some advanced topics in database systems. The recitation classes will cover practical topics in database programming.