Contact Details

My office is in Building 503, Room 203, Bar-Ilan University

EFFECTS: Explorable and Explainable Feature Extraction Framework for Multivariate Time-Series Classification

Ido Ikar, Amit Somech
Demo Paper CIKM, 2023

Abstract

We demonstrate EFFECTS, an automated system for explorable and explainable feature extraction for multivariate time-series classification (MTSC). EFFECTS has a twofold contribution: (1) it significantly facilitates the exploration of MTS data, and (2) it generates informative yet intuitive and explainable features to be used by the classification model. EFFECTS first mines the MTS data and extracts a set of interpretable features using an optimized transform-slice-aggregate process. To evaluate the quality of EFFECTS features, we gauge how well each feature distinguishes between every pair of classes, and how well it characterizes each single class. Users can then explore the MTS data via the EFFECTS Explorer, which facilitates the visual inspection of important features, dimensions, and time slices. Last, the user can use the top features for each class when building a classification pipeline. We demonstrate EFFECTS on several real-world MTSC datasets, inviting the audience to investigate the data via the EFFECTS Explorer and obtain initial insights on the time series data. We then show how EFFECTS features are used in an ML model, obtaining accuracy on par with state-of-the-art MTSC models that do not optimize for explainability.

Cluster-Explorer: Explaining Black-Box Clustering Results

Sariel Tutay, Amit Somech
Demo Paper CIKM, 2023

Abstract

Interpreting clustering results is a challenging, manual task that often requires the user to perform additional analytical queries and visualizations. To this end, we demonstrate Cluster-Explorer, an interactive, easy-to-use framework that provides explanations for black-box clustering results. Cluster-Explorer takes as input the raw dataset alongside cluster labels, and automatically generates multiple coherent explanations that characterize each cluster. We first propose a threefold quality measure that considers the conciseness, cluster coverage, and separation error of an explanation. We tackle the challenge of efficiently computing high-quality explanations using a modified version of a generalized frequent-itemsets mining (gFIM) algorithm. The gFIM algorithm is employed over multiple filter predicates which are extracted by applying various binning methods of different granularities. We implemented Cluster-Explorer as a Python library that can be easily used by data scientists in their ongoing workflows. After employing the clustering pipeline of their choice, Cluster-Explorer opens an integrated, interactive interface for the user to explore the different explanations for each cluster. In our demonstration, the audience is invited to use Cluster-Explorer on numerous real-life datasets and different clustering pipelines and examine the usefulness of the cluster explanations provided by the system, as well as its computational efficiency.
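
The core mining idea, conjunctions of binned attribute predicates scored by cluster coverage and separation error, can be sketched in a few lines (a simplified illustration, not the actual Cluster-Explorer implementation; the equal-width binning, the greedy one-predicate-per-attribute selection, and the `min_coverage` threshold are all assumptions made for the sake of the example):

```python
import numpy as np

def explain_cluster(X, labels, cluster, n_bins=3, min_coverage=0.8):
    """Greedy sketch: per attribute, pick the equal-width bin that covers the
    largest share of the cluster; keep it only if its coverage reaches
    min_coverage (at most one predicate per attribute keeps it concise)."""
    in_c = labels == cluster
    predicates = []
    for j in range(X.shape[1]):
        edges = np.linspace(X[:, j].min(), X[:, j].max(), n_bins + 1)
        best = max(
            ((((X[in_c, j] >= lo) & (X[in_c, j] <= hi)).mean(), lo, hi)
             for lo, hi in zip(edges[:-1], edges[1:])),
            key=lambda t: t[0])
        if best[0] >= min_coverage:
            predicates.append((j, best[1], best[2]))
    # Separation error: share of out-of-cluster rows satisfying the explanation.
    mask = np.ones(len(X), dtype=bool)
    for j, lo, hi in predicates:
        mask &= (X[:, j] >= lo) & (X[:, j] <= hi)
    separation_error = float(mask[~in_c].mean()) if (~in_c).any() else 0.0
    return predicates, separation_error
```

A real gFIM-based miner would enumerate itemsets of predicates over several binning granularities at once; the greedy loop above only conveys the flavor of the coverage/separation trade-off.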

ATENA-PRO: Generating Personalized Exploration Notebooks with Constrained Reinforcement Learning

Tavor Lipman, Tova Milo, Amit Somech
Demo Paper SIGMOD, 2023

Abstract

One of the most common, helpful practices of data scientists, when starting the exploration of a given dataset, is to examine existing data exploration notebooks prepared by other data analysts or scientists. These notebooks contain curated sessions of contextually-related query operations that together demonstrate interesting hypotheses and conjectures on the data. Unfortunately, relevant notebooks, prepared on the same dataset and in light of the same analysis task, are often nonexistent or unavailable. In this work, we describe ATENA-PRO, a framework for auto-generating such relevant, personalized exploratory sessions. Using a novel specification language, users first describe their desired output notebook. Our language contains dedicated constructs for contextually connecting future output queries. These specifications are then used as input for a Deep Reinforcement Learning (DRL) engine, which auto-generates the personalized notebook. Our DRL engine relies on an existing, general-purpose, DRL framework for data exploration. However, augmenting the generic framework with user specifications requires overcoming a difficult sparsity challenge, as only a small portion of the possible sessions may be compliant with the specifications. Inspired by solutions for constrained reinforcement learning, we devise a compound, flexible reward scheme as well as a specification-aware neural network architecture. Our experimental evaluation shows that the combination of these components allows ATENA-PRO to consistently generate interesting, personalized exploration sessions for various analysis tasks and datasets.

Selecting Sub-tables for Data Exploration

Yael Amsterdamer, Susan B. Davidson, Tova Milo, Kathy Razmadze, Amit Somech
Research Paper ICDE, 2023

Abstract

Data scientists frequently examine the raw content of large tables when exploring an unknown dataset. In such cases, small subsets of the full tables (sub-tables) that accurately capture table contents are useful. We present a framework which, given a large data table T, creates a sub-table of small, fixed dimensions by selecting a subset of T’s rows and projecting them over a subset of T’s columns. The question is: Which rows and columns should be selected to yield an informative sub-table? Our first contribution is an informativeness metric for sub-tables with two complementary dimensions: cell coverage, which measures how well the sub-table captures prominent data patterns in T, and diversity. We use association rules as the patterns captured by sub-tables, and show that computing optimal sub-tables directly using this metric is infeasible. We then develop an efficient algorithm that indirectly accounts for association rules using table embedding. The resulting framework produces sub-tables for the full table as well as for the results of queries over the table, enabling the user to quickly understand results and determine subsequent queries. Experimental results show that high-quality sub-tables can be efficiently computed, and verify the soundness of our metrics as well as the usefulness of selected sub-tables through user studies.
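
The cell-coverage dimension can be made concrete with a toy sketch (this is not the paper's algorithm; the pattern miner below uses simple frequent value patterns as stand-ins for association rules, and `min_support` is an assumed parameter):

```python
from itertools import combinations

def mine_patterns(rows, min_support=0.5):
    """Toy frequent-pattern miner over (column, value) cells; these patterns
    stand in for the association rules mined by the actual framework."""
    n = len(rows)
    counts = {}
    for r in rows:
        for cell in enumerate(r):          # cell = (column index, value)
            counts[cell] = counts.get(cell, 0) + 1
    frequent = sorted(c for c, cnt in counts.items() if cnt / n >= min_support)
    patterns = [{c} for c in frequent]
    for pair in combinations(frequent, 2):
        support = sum(all(r[c] == v for c, v in pair) for r in rows) / n
        if support >= min_support:
            patterns.append(set(pair))
    return patterns

def cell_coverage(sub_rows, sub_cols, patterns):
    """Fraction of mined patterns visible in the sub-table defined by the
    selected rows and the selected column indices."""
    def visible(p):
        return any(all(c in sub_cols and r[c] == v for c, v in p)
                   for r in sub_rows)
    return sum(visible(p) for p in patterns) / len(patterns) if patterns else 1.0
```

A sub-table selection procedure would then search for the row/column subsets maximizing this coverage together with a diversity term; the paper sidesteps the infeasible direct optimization via table embeddings.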

FEDEX: An Explainability Framework for Data Exploration Steps

Daniel Deutch, Amir Gilad, Tova Milo, Amit Somech
Research Paper VLDB, 2023

Abstract

When exploring a new dataset, data scientists often apply analysis queries, look for insights in the resulting dataframe, and repeat the process with further queries. We propose in this paper a novel solution that assists data scientists in this laborious process. In a nutshell, our solution pinpoints the most interesting (sets of) rows in each obtained dataframe. Uniquely, our definition of interest is based on the contribution of each row to the interestingness of different columns of the entire dataframe, which, in turn, is defined using standard measures such as diversity and exceptionality. Intuitively, interesting rows are ones that explain why (some column of) the analysis query result is interesting as a whole. Rows are correlated in their contribution, and so the interestingness score for a set of rows may not be directly computed based on that of individual rows. We address the resulting computational challenge by restricting attention to semantically-related sets, based on multiple notions of semantic relatedness; these sets serve as more informative explanations. Our experimental study across multiple real-world datasets shows the usefulness of our system in various scenarios.

SubStrat: A Subset-Based Optimization Strategy for Faster AutoML

Teddy Lazebnik, Amit Somech, Avi Weinberg
Research Paper VLDB, 2023

Abstract

Automated machine learning (AutoML) frameworks are gaining popularity among data scientists as they dramatically reduce the manual work devoted to the construction of ML pipelines while obtaining similar and sometimes even better results than manually-built models. Such frameworks intelligently search among millions of possible ML pipeline configurations to finally retrieve an optimal pipeline in terms of predictive accuracy. However, when the training dataset is large, the construction and evaluation of a single ML pipeline take longer, which makes the overall AutoML running times increasingly high. To this end, we present SubStrat, an AutoML optimization strategy that tackles the dataset size rather than the configurations search space. SubStrat wraps existing AutoML tools, and instead of executing them directly on the large dataset, it uses a genetic-based algorithm to find a small yet representative data subset that preserves characteristics of the original one. SubStrat then employs the AutoML tool on the generated subset, resulting in an intermediate ML pipeline, which is later refined by executing a restricted, much shorter, AutoML process on the large dataset. We evaluate SubStrat on AutoSklearn, TPOT, and H2O, three popular AutoML frameworks, using several real-life datasets.
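
The subset-selection step can be illustrated with a simplified stand-in (this is not SubStrat's genetic algorithm; it swaps in random search, and the entropy-preservation fitness is an assumption used only to convey what a "representative" subset means):

```python
import numpy as np

def column_entropy(col, n_bins=10):
    """Discretized Shannon entropy of a single column."""
    counts, _ = np.histogram(col, bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def subset_fitness(X, idx):
    """How well the row subset idx preserves the per-column entropies of X
    (lower is better); a stand-in for 'preserves characteristics'."""
    full = [column_entropy(X[:, j]) for j in range(X.shape[1])]
    sub = [column_entropy(X[idx, j]) for j in range(X.shape[1])]
    return float(np.mean(np.abs(np.array(full) - np.array(sub))))

def find_subset(X, k, n_candidates=50, seed=0):
    """Random-search stand-in for the genetic search: sample candidate row
    subsets of size k and keep the most representative one."""
    rng = np.random.default_rng(seed)
    best_idx, best_fit = None, float("inf")
    for _ in range(n_candidates):
        idx = rng.choice(len(X), size=k, replace=False)
        fit = subset_fitness(X, idx)
        if fit < best_fit:
            best_idx, best_fit = idx, fit
    return best_idx
```

The AutoML tool would then run on `X[best_idx]` to produce an intermediate pipeline, which a short, restricted AutoML pass over the full dataset refines.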

Demonstrating SubStrat: A Subset-Based Strategy for Faster AutoML on Large Datasets

Teddy Lazebnik, Amit Somech
Demo Paper CIKM, 2022

Abstract

Automated machine learning (AutoML) frameworks are gaining popularity among data scientists as they dramatically reduce the manual work devoted to the construction of ML pipelines while obtaining similar and sometimes even better results than manually-built models. Such frameworks intelligently search among millions of possible ML pipeline configurations to finally retrieve an optimal pipeline in terms of predictive accuracy. However, when the training dataset is large, the construction and evaluation of a single ML pipeline take longer, which makes the overall AutoML running times increasingly high. To this end, we demonstrate SubStrat, an AutoML optimization strategy that tackles the dataset size rather than the configurations search space. SubStrat wraps existing AutoML tools, and instead of executing them directly on the large dataset, it uses a genetic-based algorithm to find a small yet representative data subset that preserves characteristics of the original one. SubStrat then employs the AutoML tool on the generated subset, resulting in an intermediate ML pipeline, which is later refined by executing a restricted, much shorter, AutoML process on the large dataset. We demonstrate SubStrat on AutoSklearn, TPOT, and H2O, three popular AutoML frameworks, using several real-life datasets.

SubTab: Data Exploration with Informative Sub-Tables

Yael Amsterdamer, Susan B. Davidson, Tova Milo, Kathy Razmadze, Amit Somech
Demo Paper SIGMOD, 2022

Abstract

We demonstrate SubTab, a framework for creating small, informative sub-tables of large data tables to speed up data exploration. Given a table with n rows and m columns where n and m are large, SubTab creates a sub-table T_sub with k < n rows and l < m columns, i.e. a subset of k rows of the table projected over a subset of l columns. The rows and columns are chosen as representatives of prominent data patterns within and across columns in the input table. SubTab can also be used for query results, enabling the user to quickly understand the results and determine subsequent queries.

ExplainED: Explanations for EDA Notebooks

Daniel Deutch, Amir Gilad, Tova Milo, Amit Somech
Demo Paper VLDB, 2020

Abstract

Exploratory Data Analysis (EDA) is an essential yet highly demanding task. To get a head start before exploring a new dataset, data scientists often prefer to view existing EDA notebooks - illustrative exploratory sessions that were created by fellow data scientists who examined the same dataset and shared their notebooks via online platforms. Unfortunately, creating an illustrative, well-documented notebook is cumbersome and time-consuming, therefore users sometimes share their notebook without explaining their exploratory steps and their results. Such notebooks are difficult to follow and to understand. To address this, we present ExplainED, a system that automatically attaches explanations to views in EDA notebooks. ExplainED analyzes each view in order to detect what elements thereof are particularly interesting, and produces a corresponding textual explanation. The explanations are generated by first evaluating the interestingness of the given view using several measures capturing different interestingness facets, then computing the Shapley values of the elements in the view, w.r.t. the interestingness measure yielding the highest score. These Shapley values are then used to guide the generation of the textual explanation. We demonstrate the usefulness of the explanations generated by ExplainED on real-life, undocumented EDA notebooks.
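
For small views, the Shapley values can be computed exactly by enumeration; here is a toy sketch (the `exceptionality` set function below is an assumed stand-in for the system's actual interestingness measures):

```python
from itertools import permutations

def shapley(values, v):
    """Exact Shapley values of view elements 0..n-1 under set function v,
    averaging marginal contributions over all orderings (feasible only for
    the small element sets of a single result view)."""
    n = len(values)
    phi = [0.0] * n
    perms = list(permutations(range(n)))
    for perm in perms:
        coalition = []
        for i in perm:
            before = v([values[j] for j in coalition])
            coalition.append(i)
            phi[i] += v([values[j] for j in coalition]) - before
    return [p / len(perms) for p in phi]

# A toy "exceptionality" facet: distance of the set's mean from zero.
def exceptionality(s):
    return abs(sum(s) / len(s)) if s else 0.0
```

Elements with the highest Shapley values are the ones the textual explanation would single out; for larger views one would switch to sampling-based Shapley approximations.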

Automatically Generating Data Exploration Sessions Using Deep Reinforcement Learning

Ori Bar El, Tova Milo, Amit Somech
Research Paper SIGMOD, 2020

Abstract

Exploratory Data Analysis (EDA) is an essential yet highly demanding task. To get a head start before exploring a new dataset, data scientists often prefer to view existing EDA notebooks - illustrative, curated exploratory sessions, on the same dataset, that were created by fellow data scientists who shared them online. Unfortunately, such notebooks are not always available (e.g., if the dataset is new or confidential). To address this, we present ATENA, a system that takes an input dataset and auto-generates a compelling exploratory session, presented in an EDA notebook. We shape EDA into a control problem, and devise a novel Deep Reinforcement Learning (DRL) architecture to effectively optimize the notebook generation. Though ATENA uses a limited set of EDA operations, our experiments show that it generates useful EDA notebooks, allowing users to gain actual insights.

Automating Exploratory Data Analysis via Machine Learning: An Overview

Tova Milo, Amit Somech
Conference Tutorial SIGMOD, 2020

Abstract

Exploratory Data Analysis (EDA) is an important initial step for any knowledge discovery process, in which data scientists interactively explore unfamiliar datasets by issuing a sequence of analysis operations (e.g. filter, aggregation, and visualization). Since EDA is long known as a difficult task, requiring profound analytical skills, experience, and domain knowledge, a plethora of systems have been devised over the last decade in order to facilitate EDA. In particular, advancements in machine learning research have created exciting opportunities, not only for better facilitating EDA, but to fully automate the process. In this tutorial, we review recent lines of work for automating EDA: starting from recommender systems for suggesting a single exploratory action, going through kNN-based classifiers and active-learning methods for predicting users' interestingness preferences, and finally arriving at fully automated EDA using state-of-the-art methods such as deep reinforcement learning and sequence-to-sequence models. We conclude the tutorial with a discussion of the main challenges and open questions to be dealt with in order to ultimately reduce the manual effort required for EDA.

Incremental Based Top-k Similarity Search Framework for Interactive-Data-Analysis Sessions

Oded Elbaz, Tova Milo, Amit Somech
Research Paper EDBT, 2020

Abstract

Interactive Data Analysis (IDA) is a core knowledge-discovery process, in which data scientists explore datasets by issuing a sequence of data analysis actions (e.g. filter, aggregation, visualization), referred to as a session. Since IDA is a challenging task, special recommendation systems were devised in previous work, aimed to assist users in choosing the next analysis action to perform at each point in the session. Such systems often record previous IDA sessions and utilize them to generate next-action recommendations. To do so, a compound, dedicated session-similarity measure is employed to find the top-k sessions most similar to the session of the current user. Clearly, the efficiency of the top-k similarity search is critical to retain interactive response times. However, optimizing this search is challenging due to the non-metric nature of the session similarity measure.

To address this problem we exploit a key property of IDA, which is that the user session progresses incrementally, with the top-k similarity search performed by the recommender system at each step. We devise efficient top-k algorithms that harness the incremental nature of the problem to speed up the similarity search, employing a novel, effective filter-and-refine method. Our experiments demonstrate the efficiency of our solution, obtaining a running-time speedup of over 180X compared to a sequential similarity search.
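
A minimal sketch of the incremental filter-and-refine idea follows (the toy overlap similarity and the `MAX_GAIN` bound are assumptions; the paper's compound session-similarity measure and its bounds are considerably more involved):

```python
import heapq

def sim(session, candidate):
    """Toy session similarity: how many of the user's actions also appear
    in the logged session (a stand-in for the compound, non-metric measure
    used by the recommender)."""
    cset = set(candidate)
    return sum(1 for a in session if a in cset)

class IncrementalTopK:
    """Filter-and-refine sketch: since the user session grows by one action
    per step, the true similarity can increase by at most MAX_GAIN, so the
    cached values from the previous step yield cheap upper bounds; only
    candidates whose bound reaches the current k-th best are recomputed."""
    MAX_GAIN = 1  # appending one action adds at most 1 to the toy measure

    def __init__(self, log, k):
        self.log, self.k = log, k
        self.cache = [0] * len(log)  # similarities for the empty session
        self.refined = 0             # counts full similarity computations

    def step(self, session):
        thresh = sorted(self.cache, reverse=True)[self.k - 1]
        for i, cached in enumerate(self.cache):
            if cached + self.MAX_GAIN >= thresh:           # filter
                self.cache[i] = sim(session, self.log[i])  # refine
                self.refined += 1
        return heapq.nlargest(self.k, range(len(self.log)),
                              key=lambda i: self.cache[i])
```

The bound is valid because the toy similarity is monotone as the session grows; stale cached values remain safe lower bounds, so pruned candidates cannot re-enter the top-k unnoticed.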

Towards Autonomous, Hands-Free Data Exploration

Ori Bar El, Tova Milo, Amit Somech
Vision Paper CIDR, 2020

Abstract

Exploratory Data Analysis (EDA) is an important yet difficult task, currently performed by expert users, as it requires deep understanding of the data domain as well as profound analytical skills. In this work we make the case for the Hands-Free EDA (HFE) paradigm, in which the exploratory process is automatically conducted, requiring little or no human input as in watching a “video” presenting selected highlights of the dataset. To that end, we suggest an end-to-end visionary system architecture, coupled with a prototype implementation. Our preliminary experimental results demonstrate that HFE is achievable, and leads the way for improvement and optimization research.

ATENA: An Autonomous System for Data Exploration Based on Deep Reinforcement Learning

Ori Bar El, Tova Milo, Amit Somech
Demo Paper CIKM, 2019

Abstract

Exploratory Data Analysis (EDA) is an important yet challenging task that requires profound analytical skills and familiarity with the data domain. While Deep Reinforcement Learning (DRL) is nowadays used to solve AI challenges previously considered to be intractable, to our knowledge such solutions have not yet been applied to EDA.

In this work we present ATENA, an autonomous system capable of exploring a given dataset by executing a meaningful sequence of EDA operations. ATENA uses a novel DRL architecture, and learns to perform EDA operations by independently interacting with the dataset, without any training data or human assistance. We demonstrate ATENA in the context of cyber security log analysis, where the audience is invited to partake in a data exploration challenge: explore real-life network logs, assisted by ATENA, in order to reveal underlying security attacks hidden in the data.

Declarative User Selection with Soft Constraints

Yael Amsterdamer, Tova Milo, Amit Somech, Brit Youngmann
Research Paper CIKM, 2019

Abstract

In applications with large userbases such as crowdsourcing, social networks or recommender systems, selecting users is a common and challenging task. Different applications require different policies for selecting users, and implementing such policies is application-specific and laborious.

To this end, we introduce a novel declarative framework that abstracts common components of the user selection problem, while allowing for domain-specific tuning. The framework is based on an ontology view of user profiles, with respect to which we define a query language for policy specification. Our language extends SPARQL with means for capturing soft constraints which are essential for worker selection. At the core of our query engine is then a novel efficient algorithm for handling these constraints. Our experimental study on real-life data indicates the effectiveness and flexibility of our approach, showing in particular that it outperforms existing task-specific solutions in prominent user selection tasks.

3 Lessons Learned from Implementing a Deep Reinforcement Learning Framework for Data Exploration

Ori Bar El, Tova Milo, Amit Somech
Workshop Paper AIDB@VLDB, 2019

Abstract

We examine the opportunities and the challenges that stem from implementing a Deep Reinforcement Learning (DRL) framework for Exploratory Data Analysis (EDA). We have dedicated a considerable effort in the design and the development of a DRL system that can autonomously explore a given dataset, by performing an entire sequence of analysis operations that highlight interesting aspects of the data.

In this work, we describe our system design and development process, particularly delving into the major challenges we encountered and eventually overcame. We focus on three important lessons we learned, one for each principal component of the system: (1) designing a DRL environment for EDA, comprising a machine-readable encoding for analysis operations and result sets; (2) formulating a reward mechanism for exploratory sessions, then further tuning it to elicit a desired output; and (3) designing an efficient neural network architecture, capable of effectively choosing among hundreds of thousands of distinct analysis operations. We believe that the lessons we learned may be useful for the databases community members making their first steps in applying DRL techniques to their problem domains.

Predicting "What is Interesting" by Mining Interactive-Data-Analysis Session Logs

Tova Milo, Chai Ozeri, Amit Somech
Research Paper EDBT, 2019

Abstract

Assessing the interestingness of data analysis actions has been the subject of extensive previous work, and a multitude of interestingness measures have been devised, each capturing a different facet of the broad concept. While such measures are a core component in many analysis platforms (e.g., for ranking association rules, recommending visualizations, and query formulation), choosing the most adequate measure for a specific analysis task or an application domain is known to be a difficult task.

In this work we focus on the choice of interestingness measures particularly for Interactive Data Analysis (IDA), where users examine datasets by performing sessions of analysis actions. Our goal is to determine the most suitable interestingness measure that adequately captures the user’s current interest at each step of an interactive analysis session. We propose a novel solution that is based on the mining of IDA session logs. First, we perform an offline analysis of the logs, and identify unique characteristics of interestingness in IDA sessions. We then define a classification problem and build a predictive model that can select the best measure for a given state of a user session. Our experimental evaluation, performed over real-life session logs, demonstrates the sensibility and adequacy of our approach.

Boosting SimRank with Semantics

Tova Milo, Amit Somech, Brit Youngmann
Research Paper EDBT, 2019

Abstract

The problem of estimating the similarity of a pair of nodes in an information network draws extensive interest in numerous fields, e.g., social networks and recommender systems. In this work we revisit SimRank, a popular and well-studied similarity measure for information networks, that quantifies the similarity of two nodes based on the similarity of their neighbors. SimRank’s popularity stems from its simple, declarative definition and its efficient, scalable computation. However, despite its wide adoption, it has been observed that for many applications SimRank may yield inaccurate similarity estimations, due to the fact that it focuses on the network structure and ignores the semantics conveyed in the node/edge labels. Therefore, the question that we ask is: can SimRank be enriched with semantics while preserving its advantages?

We answer the question positively and present SemSim, a modular variant of SimRank that allows injecting into the computation any semantic similarity measure which satisfies three natural conditions. The probabilistic framework that we develop for SemSim is anchored in a careful modification of SimRank’s underlying random surfer model. It employs Importance Sampling along with a novel pruning technique, based on unique properties of SemSim. Our framework yields execution times essentially on par with the (semantic-less) SimRank, while maintaining a negligible error rate, and facilitates direct adaptation of existing SimRank optimizations. Our experiments demonstrate the robustness of SemSim, even compared to task-dedicated measures.
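
One plausible way to inject a semantic label similarity into the SimRank recursion can be sketched as follows (an illustration only; the exact SemSim formulation, its random-surfer derivation, and its Importance Sampling and pruning optimizations differ):

```python
def semsimrank(in_nbrs, labels, label_sim, C=0.8, iters=10):
    """SimRank-style iteration where each pairwise score is weighted by a
    plug-in label-similarity function (one possible injection point, not
    necessarily SemSim's). in_nbrs[a] lists the in-neighbors of node a."""
    n = len(in_nbrs)
    S = [[1.0 if a == b else 0.0 for b in range(n)] for a in range(n)]
    for _ in range(iters):
        new = [[0.0] * n for _ in range(n)]
        for a in range(n):
            for b in range(n):
                if a == b:
                    new[a][b] = 1.0
                elif in_nbrs[a] and in_nbrs[b]:
                    s = sum(S[i][j] for i in in_nbrs[a] for j in in_nbrs[b])
                    new[a][b] = (C * label_sim(labels[a], labels[b]) * s
                                 / (len(in_nbrs[a]) * len(in_nbrs[b])))
        S = new
    return S
```

With a constant `label_sim` of 1 this degenerates to plain SimRank; a semantically informed `label_sim` down-weights structurally similar but semantically unrelated pairs.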

SimMeme: A Search Engine for Internet Memes

Tova Milo, Amit Somech, Brit Youngmann
Research Paper ICDE, 2019

Abstract

As more and more social network users interact through Internet Memes, an emerging, popular type of captioned images, there is a growing need for users to quickly retrieve the right Meme for a given situation. As opposed to conventional image search, visually similar Memes may reflect different concepts. Intent is sometimes captured by user annotations (e.g., tags), but these are often incomplete and ambiguous. Thus, a deeper analysis of the relations among Memes is required for an accurate, custom search.

To address this problem, we present SimMeme, a Meme-dedicated search engine. SimMeme uses a generic graph-based data model that aligns various types of information about the Memes with a semantic ontology. A novel similarity measure that effectively considers all incorporated data is employed and serves as the foundation of our system. Our experimental results, achieved using common evaluation metrics and crowd feedback over a large repository of real-life annotated Memes, show that in the task of Meme retrieval, SimMeme outperforms state-of-the-art solutions for image retrieval.

Deep Reinforcement-Learning Framework for Exploratory Data Analysis

Tova Milo, Amit Somech
Workshop Paper AIDM@SIGMOD, 2018

Abstract

Deep Reinforcement Learning (DRL) is unanimously considered as a breakthrough technology, used in solving a growing number of AI challenges previously considered to be intractable. In this work, we aim to set the ground for employing DRL techniques in the context of Exploratory Data Analysis (EDA), an important yet challenging task that is critical in many application domains. We suggest an end-to-end framework architecture, coupled with an initial implementation of each component. The goal of this short paper is to encourage the exploration of DRL models and techniques for facilitating a full-fledged, autonomous solution for EDA.

Next-step Suggestions for Modern Interactive Data Analysis Platforms

Tova Milo, Amit Somech
Research Paper KDD, 2018

Abstract

Modern Interactive Data Analysis (IDA) platforms, such as Kibana, Splunk, and Tableau, are gradually replacing traditional OLAP/SQL tools, as they allow for easy-to-use data exploration, visualization, and mining, even for users lacking SQL and programming skills. Nevertheless, data analysis is still a difficult task, especially for non-expert users. To that end we present REACT, a recommender system designed for modern IDA platforms. In these platforms, analysis sessions interweave high-level actions of multiple types and operate over diverse datasets. REACT identifies and generalizes relevant (previous) sessions to generate personalized next-action suggestions to the user.

We model the user’s analysis context using a generic tree-based model, where the edges represent the user’s recent actions, and the nodes represent their result “screens”. A dedicated context-similarity metric is employed for efficient indexing and retrieval of relevant candidate next-actions. These are then generalized to abstract actions that convey common fragments, then adapted to the specific user context. To prove the utility of REACT we performed an extensive online and offline experimental evaluation over real-world analysis logs from the cyber security domain, which we also publish to serve as a benchmark dataset for future work.

December: A Declarative Tool for Crowd Member Selection

Yael Amsterdamer, Tova Milo, Amit Somech, Brit Youngmann
Demo Paper VLDB, 2016

Abstract

Adequate crowd selection is an important factor in the success of crowdsourcing platforms, increasing the quality and relevance of crowd answers and their performance in different tasks. The optimal crowd selection can greatly vary depending on properties of the crowd and of the task. To this end, we present December, a declarative platform with novel capabilities for flexible crowd selection. December supports the personalized selection of crowd members via a dedicated query language, MemberQL. This language enables specifying and combining common crowd selection criteria such as properties of a crowd member’s profile and history, similarity between profiles in specific aspects, and relevance of the member to a given task. This holistic, customizable approach differs from previous work that has mostly focused on dedicated algorithms for crowd selection in specific settings. To allow efficient query execution, we implement novel algorithms in December based on our generic, semantically aware definitions of crowd member similarity and expertise. We demonstrate the effectiveness of December and MemberQL by using the VLDB community as crowd members and allowing conference participants to choose from among these members for different purposes and in different contexts.

REACT: Context-Sensitive Recommendations for Data Analysis

Tova Milo, Amit Somech
Demo Paper SIGMOD, 2016

Abstract

Data analysis may be a difficult task, especially for non-expert users, as it requires deep understanding of the investigated domain and the particular context. In this demo we present REACT, a system that hooks to the analysis UI and provides the users with personalized recommendations of analysis actions. By matching the current user session to previous sessions of analysts working with the same or other data sets, REACT is able to identify the potentially best next analysis actions in the given user context. Unlike previous work that mainly focused on individual components of the analysis work, REACT provides a holistic approach that captures a wider range of analysis action types by utilizing novel notions of similarity in terms of the individual actions, the analyzed data and the entire analysis workflow.

We demonstrate the functionality of REACT, as well as its effectiveness through a digital forensics scenario where users are challenged to detect cyber attacks in real life data achieved from honeypot servers.

Managing General and Individual Knowledge in Crowd Mining Applications

Yael Amsterdamer, Susan B. Davidson, Anna Kukliansky, Tova Milo, Slava Novgorodov, Amit Somech
Research Paper CIDR, 2015

Abstract

Crowd mining frameworks combine general knowledge, which can refer to an ontology or information in a database, with individual knowledge obtained from the crowd, which captures habits and preferences. To account for such mixed knowledge, along with user interaction and optimization issues, such frameworks must employ a complex process of reasoning, automatic crowd task generation and result analysis. In this paper, we describe a generic architecture for crowd mining applications. This architecture allows us to examine and compare the components of existing crowdsourcing systems and point out extensions required by crowd mining. It also highlights new research challenges and potential reuse of existing techniques/components. We exemplify this for the OASSIS project and for other prominent crowdsourcing frameworks.

OASSIS: Query Driven Crowd Mining

Yael Amsterdamer, Susan B. Davidson, Tova Milo, Slava Novgorodov, Amit Somech
Research Paper SIGMOD, 2014

Abstract

Crowd data sourcing is increasingly used to gather information from the crowd and to obtain recommendations. In this paper, we explore a novel approach that broadens crowd data sourcing by enabling users to pose general questions, to mine the crowd for potentially relevant data, and to receive concise, relevant answers that represent frequent, significant data patterns. Our approach is based on (1) a simple generic model that captures both ontological knowledge as well as the individual history or habits of crowd members from which frequent patterns are mined; (2) a query language in which users can declaratively specify their information needs and the data patterns of interest; (3) an efficient query evaluation algorithm, which enables mining semantically concise answers while minimizing the number of questions posed to the crowd; and (4) an implementation of these ideas that mines the crowd through an interactive user interface. Experimental results with both real-life crowd and synthetic data demonstrate the feasibility and effectiveness of the approach.

Ontology Assisted Crowd Mining

Yael Amsterdamer, Susan B. Davidson, Tova Milo, Slava Novgorodov, Amit Somech
Demo Paper VLDB, 2014

Abstract

We present OASSIS (for Ontology ASSISted crowd mining), a prototype system which allows users to declaratively specify their information needs, and mines the crowd for answers. The answers that the system computes are concise and relevant, and represent frequent, significant data patterns. The system is based on (1) a generic model that captures both ontological knowledge, as well as the individual knowledge of crowd members from which frequent patterns are mined; (2) a query language in which users can specify their information needs and types of data patterns they seek; and (3) an efficient query evaluation algorithm, for mining semantically concise answers while minimizing the number of questions posed to the crowd.

Currently Teaching

  • 2022 – Now

    Database Systems

    Database systems are crucial to any organization wishing to utilize data. The purpose of the course is to teach the principles of using, designing, and maintaining database systems. The course imparts theoretical and practical knowledge while dealing with challenges such as effective querying of the database, proper planning of the data schema, and handling large-scale data.

  • 2021 – Now

    Tabular Data Science

    This course provides an in-depth review of the data scientific pipeline from a data-centric perspective. Focusing on tabular data, we will study several key tasks in the pipeline that facilitate insights and knowledge extraction: from data cleaning, visualization, and pattern mining, to interpretability and explainability of predictive models.
    For each topic, we will begin with a high-level overview, then study one or two representative algorithms/methods, and conclude with a case-study example of how these methods are used for extracting insights. Finally, towards the end of the course, we will discuss whether and how the data scientific pipeline can be automated.

  • 2021 – Now

    Advanced Seminar in Automation and ML in Tabular Data Analytics

    Alongside the recent, meteoric increase in the amount of tabular data, the need for efficient and rapid analysis of this data has grown as well. More and more organizations, led by "Big Tech" companies, recognize the vast potential in extracting insights and conclusions from their data. In this seminar, we will focus on a young but rapidly evolving field of research, in which scientists from both academia and industry develop automated solutions that simplify and facilitate data analysis, science, and mining.

Past Courses

  • 2022 – 2023

    Introduction to Computer Science

    The course surveys fundamental concepts and methods in computer science, taught in C.

  • 2021 – 2022

    Data Science Workshop

    The purpose of the workshop is to provide students with practical experience in data science. During the workshop, students will carry out a large-scale project in which they select a dataset and a prediction task, then perform a complete data science process that includes: defining the prediction problem and evaluation metrics, cleaning the data, selecting and creating features, selecting the right model, and performing parameter tuning. Upon completing a basic prediction model, the students will perform a model quality analysis, then continuously improve the model to obtain better results.

  • 2016 – 2020

    Workshop in Data Science (TA @Tel Aviv University)

    The workshop focuses on knowledge extraction from raw data, using statistical tools and machine learning algorithms. Participating students are required to design and implement such a system and present their results in class.

  • 2016 – 2019

    Database Systems (TA @Tel Aviv University)

    The purpose of this course is to provide an introduction to the design and use of database systems. We begin by covering the relational model and the SQL language, then study methods for database design, covering the entity-relationship model. Finally, we touch on some advanced topics in database systems. The recitation classes cover practical topics in database programming.