Amit Somech

About Me

Hi! I am a senior lecturer at the Computer Science Department at Bar-Ilan University.
My research goal is to automate data analysis and science, in order to minimize the time it takes users to gain insights from their datasets. This is done by integrating AI/ML techniques with data management methodologies.

BTW, if that sounds interesting to you, let's talk! I have several open positions for MSc and PhD students (but also undergrads, project-track MSc students, and postdocs!). Please refer to the Students Section and Research Section for more info.

Contact Details

My office is in Building 503, Room 203, Bar-Ilan University

you_wish2get@mymail.com
amitsomech

Research Activity

Since manual, "hands-on" analysis is still the principle method for gaining knowledge from data, my research is focused on building scientific frameworks that reduce the time and effort it takes to explore, analyze, and acquire insights from datasets.

My research methodology integrates methods from data management research with ML concepts such as deep reinforcement learning and representation learning.

My Research Interests

Automated Data Analytics via Deep Reinforcement Learning
Applying LLMs for Data Exploration, Verification & Generation
Explainability for Unsupervised Data Mining

Selected Research Projects

Large language models (LLMs) demonstrate outstanding performance on many NLP tasks. Apart from utilizing the universal representations from pre-trained models in downstream tasks, recent works show the potential of pre-trained masked language models to be factual knowledge bases.

While LLMs train primarily on unstructured, textual corpora, in this new project we aim to analyze the ability of such models to generate and process structured, tabular data. We are currently focusing on the following tasks:

Generating on-demand data tables from LLMs . Given a specification for the desired output table, the goal of our proposed framework is to infer the correct structure of the table and populate it with relevant data items. Although LLMs can naturally generate tables with some success, we find that they tend to miss a significant number of tuples or values, as well as hallucinate incorrect information. We therefore intend to tackle this issues, and design a framework that harvest correct and complete tabular data from LLMs.
Tabular data verification with LLMs . As we rely more on data-driven insights and models, it becomes increasingly crucial to assess the accuracy and reliability of tabular data. In this research, we explore the potential of LLMs in verifying the correctness of data tables: Our main challenge is to effectively detect errors in tables, while minimizing the number of tokens and calls to the LLM.

This new research domain surfaces significant challenges, both in the field of Natural Language Processing (NLP) and in Data Management research, such as deriving a series of queries for facts retrieval, inferring relations between entities, as well as employing relational operators such as filter, join and group-by, data cleaning, and schema matching.

Researchers, data scientists, and business analysts have used rule-based pattern mining tools for over a decade to extract insights from their datasets. Such rule-based insights and their impact have been described in hundreds of academic papers on topics such as E-commerce, social science, biology, cyber security, and health. While the rules themselves are automatically derived from the data, users may then spend days manually examining them – searching for examples in the raw data, understanding their importance to the analytical task at hand, and looking for connections between the rules. This manual analytical work is usually done by employing exploratory queries, data visualizations, other mining operations (e.g., clustering) as well as predictive ML models.

In this new research project, we develop a scientific framework for analyzing and explaining rule-based insights. The framework will include a model for measuring the contribution of individual data elements to the interestingness of rules, as well as notions of explainability and provenance of rules, allowing users to understand why a certain rule appears or is absent from the results. In particular, since users’ interests and information needs may change throughout their analytical process (as they dive deeper into the data), our framework will adapt its output according to previous query results and insights discovered via other analytical operations.
We expect that our framework will reduce a considerable amount of manual analytical labor, assist users to understand patterns in their data faster and more reliably, and thereby expedite discovery.

More often than not, users explore a dataset in light of some analytical task or goal. In this new research project, we develop an automatic system that generates a personalized exploratory notebook for a given dataset and exploration task.
The expected contribution of this project is threefold:

Designing a formal specification language for exploratory sessions (i.e., sequence of multiple analytical operations).
Building an execution engine which auto-generates exploratory sessions that are both interesting as well as specification compliant. This is done via a constrained reinforcement learning, in which a neural-network-based agent is learning an exploration policy that is both useful (defined via a general-purpose exploration reward) and respect the input specifications.
Devising a generative AI solution which effectively construct exploration specifications, given a natural language task description. This component is primarily based on a Large Language Model (LLM).

In future work, we will investigate how LLMs can be utilized within the reinforcement learning architecture, in order to boost performance and generate more compelling and relevant exploration notebooks.

Published works:

ATENA-PRO: Generating Personalized Exploration Notebooks with Constrained Reinforcement Learning

Data exploration is known to be a difficult process that requires analytical skills as well as domain knowledge. Consequently, numerous previous works have been devised to assist users, focusing on the task of composing exploratory “questions” and translate them to viable queries.
However, a great challenge still remains – inspecting the results of each exploratory step, understand what is interesting about it, and draw insights and conclusions.

In this ongoing research project, we develop an explainability framework for data exploration processes, with the goal of inspecting users’ analytical steps and highlighting the elements that made them interesting.
To achieve this goal, we combine methods for interestingness analysis of query results together with explainability techniques, originally suggested for machine learning and artificial intelligence models.

In future work, we intend to employ and utilize causal inference for analyzing exploration steps and extracting insights, as well as LLMs for the generation of natural language sentences coupled with compatible data visualizations.

Published works:

Open Source Code :
FEDEX is implemented into PD-EXPLAIN, a Python library that wraps Pandas and allow users to obtain coherent explanations on dataframe operations.

Many data scientists are using existing Exploratory Data Analysis (EDA) notebooks to get a head start, before exploring the dataset on their own. Unfortunately, such notebooks are not always available (e.g., if the dataset is new or confidential).

This research project aims to develop ATENA, a system that takes an input dataset and auto-generates a compelling exploratory session, presented in an EDA notebook. To do so, we shape EDA into a control problem, and devise a novel Deep Reinforcement Learning (DRL) architecture to effectively optimize the notebook generation. Though ATENA currently uses a limited set of EDA operations, our experiments show that it generates useful EDA notebooks, allowing users to gain useful, initial insights from their datasets.

Published works:

Open Source Code :

In recent years, modern data analytics platforms (e.g. Tableau, Splunk, IBM InfoSphere) are gradually replacing traditional tools and facilitate data exploration even for users lacking knowledge of SQL and programming languages. Yet, data analysis is still a difficult process, especially for non-expert users, as it requires deep understanding of the investigated domain and the particular context. Users may therefore skip significant analysis actions add overlook important aspects of the data

To address this problem, the goal of this research is developing a framework that provides users with personalized recommendations of relevant data analysis actions. Our framework leverages previous experience of other analysts working with the same or different data, as well the current user context, to generate meaningful analysis recommendations.

Published works:

Publications

A complete list of my publications can be found on DBLP .

EFFECTS: Explorable and Explainable Feature Extraction Framework for Multivariate Time-Series Classification

Ido Ikar, Amit Somech

Demo Paper CIKM, 2023

Abstract

We demonstrate EFFECTS, an automated system for explorable and explainable feature extraction for multivariate time series classification. EFFECTS has a twofold contribution: (1) It significantly facilitates the exploration of MTSC data, and (2) it generates informative yet intuitive and explainable features to be used by the classification model. EFFECTS first mines the MTS data and extracts a set of interpretable features using an optimized transform-slice-aggregate process. To evaluate the quality of EFFECTS features, we gauge how well each feature distinguishes between every two classes, and how well they characterize each single class. Users can then explore the MTS data via the EFFECTS Explorer, which facilitates the visual inspection of important features, dimensions, and time slices. Last, the user can use the top features for each class when building a classification pipeline. We demonstrate EFFECTS on several real-world MTSC datasets, inviting the audience to investigate the data via EFFECTS Explorer and obtain initial insights on the time series data. Then, we will show how EFFECTS features are used in an ML model, and obtain accuracy that is on par with state-of-the-art MTSC models that do not optimize on explainability.

Cluster-Explorer: Explaining Black-Box Clustering Results

Sariel Tutay, Amit Somech

Demo Paper CIKM, 2023

Abstract

Interpreting clustering results is a challenging, manual task, that often requires the user to perform additional analytical queries and visualizations. To this end, we demonstrate Cluster-Explorer, an interactive, easy-to-use framework that provides explanations for black-box clustering results. Cluster-Explorer takes as input the raw dataset alongside cluster labels, and automatically generates multiple coherent explanations that characterize each cluster. We first propose a threefold quality measure that considers the conciseness, cluster coverage, and separation error of an explanation. We tackle the challenge of efficiently computing high-quality explanations using a modified version of a generalized frequent-itemsets mining (gFIM) algorithm. The gFIM algorithm is employed over multiple filter predicates which are extracted by applying various binning methods of different granularities. We implemented Cluster-Explorer as a Python library that can be easily used by data scientists in their ongoing workflows. After employing the clustering pipeline of their choice, Cluster-Explorer opens an integrated, interactive interface for the user to explore the various different explanations for each cluster. In our demonstration, the audience is invited to use Cluster-Explorer on numerous real-life datasets and different clustering pipelines and examine the usefulness of the cluster explanations provided by the system, as well as its efficiency of computation.

ATENA-PRO: Generating Personalized Exploration Notebooks with Constrained Reinforcement Learning

Tavor Lipman, Tova Milo, Amit Somech

Demo Paper SIGMOD, 2023

Abstract

One of the most common, helpful practices of data scientists, when starting the exploration of a given dataset, is to examine existing data exploration notebooks prepared by other data analysts or scientists. These notebooks contain curated sessions of contextually-related query operations that together demonstrate interesting hypotheses and conjectures on the data. Unfortunately,relevant such notebooks, that had been prepared on the same dataset, and in light of thesame analysis task, are often nonexistent or unavailable. In this work, we describe ATENA-PRO, a framework for auto-generating such relevant, personalized exploratory sessions. Using a novel specification language, users first describe their desired output notebook. Our language contains dedicated constructs for contextually connecting future output queries. These specifications are then used as input for a Deep Reinforcement Learning (DRL) engine, which auto-generates the personalized notebook. Our DRL engine relies on an existing, general-purpose, DRL framework for data exploration. However, augmenting the generic framework with user specifications requires overcoming a difficult sparsity challenge, as only a small portion of the possible sessions may be compliant with the specifications. Inspired by solutions for constrained reinforcement learning, we devise a compound, flexible reward scheme as well as specification-aware neural network architecture. Our experimental evaluation shows that the combination of these components allows ATENA-PRO to consistently generate interesting, personalized exploration sessions for various analysis tasks and datasets.

Selecting Sub-tables for Data Exploration

Yael Amsterdamer, Susan B. Davidson, Tova Milo, Kathy Razmadze, Amit Somech

Research Paper ICDE, 2023

Abstract

Data scientists frequently examine the raw content of large tables when exploring an unknown dataset. In such cases, small subsets of the full tables (sub-tables) that accurately capture table contents are useful. We present a framework which, given a large data table T, creates a sub-table of small, fixed dimensions by selecting a subset of T’s rows and projecting them over a subset of T’s columns. The question is: Which rows and columns should be selected to yield an informative sub-table?Our first contribution is an informativeness metric for sub-tables with two complementary dimensions: cell coverage, which measures how well the sub-table captures prominent data patterns in T, and diversity. We use association rules as the patterns captured by sub-tables, and show that computing optimal sub-tables directly using this metric is infeasible. We then develop an efficient algorithm that indirectly accounts for association rules using table embedding. The resulting framework produces sub-tables for the full table as well as for the results of queries over the table, enabling the user to quickly understand results and determine subsequent queries. Experimental results show that high-quality sub-tables can be efficiently computed, and verify the soundness of our metrics as well as the usefulness of selected sub-tables through user studies.

FEDEX: An Explainability Framework for Data Exploration Steps

Daniel Deutch, Amir Gilad, Tova Milo, Amit Somech

Research Paper VLDB, 2023

Abstract

When exploring a new dataset, Data Scientists often apply analysis queries, look for insights in the resulting dataframe, and repeat to apply further queries. We propose in this paper a novel solution that assists data scientists in this laborious process. In a nutshell, our solution pinpoints the most interesting (sets of) rows in each obtained dataframe. Uniquely, our definition of interest is based on the contribution of each row to the interestingness of different columns of the entire dataframe, which, in turn, is defined using standard measures such as diversity and exceptionality. Intuitively, interesting rows are ones that explain why (some column of) the analysis query result is interesting as a whole. Rows are correlated in their contribution and so the interesting score for a set of rows may not be directly computed based on that of individual rows. We address the resulting computational challenge by restricting attention to semantically-related sets, based on multiple notions of semantic relatedness; these sets serve as more informative explanations. Our experimental study across multiple real-world datasets shows the usefulness of our system in various scenarios.

SubStrat: A Subset-Based Optimization Strategy for Faster AutoML

Teddy Lazebnik, Amit Somech, Avi Weinberg

Research Paper VLDB, 2023

Abstract

Automated machine learning (AutoML) frameworks are gaining popularity among data scientists as they dramatically reduce the manual work devoted to the construction of ML pipelines while obtaining similar and sometimes even better results than manually-built models. Such frameworks intelligently search among millions of possible ML pipeline configurations to finally retrieve an optimal pipeline in terms of predictive accuracy. However, when the training dataset is large, the construction and evaluation of a single ML pipeline take longer, which makes the overall AutoML running times increasingly high. To this end, we demonstrate SubStrat, an AutoML optimization strategy that tackles the dataset size rather than the configurations search space. SubStrat wraps existing AutoML tools, and instead of executing them directly on the large dataset, it uses a genetic-based algorithm to find a small yet representative data subset that preserves characteristics of the original one. SubStrat then employs the AutoML tool on the generated subset, resulting in an intermediate ML pipeline, which is later refined by executing a restricted, much shorter, AutoML process on the large dataset. We demonstrate SubStrat on both AutoSklearn, TPOT, and H2O, three popular AutoML frameworks, using several real-life datasets.

Demonstrating SubStrat: A Subset-Based Strategy for Faster AutoML on Large Datasets

Teddy Lazebnik, Amit Somech

Demo Paper CIKM, 2022

Abstract

SubTab: Data Exploration with Informative Sub-Tables

Yael Amsterdamer, Susan B. Davidson, Tova Milo, Kathy Razmadze, Amit Somech

Demo Paper SIGMOD, 2022

Abstract

We demonstrate SubTab, a framework for creating small, informative sub-tables of large data tables to speed up data exploration. Given a table with n rows and m columns where n and m are large, SubTab creates a sub-table T_sub with k < n rows and l < m columns, i.e. a subset of k rows of the table projected over a subset of l columns. The rows and columns are chosen as representatives of prominent data patterns within and across columns in the input table. SubTab can also be used for query results, enabling the user to quickly understand the results and determine subsequent queries.

ExplainED: Explanations for EDA Notebooks

Daniel Deutch, Amir Gilad, Tova Milo, Amit Somech

Demo Paper VLDB, 2020

Abstract

Exploratory Data Analysis (EDA) is an essential yet highly demanding task. To get a head start before exploring a new dataset, data scientists often prefer to view existing EDA notebooks - illustrative exploratory sessions that were created by fellow data scientists who examined the same dataset and shared their notebooks via online platforms. Unfortunately, creating an illustrative, well-documented notebook is cumbersome and time-consuming, therefore users sometimes share their notebook without explaining their exploratory steps and their results. Such notebooks are difficult to follow and to understand. To address this, we present ExplainED, a system that automatically attaches explanations to views in EDA notebooks. ExplainED analyzes each view in order to detect what elements thereof are particularly interesting, and produces a corresponding textual explanation. The explanations are generated by first evaluating the interestingness of the given view using several measures capturing different interestingness facets, then computing the Shapely values of the elements in the view, w.r.t. the interestingness measure yielding the highest score. These Shapely values are then used to guide the generation of the textual explanation. We demonstrate the usefulness of the explanations generated by ExplainED on real-life, undocumented EDA notebooks.

Automatically Generating Data Exploration Sessions Using Deep Reinforcement Learning

Ori Bar El, Tova Milo, Amit Somech

Research Paper SIGMOD, 2020

Abstract

Exploratory Data Analysis (EDA) is an essential yet highly demanding task. To get a head start before exploring a new dataset, data scientists often prefer to view existing EDA notebooks - illustrative, curated exploratory sessions, on the same dataset, that were created by fellow data scientists who shared them online. Unfortunately, such notebooks are not always available (e.g., if the dataset is new or confidential). To address this, we present ATENA, a system that takes an input dataset and auto-generates a compelling exploratory session, presented in an EDA notebook. We shape EDA into a control problem, and devise a novel Deep Reinforcement Learning (DRL) architecture to effectively optimize the notebook generation. Though ATENA uses a limited set of EDA operations, our experiments show that it generates useful EDA notebooks, allowing users to gain actual insights.

Automating Exploratory Data Analysis via Machine Learning: An Overview

Tova Milo, Amit Somech

Conference Tutorial SIGMOD, 2020

Abstract

Exploratory Data Analysis (EDA) is an important initial step for any knowledge discovery process, in which data scientists interactively explore unfamiliar datasets by issuing a sequence of analysis operations (e.g. filter, aggregation, and visualization). Since EDA is long known as a difficult task, requiring profound analytical skills, experience, and domain knowledge, a plethora of systems have been devised over the last decade in order to facilitate EDA. In particular, advancements in machine learning research have created exciting opportunities, not only for better facilitating EDA, but to fully automate the process. In this tutorial, we review recent lines of work for automating EDA. Starting from recommender systems for suggesting a single exploratory action, going through kNN-based classifiers and active-learning methods for predicting users' interestingness preferences, and finally to fully automating EDA using state-of-the-art methods such as deep reinforcement learning and sequence-to-sequence models. We conclude the tutorial with a discussion on the main challenges and open questions to be dealt with in order to ultimately reduce the manual effort required for EDA.

Incremental Based Top-k Similarity Search Framework for Interactive-Data-Analysis Sessions

Oded Elbaz, Tova Milo, Amit Somech

Research Paper EDBT, 2020

Abstract

Interactive Data Analysis (IDA) is a core knowledge-discovery process, in which data scientists explore datasets by issuing a sequence of data analysis actions (e.g. filter, aggregation, visualization), referred to as a session. Since IDA is a challenging task, special recommendation systems were devised in previous work, aimed to assist users in choosing the next analysis action to perform at each point in the session. Such systems often record previous IDA sessions and utilize them to generate next-action recommendations. To do so, a compound, dedicated session-similarity measure is employed to find the top-k sessions most similar to the session of the current user. Clearly, the efficiency of the top-k similarity search is critical to retain interactive response times. However, optimizing this search is challenging due to the non-metric nature of the session similarity measure.

To address this problem we exploit a key property of IDA, which is that the user session progresses incrementally, with the top-k similarity search performed, by the recommender system, at each step. We devise efficient top-k algorithms that harness the incremental nature of the problem to speed up the similarity search, employing a novel, effective filter-and-refine method. Our experiments demonstrate the efficiency of our solution, obtaining a running-time speedup of over 180X compared to a sequential similarity search.

Towards Autonomous, Hands-Free Data Exploration

Ori Bar El, Tova Milo, Amit Somech

Vision Paper CIDR, 2020

Abstract

Exploratory Data Analysis (EDA) is an important yet difficult task, currently performed by expert users, as it requires deep understanding of the data domain as well as profound analytical skills. In this work we make the case for the Hands-Free EDA (HFE) paradigm, in which the exploratory process is automatically conducted, requiring little or no human input as in watching a “video” presenting selected highlights of the dataset. To that end, we suggest an end-to-end visionary system architecture, coupled with a prototype implementation. Our preliminary experimental results demonstrate that HFE is achievable, and leads the way for improvement and optimization research.

ATENA: An Autonomous System for Data Exploration Based on Deep Reinforcement Learning

Ori Bar El, Tova Milo, Amit Somech

Demo Paper CIKM, 2019

Abstract

Exploratory Data Analysis (EDA), is an important yet challenging task, that requires profound analytical skills and familiarity with the data domain. While Deep Reinforcement Learning (DRL) is nowadays used to solve AI challenges previously considered to be intractable, to our knowledge such solutions have not yet been applied to EDA.

In this work we present ATENA, an autonomous system capable of exploring a given dataset by executing a meaningful sequence of EDA operations. ATENA uses a novel DRL architecture, and learns to perform EDA operations by independently interacting with the dataset, without any training data or human assistance. We demonstrate ATENA in the context of cyber security log analysis, where the audience is invited to partake in a data exploration challenge: explore real-life network logs, assisted by ATENA, in order to reveal underlying security attacks hidden in the data.

Declarative User Selection with Soft Constraints

Yael Amsterdamer, Tova Milo, Amit Somech Brit Youngmann

Research Paper CIKM, 2019

Abstract

In applications with large userbases such as crowdsourcing, social networks or recommender systems, selecting users is a common and challenging task. Different applications require different policies for selecting users, and implementing such policies is application-specific and laborious.

To this end, we introduce a novel declarative framework that abstracts common components of the user selection problem, while allowing for domain-specific tuning. The framework is based on an ontology view of user profiles, with respect to which we define a query language for policy specification. Our language extends SPARQL with means for capturing soft constraints which are essential for worker selection. At the core of our query engine is then a novel efficient algorithm for handling these constraints. Our experimental study on real-life data indicates the effectiveness and flexibility of our approach, showing in particular that it outperforms existing task-specific solutions in prominent user selection tasks.

3 Lessons Learned from Implementing a Deep Reinforcement Learning Framework for Data Exploration

Ori Bar El, Tova Milo,Amit Somech

Workshop Paper AIDB@VLDB, 2019

Abstract

We examine the opportunities and the challenges that stem from implementing a Deep Reinforcement Learning (DRL) framework for Exploratory Data Analysis (EDA). We have dedicated a considerable effort in the design and the development of a DRL system that can autonomously explore a given dataset, by performing an entire sequence of analysis operations that highlight interesting aspects of the data.

In this work, we describe our system design and development process, particularly delving into the major challenges we encountered and eventually overcame. We focus on three important lessons we learned, one for each principal component of the system: (1) Designing a DRL environment for EDA, comprising a machine-readable encoding for analysis operations and result-sets, (2) formulating a reward mechanism for exploratory sessions, then further tuning it to elicit a desired output, and (3) Designing an efficient neural network architecture, capable of effectively choosing between hundreds of thousands of distinct analysis operations. We believe that the lessons we learned may be useful for the databases community members making their first steps in applying DRL techniques to their problem domains.

Predicting "What is Interesting" by Mining Interactive-Data-Analysis Session Logs

Tova Milo, Chai Ozeri, Amit Somech

Research Paper EDBT, 2019

Abstract

Assessing the interestingness of data analysis actions has been the subject of extensive previous work, and a multitude of interestingness measures have been devised, each capturing a different facet of the broad concept. While such measures are a core component in many analysis platforms (e.g., for ranking association rules, recommending visualizations, and query formulation), choosing the most adequate measure for a specific analysis task or an application domain is known to be a difficult task.

In this work we focus on the choice of interestingness measures particularly for Interactive Data Analysis (IDA), where users examine datasets by performing sessions of analysis actions. Our goal is to determine the most suitable interestingness measure that adequately captures the user’s current interest at each step of an interactive analysis session. We propose a novel solution that is based on the mining of IDA session logs. First, we perform an offline analysis of the logs, and identify unique characteristics of interestingness in IDA sessions. We then define a classification problem and build a predictive model that can select the best measure for a given a state of a user session. Our experimental evaluation, performed over real-life session logs, demonstrates the sensibility and adequacy of our approach.

Boosting SimRank with Semantics

Tova Milo, Amit Somech Brit Youngmann

Research Paper EDBT, 2019

Abstract

The problem of estimating the similarity of a pair of nodes in an information network draws extensive interest in numerous fields, e.g., social networks and recommender systems. In this work we revisit SimRank, a popular and well studied similarity measure for information networks, that quantifies the similarity of two nodes based on the similarity of their neighbors. SimRank’s popularity stems from its simple, declarative definition and its efficient, scalable computation. However, despite its wide adaptation, it has been observed that for many applications SimRank may yield inaccurate similarity estimations, due to the fact that it focuses on the network structure and ignores the semantics conveyed in the node/edge labels. Therefore, the question that we ask is can SimRank be enriched with semantics while preserving its advantages?

We answer the question positively and present SemSim, a modular variant of SimRank that allows to inject into the computation any semantic similarly measure, which satisfies three natural conditions. The probabilistic framework that we develop for SemSim is anchored in a careful modification of SimRank’s underlying random surfer model. It employs Importance Sampling along with a novel pruning technique, based on unique properties of SemSim. Our framework yields execution times essentially on par with the (semantic-less) SimRank, while maintaining negligible error rate, and facilitates direct adaptation of existing SimRank optimizations. Our experiments demonstrate the robustness of SemSim, even compared to task-dedicated measures.

SimMeme: A Search Engine for Internet Memes

Tova Milo, Amit Somech Brit Youngmann

Research Paper ICDE, 2019

Abstract

As more and more social network users interact through Internet Memes, an emerging popular type of captioned images, there is a growing need for users to quickly retrieve the right Meme for a given situation. As opposed conventional image search, visually similar Memes may reflect different concepts. Intent is sometimes captured by user annotations (e.g., tags), but these are often incomplete and ambiguous. Thus, a deeper analysis of the relations among Memes is required for an accurate, custom search.

To address this problem, we present SimMeme, a Meme-dedicated search engine. SimMeme uses a generic graph-based data model that aligns various types of information about the Memes with a semantic ontology. A novel similarity measure that effectively considers all incorporated data is employed and serves as the foundation of our system. Our experimental results achieve using common evaluation metrics and crowd feedback, over a large repository of real-life annotated Memes, show that in the task of Meme retrieval, SimMeme outperforms state-of-the-art solutions for image retrieval.

Deep Reinforcement-Learning Framework for Exploratory Data Analysis

Tova Milo, Amit Somech

Workshop Paper AIDM@SIGMOD, 2018

Abstract

Deep Reinforcement Learning (DRL) is unanimously considered as a breakthrough technology, used in solving a growing number of AI challenges previously considered to be intractable. In this work, we aim to set the ground for employing DRL techniques in the context of Exploratory Data Analysis (EDA), an important yet challenging, that is critical in many application domains. We suggest an end-to-end framework architecture, coupled with an initial implementation of each component. The goal of this short paper is to encourage the exploration of DRL models and techniques for facilitating a full-fledged, autonomous solution for EDA.

Next-step Suggestions for Modern Interactive Data Analysis Platforms

Tova Milo, Amit Somech

Research Paper KDD, 2018

Abstract

Modern Interactive Data Analysis (IDA) platforms, such as Kibana, Splunk, and Tableau, are gradually replacing traditional OLAP/SQL tools, as they allow for easy-to-use data exploration, visualization, and mining, even for users lacking SQL and programming skills. Nevertheless, data analysis is still a di cult task, especially for non-expert users. To that end we present REACT, a recommender system designed for modern IDA platforms. In these platforms, analysis sessions interweave high-level actions of multiple types and operate over diverse datasets . REACT identifies and generalizes relevant (previous) sessions to generate personalized next-action suggestions to the user.

We model the user’s analysis context using a generic tree based model, where the edges represent the user’s recent actions, and the nodes represent their result “screens”. A dedicated context-similarity metric is employed for efficient indexing and retrieval of relevant candidate next-actions. These are then generalized to abstract actions that convey common fragments, then adapted to the specific user context. To prove the utility of REACT we performed an extensive online and offline experimental evaluation over real-world analysis logs from the cyber security domain, which we also publish to serve as a benchmark dataset for future work.

December: A Declarative Tool for Crowd Member Selection

Yael Amsterdamer, Tova Milo, Amit Somech, Brit Youngmann

Demo Paper VLDB, 2016

Abstract

Adequate crowd selection is an important factor in the success of crowdsourcing platforms, increasing the quality and relevance of crowd answers and their performance in different tasks. The optimal crowd selection can greatly vary depending on properties of the crowd and of the task. To this end, we present December, a declarative platform with novel capabilities for flexible crowd selection. December supports the personalized selection of crowd members via a dedicated query language Member-QL. This language enables specifying and combining common crowd selection criteria such as properties of a crowd member’s profile and history, similarity between profiles in specific aspects and relevance of the member to a given task. This holistic, customizable approach differs from previous work that has mostly focused on dedicated algorithms for crowd selection in specific settings. To allow efficient query execution, we implement novel algorithms in December based on our generic, semantically aware definitions of crowd member similarity and expertise. We demonstrate the effectiveness of December and MemberQL by using the VLDB community as crowd members and allowing conference participants to choose from among these members for different purposes and in different contexts.

REACT: Context-Sensitive Recommendations for Data Analysis

Tova Milo, Amit Somech

Demo Paper SIGMOD, 2016

Abstract

Data analysis may be a difficult task, especially for non-expert users, as it requires deep understanding of the investigated domain and the particular context. In this demo we present REACT, a system that hooks to the analysis UI and provides the users with personalized recommendations of analysis actions. By matching the current user session to previous sessions of analysts working with the same or other data sets, REACT is able to identify the potentially best next analysis actions in the given user context. Unlike previous work that mainly focused on individual components of the analysis work, REACT provides a holistic approach that captures a wider range of analysis action types by utilizing novel notions of similarity in terms of the individual actions, the analyzed data and the entire analysis workflow.

We demonstrate the functionality of REACT, as well as its effectiveness through a digital forensics scenario where users are challenged to detect cyber attacks in real life data achieved from honeypot servers.

Managing General and Individual Knowledge in Crowd Mining Applications

Yael Amsterdamer, Susan B. Davidson, Anna Kukliansky, Tova Milo, Slava Novgorodov, and Amit Somech.

Research Paper CIDR, 2015

Abstract

Crowd mining frameworks combine general knowledge, which can refer to an ontology or information in a database, with individual knowledge obtained from the crowd, which captures habits and preferences. To account for such mixed knowledge, along with user interaction and optimization is- sues, such frameworks must employ a complex process of reasoning, automatic crowd task generation and result analysis. In this paper, we describe a generic architecture for crowd mining applications. This architecture allows us to examine and compare the components of existing crowdsourcing systems and point out extensions required by crowd mining. It also highlights new research challenges and potential reuse of existing techniques/components. We exemplify this for the OASSIS project and for other prominent crowdsourcing frameworks.

OASSIS: Query Driven Crowd Mining

Yael Amsterdamer, Susan B. Davidson, Tova Milo, Slava Novgorodov , Amit Somech,

Research Paper SIGMOD, 2014

Abstract

Crowd data sourcing is increasingly used to gather information from the crowd and to obtain recommendations. In this paper, we explore a novel approach that broadens crowd data sourcing by enabling users to pose general questions, to mine the crowd for potentially relevant data, and to receive concise, relevant answers that represent frequent, significant data patterns. Our approach is based on (1) a simple generic model that captures both ontological knowledge as well as the individual history or habits of crowd members from which frequent patterns are mined; (2) a query language in which users can declaratively specify their information needs and the data patterns of interest; (3) an efficient query evaluation algorithm, which enables mining semantically concise answers while minimizing the number of questions posed to the crowd; and (4) an implementation of these ideas that mines the crowd through an interactive user interface. Experimental results with both real-life crowd and synthetic data demonstrate the feasibility and effectiveness of the approach.

Ontology Assisted Crowd Mining

Yael Amsterdamer, Susan B. Davidson, Tova Milo, Slava Novgorodov , Amit Somech,

Demo Paper VLDB, 2014

Abstract

We present OASSIS (for Ontology ASSISted crowd mining), a prototype system which allows users to declaratively specify their information needs, and mines the crowd for answers. The answers that the system computes are concise and relevant, and represent frequent, significant data patterns. The system is based on (1) a generic model that captures both ontological knowledge, as well as the individual knowledge of crowd members from which frequent patterns are mined; (2) a query language in which users can specify their information needs and types of data patterns they seek; and (3) an efficient query evaluation algorithm, for mining semantically concise answers while minimizing the number of questions posed to the crowd.

Teaching

I currently teach at the Computer Science Department, Bar-Ilan University.

Office hours by appointment.

Currently Teaching

2022 Now

Database Systems

Database systems are a crucial to any organization wishing to utilize data. The purpose of the course is to teach the principles of using, designing, and maintaining database systems. The course imparts theoretical and practical knowledge while dealing with challenges such as effective querying of the database, proper planning of the data schema, and handling large-scale data.
2021 Now

Tabular Data Science

This course provides an in-depth review of the data scientific pipeline from a data-centric perspective. Focusing on tabular data, we will study several key tasks in the pipeline that facilitate insights and knowledge extraction: from data cleaning, visualization, and pattern mining, to interpretability and explainability of predictive models.
For each topic, we will begin with a high-level overview, then study one or two representative algorithms/methods, and conclude with a case-study example of how these methods are used for extracting insights. Finally, towards the end of the course, we will discuss whether and how the data scientific pipeline can be automated.
2021 Now

Advanced Seminar in Automation and ML in Tabular Data Analytics

Together with the recent, meteoric increase in the amount of tabular data, the need for efficient and rapid analysis of this data has increased as well. More and more organizations, led by "Big Tech" companies, recognize the vast potential in extracting insights and conclusions from their data. In this seminar, we will focus on a young but rapidly evolving field of research, in which scientists, from both academia and industry, develop automated solutions that simplify and facilitate data analysis, science, and mining.

Past Courses

2022 2023

Introduction to Computer Science

The course surveys fundamental concepts and methods in computer science, taught in C.
2021 2022

Data Science Workshop

The purpose of the workshop is to provide students with practical experience in data science. During the workshop, students will carry out a large-scale project in which they will select a dataset and a prediction task, then perform a complete data science process that includes: defining the prediction problem and evaluation metrics, cleaning the data, selecting and creating features, selecting the right model, and performing parameters tuning. Upon completing a basic prediction model, the students will perform a model quality analysis, then continuously improve the model to obtain better results.
2016 2020

Workshop in Data Science (TA @Tel Aviv University)

The workshop focuses on knowledge extraction from raw data, using statistical tools and machine learning algorithms. Participating students are required to design and implement such a system and present their results in class.
2016 2019

Database Systems (TA @Tel Aviv University)

The purpose of this course is to provide an introduction to the design and use of database systems. We begin by covering the relational model and the SQL language, then study methods for database design, covering the entity relationship model. Finally, we touch on some advanced topics in database systems. The recitation classes will cover practical topics in database programming.

Students

Our lab is always looking for bright and motivated people. We have several open slots for MSc and PhD candidates, as well as postdoctoral research positions. We are working on various, interesting research topics, all revolving around better utilizing and understanding data.

We also seek undergrads and project-track MSc students with proven engineering and programming skills, to help us design and build useful open-source implementations of our research frameworks. See for example pd-explain and SubStrat-AutoML.

If this sounds interesting to you, please review the research projects listed in the Research Page, and briefly read a paper or two from the Publication Page . If still interested, send me an email so we can schedule a talk ^_^.
Not sure this is 100% for you? Feel free to reach out anyways.

Computer Science Department,Bar-Ilan University

About Me

Amit Somech

Contact Details

Research Activity

My Research Interests

Selected Research Projects

Extracting and Processing Tabular Data from Large Language Models (new, joint research with Oren Glickman)

Explainability Framework for Rule-Based Data Insights (new, joint research with Susan B. Davidson )

Generating Personalized Exploration Notebooks Using LLMs and Constrained Reinforcement Learning (new)

Explainability Framework for Data Exploration

A Deep Reinforcement Learning Framework for Data Exploration

Recommendation System for Interactive Data Analysis

Publications

EFFECTS: Explorable and Explainable Feature Extraction Framework for Multivariate Time-Series Classification

Abstract

Cluster-Explorer: Explaining Black-Box Clustering Results

Abstract

ATENA-PRO: Generating Personalized Exploration Notebooks with Constrained Reinforcement Learning

Abstract

Selecting Sub-tables for Data Exploration

Abstract

FEDEX: An Explainability Framework for Data Exploration Steps

Abstract

SubStrat: A Subset-Based Optimization Strategy for Faster AutoML

Abstract

Demonstrating SubStrat: A Subset-Based Strategy for Faster AutoML on Large Datasets

Abstract

SubTab: Data Exploration with Informative Sub-Tables

Abstract

ExplainED: Explanations for EDA Notebooks

Abstract

Automatically Generating Data Exploration Sessions Using Deep Reinforcement Learning

Abstract

Automating Exploratory Data Analysis via Machine Learning: An Overview

Abstract

Incremental Based Top-k Similarity Search Framework for Interactive-Data-Analysis Sessions

Abstract

Towards Autonomous, Hands-Free Data Exploration

Abstract

ATENA: An Autonomous System for Data Exploration Based on Deep Reinforcement Learning

Abstract

Declarative User Selection with Soft Constraints

Abstract

3 Lessons Learned from Implementing a Deep Reinforcement Learning Framework for Data Exploration

Abstract

Predicting "What is Interesting" by Mining Interactive-Data-Analysis Session Logs

Abstract

Boosting SimRank with Semantics

Abstract

SimMeme: A Search Engine for Internet Memes

Abstract

Deep Reinforcement-Learning Framework for Exploratory Data Analysis

Abstract

Next-step Suggestions for Modern Interactive Data Analysis Platforms

Abstract

December: A Declarative Tool for Crowd Member Selection

Abstract

REACT: Context-Sensitive Recommendations for Data Analysis

Abstract

Managing General and Individual Knowledge in Crowd Mining Applications

Abstract

OASSIS: Query Driven Crowd Mining

Abstract

Ontology Assisted Crowd Mining

Abstract

Teaching

Currently Teaching

Database Systems

Tabular Data Science

Advanced Seminar in Automation and ML in Tabular Data Analytics

Past Courses

Introduction to Computer Science

Data Science Workshop

Workshop in Data Science (TA @Tel Aviv University)

Database Systems (TA @Tel Aviv University)

Students

Computer Science Department,
Bar-Ilan University