I'm a researcher at Xerox Research in France.
In 2011 I completed my PhD studies at the NLP lab at the Computer Science Department, Bar-Ilan University.
Since 2012 I've been based in Grenoble, France, working mostly on statistical machine translation (SMT) and on textual entailment (sometimes even together).
My PhD research was in the field of Natural Language Processing and was supervised by Prof. Ido Dagan. I worked on applied semantic inference, under the Textual Entailment framework, exploring
various aspects of textual entailment, including knowledge acquisition, discourse and contextual models, as well as the utilization of textual entailment in NLP applications, such as SMT or text categorization.
Shachar Mirkin and Laurent Besacier.
Data Selection for Compact Adapted Models in Statistical Machine Translation.
Data selection is a common technique for adapting statistical translation models to a specific
domain, which has been shown both to improve translation quality and to reduce model size.
Selection relies on some in-domain data, from the same domain as the texts expected to be translated.
Selecting the sentence-pairs that are most similar to the in-domain data from a pool of
parallel texts has been shown to be effective; yet this approach risks
limited coverage when necessary n-grams that do appear in the pool are less similar to the in-domain
data that is available in advance. Some methods select additional data based on the
actual text that needs to be translated. While useful, this is not always a practical scenario.
In this work we describe an extensive exploration of data selection techniques over Arabic to
French datasets, and propose methods to address both similarity and coverage considerations
while maintaining a limited model size.
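As a rough illustration of the trade-off described in this abstract, the sketch below greedily picks sentences from a pool under a fixed budget, scoring each candidate by its n-gram similarity to the in-domain data plus the share of its n-grams not yet covered by earlier picks. This is not the method from the paper: the function names (`ngrams`, `select`), the scoring formula and the `coverage_weight` parameter are all assumptions made for the example.

```python
def ngrams(tokens, n_max=2):
    """Return the set of all n-grams (as tuples) of length 1..n_max."""
    return {tuple(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)}


def select(pool, in_domain, budget, coverage_weight=1.0):
    """Greedily pick up to `budget` sentences from `pool` (hypothetical scoring).

    similarity: fraction of a sentence's n-grams seen in the in-domain data
    novelty:    fraction of its n-grams not covered by sentences picked so far
    """
    in_domain_ngrams = set()
    for sent in in_domain:
        in_domain_ngrams |= ngrams(sent.split())

    covered = set()
    remaining = list(pool)
    selected = []
    while remaining and len(selected) < budget:
        def score(sent):
            grams = ngrams(sent.split())
            if not grams:
                return 0.0
            similarity = len(grams & in_domain_ngrams) / len(grams)
            novelty = len(grams - covered) / len(grams)
            return similarity + coverage_weight * novelty

        best = max(remaining, key=score)
        selected.append(best)
        covered |= ngrams(best.split())
        remaining.remove(best)
    return selected


pool = [
    "the patient took the drug",
    "the patient took the drug",      # exact duplicate, adds no coverage
    "stock markets fell sharply",
    "the drug dose was reduced",
]
in_domain = ["the patient was given a drug dose"]
print(select(pool, in_domain, budget=3))
```

With `coverage_weight=0.0` the selection reduces to pure similarity ranking, so the duplicate sentence is picked twice; with the default weight it is skipped in favour of a sentence that adds unseen n-grams. The budget plays the role of the model-size limit.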
Incrementally Updating the SMT Reordering Model.
Joern Wuebker, Hermann Ney, Adrià Martínez-Villaronga, Adrià Giménez, Alfons Juan, Christophe Servan, Marc Dymetman and Shachar Mirkin.
Comparison of Data Selection Techniques for the Translation of Video Lectures.
Anand Gupta, Manpreet Kathuria, Adarsh Singh, Aseem Goyal, Shachar Mirkin.
Text Summarization through Entailment-based Minimum Vertex Cover.
Sentence connectivity is a textual characteristic that can be exploited to select the sentences of a meaningful summary. However, existing
summarization methods do not fully utilize its potential. The present paper introduces a novel method for single-document text summarization. It poses the text summarization task as an optimization
problem, and attempts to solve it using Weighted Minimum Vertex Cover (wMVC), a graph-based algorithm. Textual entailment, an established indicator of semantic relationships between text units, is used to measure sentence connectivity and construct the graph on which wMVC operates. Experiments on a standard summarization dataset show that the suggested algorithm outperforms related methods.
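For readers unfamiliar with weighted vertex cover, here is a minimal sketch of a standard 2-approximation (the pricing, or local-ratio, scheme) on a toy sentence graph. It is not the solver used in the paper, and the abstract does not say how the graph is weighted: the vertex weights, the edge list and the name `weighted_vertex_cover` are invented for illustration, with edges standing in for sentence pairs linked by entailment and lower weights assumed for better-connected sentences.

```python
def weighted_vertex_cover(weights, edges):
    """Approximate a minimum weighted vertex cover (pricing / local-ratio scheme).

    weights: dict mapping vertex -> non-negative weight
    edges:   iterable of (u, v) pairs, assumed to contain no self-loops
    Returns the set of vertices whose weight has been fully paid for.
    """
    residual = dict(weights)
    cover = set()
    for u, v in edges:
        if u in cover or v in cover:
            continue  # this edge is already covered
        # Charge both endpoints the largest amount neither can refuse.
        pay = min(residual[u], residual[v])
        residual[u] -= pay
        residual[v] -= pay
        # At least one endpoint now has zero residual weight and joins the cover.
        if residual[u] == 0:
            cover.add(u)
        if residual[v] == 0:
            cover.add(v)
    return cover


# Toy graph: vertices are sentences, edges are assumed entailment links.
weights = {"s0": 0.2, "s1": 1.0, "s2": 1.0, "s3": 0.5}
edges = [("s0", "s1"), ("s0", "s2"), ("s2", "s3")]
summary = weighted_vertex_cover(weights, edges)
print(sorted(summary))
```

Every edge ends up with at least one endpoint in the returned set, and the total weight of that set is at most twice the optimum, which is the usual guarantee for this scheme.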
Shachar Mirkin and Nicola Cancedda.
Assessing Quick Update Methods of Statistical Translation Models
The ability to quickly incorporate incoming training data into a running translation system is critical in a number of applications. Mechanisms based on incremental model update and the online EM algorithm hold the promise of achieving this objective in a principled way. Still, efficient tools for incremental training are yet to be available. In this paper we experiment with simple alternative solutions for interim model updates, within the popular Moses system. Short of updating the model in real time, such updates can execute in short timeframes even when operating on large models, and achieve a performance level close to, and in some cases exceeding, that of batch retraining.
William Darling, Cédric Archambeau, Shachar Mirkin and Guillaume Bouchard.
Error Prediction with Partial Feedback.
In this paper, we propose a probabilistic framework for predicting the root causes of errors in data processing pipelines made up of several components when we only have access to partial feedback; that is, we are aware when some error has occurred in one or more of the components, but we do not know which one. The proposed error model enables us to direct the user feedback to the correct components in the pipeline to either automatically correct errors as they occur, retrain the component with assimilated training examples, or take other corrective action. We present the model and describe an Expectation Maximization (EM)-based algorithm to learn the model parameters and predict the error configuration. We demonstrate the accuracy and usefulness of our method first on synthetic data, and then on two distinct tasks: error correction in a 2-component opinion summarization system, and phrase error detection in statistical machine translation.
Shachar Mirkin, Sriram Venkatapathy and Marc Dymetman. Confidence-driven Rewriting for Improved Translation.
MT Summit 2013.
Some source texts are more difficult to translate than others. One way to handle such texts is to modify them prior to translation. Yet, a prominent factor that is often overlooked is the source translatability with respect to the specific translation system and the specific model that are being used. We present an approach, and an interactive tool implementing it, where source sentences are rewritten in order to maximize confidence estimates with respect to the translation model. The automatically-generated rewritings are then proposed for the user’s approval. Such an approach can reduce post-editing effort, replacing it by cost-effective pre-editing that can be done by monolinguals.
Shachar Mirkin, Sriram Venkatapathy, Marc Dymetman and Ioan Calapodescu. SORT: Interactive Source-Rewriting for Improved Translation.
ACL 2013 System Demonstration.
The quality of automatic translation is affected by many factors. One is the divergence between the specific source and target languages. Another lies in the source text itself, as some texts are more complex than others. One way to handle such texts is to modify them prior to translation. Yet, an important factor that is often overlooked is the source translatability with respect to the specific translation system and the specific model that are being used. In this paper we present an interactive system where source modifications are induced by confidence estimates that are derived from the translation model in use. Modifications are automatically generated and proposed for the user’s approval. Such a system can reduce post-editing effort, replacing it by cost-effective pre-editing that can be done by monolinguals.
Sriram Venkatapathy and Shachar Mirkin. 2012.
An SMT-driven Authoring Tool.
This paper presents a tool for assisting users in composing texts in a language they do not know. While Machine Translation (MT) is pretty useful for understanding texts in an unfamiliar language, current MT technology has yet to reach the stage where it can be used reliably without a post-editing step. This work attempts to make a step towards achieving this goal. We propose a tool that provides suggestions for the continuation of the text in the source language (language that the user knows), thus creating texts that can be translated to the target language (language that the user does not know). In terms of functionality, our tool resembles text prediction applications. However, the target language, through a Statistical Machine Translation (SMT) model, drives the composition and not only the source language. We present the user interface and describe the considerations that underlie the suggestion process. A simulation of user interaction shows that composition speed can be substantially reduced and provides initial positive feedback as to the ability to generate better translations.
Shachar Mirkin. 2011. Context and Discourse in Textual Entailment Inference.
PhD Thesis. Department of Computer Science, Bar-Ilan University.
Shachar Mirkin, Ido Dagan, Lili Kotlerman and Idan Szpektor. 2011.
Classification-based Contextual Preferences.
This paper addresses context matching in textual
inference. We formulate the task under
the Contextual Preferences framework which
broadly captures contextual aspects of inference.
We propose a generic classification-based
scheme under this framework which coherently
attends to context matching in inference
and may be employed in any inference-based
task. As a test bed for our scheme we use
the Name-based Text Categorization (TC) task.
We define an integration of Contextual Preferences
into the TC setting and present a concrete
self-supervised model which instantiates the
generic scheme and is applied to address context
matching in the TC task. Experiments on
standard TC datasets show that our approach
outperforms the state of the art in context modeling
for Name-based TC.
Asher Stern, Amnon Lotan, Shachar Mirkin, Eyal Shnarch, Lili Kotlerman, Jonathan Berant and Ido Dagan. 2011.
Knowledge and Tree-Edits in Learnable Entailment Proofs.
Text Analysis Conference (RTE-7).
This paper describes BIUTEE, the Bar-Ilan University Textual Entailment Engine. BIUTEE is
a natural language inference system in which
the hypothesis is proven by the text, based on
linguistic and world-knowledge resources, as
well as syntactically motivated tree transformations. The main progress in BIUTEE in
the last year is a new confidence model that
estimates the validity of the proof found by BIUTEE.
Asher Stern, Eyal Shnarch, Amnon Lotan, Shachar Mirkin, Lili Kotlerman, Naomi Zeichner, Jonathan Berant and Ido Dagan. 2010.
Rule Chaining and Approximate Match in textual inference.
Text Analysis Conference (RTE-6)
This paper describes the participation of Bar-Ilan University in the sixth RTE challenge. Our
textual-entailment engine, BiuTee, was enhanced with new components that introduce chaining
of lexical-entailment rules, and tackle the problem of approximately matching the text and the hypothesis
after all available knowledge of entailment rules was utilized. We have also re-engineered
our system aiming at an open-source open architecture. BiuTee's performance is better than the
median of all submissions, and significantly outperforms an IR-oriented baseline.
Shachar Mirkin, Jonathan Berant, Ido Dagan and Eyal Shnarch. 2010. Recognising Entailment within Discourse.
Texts are commonly interpreted based on
the entire discourse in which they are situated.
Discourse processing has been
shown to be useful for inference-based applications;
yet, most systems for textual entailment
– a generic paradigm for applied inference
– have only addressed discourse
considerations via off-the-shelf coreference
resolvers. In this paper we explore
various discourse aspects in entailment inference,
suggest initial solutions for them
and investigate their impact on entailment
performance. Our experiments suggest
that discourse provides useful information,
which significantly improves entailment
inference, and should be better addressed
by future entailment systems.
Shachar Mirkin, Ido Dagan and Sebastian Padó. 2010. Assessing the Role of Discourse References in Entailment Inference.
Proceedings of ACL.
Discourse references, notably coreference and bridging, play an important role in many text understanding applications, but their impact on textual entailment is yet to be systematically understood. On the basis
of an in-depth analysis of entailment instances, we argue that discourse references have the potential of substantially
improving textual entailment recognition, and identify a number of research directions towards this goal.
Wilker Aziz, Marc Dymetman, Shachar Mirkin, Lucia Specia, Nicola Cancedda and Ido Dagan. 2010.
Learning an Expert from Human Annotations in Statistical Machine Translation: the Case of Out-of-Vocabulary Words.
Proceedings of EAMT.
We present a general method for incorporating an
“expert” model into a Statistical Machine Translation
(SMT) system, in order to improve its performance
on a particular “area of expertise”, and apply
this method to the specific task of finding adequate
replacements for Out-of-Vocabulary (OOV)
words. Candidate replacements are paraphrases
and entailed phrases, obtained using monolingual
resources. These candidate replacements are
transformed into “dynamic biphrases”, generated
at decoding time based on the context of each
source sentence. Standard SMT features are enhanced
with a number of new features aimed at
scoring translations produced by using different
replacements. Active learning is used to discriminatively
train the model parameters from human
assessments of the quality of translations. The
learning framework yields an SMT system which
is able to deal with sentences containing OOV
words but also guarantees that the performance
is not degraded for input sentences without OOV
words. Results of experiments on English-French
translation show that this method outperforms previous
work addressing OOV words in terms of acceptability.
Azad Abad, Luisa Bentivogli, Ido Dagan, Danilo Giampiccolo, Shachar Mirkin, Emanuele Pianta and Asher Stern. 2010. A Resource for Investigating the Impact of Anaphora and Coreference on Inference.
Proceedings of LREC.
Discourse phenomena play a major role in text processing tasks. However, so far relatively little study has been devoted to the relevance of discourse phenomena for inference. Therefore, an experimental study was carried out to assess the relevance of anaphora and coreference for Textual Entailment (TE), a prominent inference framework. First, the annotation of anaphoric and coreferential links in the RTE-5 Search data set was performed according to a specifically designed annotation scheme. As a result, a new data set was created where all anaphora and coreference instances in the entailing sentences which are relevant to the entailment judgment are solved and annotated. A by-product of the annotation is a new “augmented” data set, where all the referring expressions which need to be resolved in the entailing sentences are replaced by explicit expressions. Starting from the final output of the annotation, the actual impact of discourse phenomena on inference engines was investigated, identifying the kind of operations that the systems need to apply to address discourse phenomena and trying to find direct mappings between these operations and annotation types.
Shachar Mirkin, Roy Bar-Haim, Jonathan Berant, Ido Dagan, Eyal Shnarch, Asher Stern and Idan Szpektor. 2009.
Addressing Discourse and Document Structure in the RTE Search Task.
TAC (RTE-5 Search task submission report, ranked 1st).
This paper describes Bar-Ilan University's submissions to RTE-5. This year we focused on the Search pilot, enhancing our entailment system to address two main issues introduced by this new setting: scalability and, primarily, document-level discourse. Our system achieved the highest score on the Search task amongst participating groups, and proposes first steps towards addressing this challenging setting.
Shachar Mirkin, Lucia Specia, Nicola Cancedda, Ido Dagan, Marc Dymetman and Idan Szpektor. 2009. Source-Language Entailment Modeling for Translating Unknown Terms.
Proceedings of ACL-IJCNLP.
This paper addresses the task of handling unknown terms in SMT.
We propose using source-language monolingual models and resources to paraphrase
the source text prior to translation. We further present a conceptual extension
to prior work by allowing translations of entailed texts rather than paraphrases
only. A method for performing this process efficiently is presented and applied
to some 2500 sentences with unknown terms. Our experiments show that the
proposed approach substantially increases the number of properly translated
Shachar Mirkin, Ido Dagan, Eyal Shnarch. 2009. Evaluating the Inferential Utility of Lexical-Semantic Resources.
Proceedings of EACL.
Lexical-semantic resources are used extensively for
applied semantic inference, yet a clear quantitative picture of their current
utility and limitations is largely missing. We propose system- and
application-independent evaluation and analysis methodologies for resources’
performance, and systematically apply them to seven prominent resources. Our
findings identify the currently limited recall of available resources, and
indicate the potential to improve performance by examining non-standard relation
types and by distilling the output of distributional methods. Further, our
results stress the need to include auxiliary information regarding the lexical
and logical contexts in which a lexical inference is valid, as well as its prior
Roy Bar-Haim, Jonathan Berant, Ido Dagan, Iddo Greental, Shachar Mirkin, Eyal Shnarch, Idan Szpektor. 2008. Efficient Semantic Deduction and Approximate Matching over Compact Parse Forests.
Text Analysis Conference (TAC).
Semantic inference is often modeled as application of
entailment rules, which specify generation of entailed sentences from a source
sentence. Efficient generation and representation of entailed consequents is a
fundamental problem common to such inference methods. We present a new data
structure, termed compact forest, which allows efficient generation and
representation of entailed consequents, each represented as a parse tree.
Rule-based inference is complemented with a new approximate matching measure
inspired by tree kernels, which is computed efficiently over compact forests.
Our system also makes use of novel large-scale entailment rule bases, derived
from Wikipedia as well as from information about predicates and their argument
mapping, gathered from available lexicons and complemented by unsupervised
Shachar Mirkin, Ido Dagan, Maayan Geffet. 2006. Integrating Pattern-Based and Distributional Similarity Methods
for Lexical Entailment Acquisition.
This paper addresses the problem of acquiring lexical
semantic relationships, applied to the lexical entailment relation. Our main
contribution is a novel conceptual integration between the two distinct
acquisition paradigms for lexical relations – the pattern-based and the
distributional similarity approaches. The integrated method exploits mutual
complementary information of the two approaches to obtain candidate relations
and informative characterizing features. Then, a small size training set is used
to construct a more accurate supervised classifier, showing significant increase
in both recall and precision over the original approaches.
Shachar Mirkin. 2006. Integrating Pattern-Based and Distributional Similarity Methods for Lexical Entailment Acquisition.
MSc Thesis. School of Computer Science and Engineering, the Hebrew University of Jerusalem.
Model-aware improvement of source translatability. Laboratoire d'Informatique de Grenoble (LIG). June 2013
Textual entailment inference in machine translation (with Ido Dagan). Workshop of Machine Translation and Morphologically-rich Languages. January 2011. [slides]
Incorporating Discourse Information within Textual Entailment Inference. Invited talk at the Institute of Formal and Applied Linguistics, Charles University, Prague. November 2010.
Context Models for Textual Entailment and their Application to Statistical Machine Translation. PASCAL2 Pump Priming and Thematic Programme Workshop September 2009. Bled, Slovenia.
Evaluating the Inferential Utility of Lexical-Semantic Resources. BISFAI-09.
Source-Language Entailment Modeling for Translating Unknown Terms. BISFAI-09.
Introduction to Chinese NLP. The Eighth Annual Conference of Asian Studies in Israel. June 2009.
Textual Entailment. Xerox Research Centre Europe. June 2008.
Chinese Word Segmentation. The Fourth Annual Conference of Asian Studies in Israel. May 2005.
University of Grenoble Alps, Xerox Research, Amdocs, Intel, Interwise & ClearForest in Israel, China and France.
For a more organized course of events see: