Shachar Mirkin   -   שחר מירקין


How to pronounce my name

I'm a researcher at Xerox Research in France.
In 2011 I completed my PhD studies at the NLP lab at the Computer Science Department, Bar-Ilan University.
Since 2012 I've been based in Grenoble, France, working mostly on statistical machine translation (SMT) and on textual entailment (sometimes even together).

My PhD research was in the field of Natural Language Processing, under the supervision of Prof. Ido Dagan. I worked on applied semantic inference within the Textual Entailment framework, exploring various aspects of textual entailment, including knowledge acquisition, discourse and contextual models, as well as the use of textual entailment in NLP applications such as SMT and text categorization.

Contact me
my email   

   View Shachar Mirkin's profile on LinkedIn

Publications

New  Shachar Mirkin and Laurent Besacier. Data Selection for Compact Adapted Models in Statistical Machine Translation. AMTA 2014.  PDF  Bib entry
Data selection is a common technique for adapting statistical translation models for a specific domain, which has been shown to both improve translation quality and to reduce model size. Selection relies on some in-domain data, of the same domain of the texts expected to be translated. Selecting the sentence-pairs that are most similar to the in-domain data from a pool of parallel texts has been shown to be effective; yet, this approach holds the risk of resulting in a limited coverage, when necessary n-grams that do appear in the pool are less similar to in-domain data that is available in advance. Some methods select additional data based on the actual text that needs to be translated. While useful, this is not always a practical scenario. In this work we describe an extensive exploration of data selection techniques over Arabic to French datasets, and propose methods to address both similarity and coverage considerations while maintaining a limited model size.

New  Shachar Mirkin. Incrementally Updating the SMT Reordering Model. PACLIC 2014.

New  Joern Wuebker, Hermann Ney, Adrià Martínez-Villaronga, Adrià Giménez, Alfons Juan, Christophe Servan, Marc Dymetman and Shachar Mirkin. Comparison of Data Selection Techniques for the Translation of Video Lectures. AMTA 2014.

New  Anand Gupta, Manpreet Kathuria, Adarsh Singh, Aseem Goyal, Shachar Mirkin. Text Summarization through Entailment-based Minimum Vertex Cover. *SEM 2014.  PDF  Bib entry
Sentence Connectivity is a textual characteristic that may be incorporated intelligently for the selection of sentences of a well meaning summary. However, the existing summarization methods do not utilize its potential fully. The present paper introduces a novel method for single document text summarization. It poses the text summarization task as an optimization problem, and attempts to solve it using Weighted Minimum Vertex Cover (wMVC), a graph-based algorithm. Textual entailment, an established indicator of semantic relationships between text units, is used to measure sentence connectivity and construct the graph on which wMVC operates. Experiments on a standard summarization dataset show that the suggested algorithm outperforms related methods.

Shachar Mirkin and Nicola Cancedda. Assessing Quick Update Methods of Statistical Translation Models. IWSLT 2013.  PDF  Bib entry
The ability to quickly incorporate incoming training data into a running translation system is critical in a number of applications. Mechanisms based on incremental model update and the online EM algorithm hold the promise of achieving this objective in a principled way. Still, efficient tools for incremental training are yet to be available. In this paper we experiment with simple alternative solutions for interim model updates, within the popular Moses system. Short of updating the model in real time, such updates can execute in short timeframes even when operating on large models, and achieve a performance level close to, and in some cases exceeding, that of batch retraining.

William Darling, Cédric Archambeau, Shachar Mirkin and Guillaume Bouchard. Error Prediction with Partial Feedback. ECML/PKDD 2013.  PDF  Bib entry
In this paper, we propose a probabilistic framework for predicting the root causes of errors in data processing pipelines made up of several components when we only have access to partial feedback; that is, we are aware when some error has occurred in one or more of the components, but we do not know which one. The proposed error model enables us to direct the user feedback to the correct components in the pipeline to either automatically correct errors as they occur, retrain the component with assimilated training examples, or take other corrective action. We present the model and describe an Expectation Maximization (EM)-based algorithm to learn the model parameters and predict the error configuration. We demonstrate the accuracy and usefulness of our method first on synthetic data, and then on two distinct tasks: error correction in a 2-component opinion summarization system, and phrase error detection in statistical machine translation.

Shachar Mirkin, Sriram Venkatapathy and Marc Dymetman. Confidence-driven Rewriting for Improved Translation. MT Summit 2013.  PDF  Bib entry
Some source texts are more difficult to translate than others. One way to handle such texts is to modify them prior to translation. Yet, a prominent factor that is often overlooked is the source translatability with respect to the specific translation system and the specific model that are being used. We present an approach, and an interactive tool implementing it, where source sentences are rewritten in order to maximize confidence estimates with respect to the translation model. The automatically-generated rewritings are then proposed for the user’s approval. Such an approach can reduce post-editing effort, replacing it by cost-effective pre-editing that can be done by monolinguals.

Shachar Mirkin, Sriram Venkatapathy, Marc Dymetman and Ioan Calapodescu. SORT: Interactive Source-Rewriting for Improved Translation. ACL 2013 System Demonstration.  PDF  Bib entry
The quality of automatic translation is affected by many factors. One is the divergence between the specific source and target languages. Another lies in the source text itself, as some texts are more complex than others. One way to handle such texts is to modify them prior to translation. Yet, an important factor that is often overlooked is the source translatability with respect to the specific translation system and the specific model that are being used. In this paper we present an interactive system where source modifications are induced by confidence estimates that are derived from the translation model in use. Modifications are automatically generated and proposed for the user’s approval. Such a system can reduce post-editing effort, replacing it by cost-effective pre-editing that can be done by monolinguals.

Sriram Venkatapathy and Shachar Mirkin. 2012. An SMT-driven Authoring Tool. COLING 2012.  PDF  Bib entry
This paper presents a tool for assisting users in composing texts in a language they do not know. While Machine Translation (MT) is pretty useful for understanding texts in an unfamiliar language, current MT technology has yet to reach the stage where it can be used reliably without a post-editing step. This work attempts to make a step towards achieving this goal. We propose a tool that provides suggestions for the continuation of the text in the source language (language that the user knows), thus creating texts that can be translated to the target language (language that the user does not know). In terms of functionality, our tool resembles text prediction applications. However, the target language, through a Statistical Machine Translation (SMT) model, drives the composition and not only the source language. We present the user interface and describe the considerations that underline the suggestion process. A simulation of user interaction shows that composition speed can be substantially reduced and provides initial positive feedback as to the ability to generate better translations.

Shachar Mirkin. 2011. Context and Discourse in Textual Entailment Inference. PhD Thesis. Department of Computer Science, Bar-Ilan University.  PDF

Shachar Mirkin, Ido Dagan, Lili Kotlerman and Idan Szpektor. 2011. Classification-based Contextual Preferences. TextInfer 2011.  PDF  Bib entry  Ppt
This paper addresses context matching in textual inference. We formulate the task under the Contextual Preferences framework which broadly captures contextual aspects of inference. We propose a generic classification-based scheme under this framework which coherently attends to context matching in inference and may be employed in any inference-based task. As a test bed for our scheme we use the Name-based Text Categorization (TC) task. We define an integration of Contextual Preferences into the TC setting and present a concrete self-supervised model which instantiates the generic scheme and is applied to address context matching in the TC task. Experiments on standard TC datasets show that our approach outperforms the state of the art in context modeling for Name-based TC.

Asher Stern, Amnon Lotan, Shachar Mirkin, Eyal Shnarch, Lili Kotlerman, Jonathan Berant and Ido Dagan. 2011. Knowledge and Tree-Edits in Learnable Entailment Proofs. Text Analysis Conference (RTE-7).  PDF
This paper describes BIUTEE - Bar Ilan University Textual Entailment Engine. BIUTEE is a natural language inference system in which the hypothesis is proven by the text, based on linguistic- and world- knowledge resources, as well as syntactically motivated tree transformations. The main progress in BIUTEE in the last year is a new confidence model that estimates the validity of the proof found by BIUTEE.

Asher Stern, Eyal Shnarch, Amnon Lotan, Shachar Mirkin, Lili Kotlerman, Naomi Zeichner, Jonathan Berant and Ido Dagan. 2010. Rule Chaining and Approximate Match in Textual Inference. Text Analysis Conference (RTE-6).  PDF  Bib entry
This paper describes the participation of Bar-Ilan University in the sixth RTE challenge. Our textual-entailment engine, BiuTee, was enhanced with new components that introduce chaining of lexical-entailment rules, and tackle the problem of approximately matching the text and the hypothesis after all available knowledge of entailment rules was utilized. We have also re-engineered our system aiming at an open-source open architecture. BiuTee's performance is better than the median of all submissions, and significantly outperforms an IR-oriented baseline.

Shachar Mirkin, Jonathan Berant, Ido Dagan and Eyal Shnarch. 2010. Recognising Entailment within Discourse. COLING.   PDF  Bib entry
Texts are commonly interpreted based on the entire discourse in which they are situated. Discourse processing has been shown useful for inference-based application; yet, most systems for textual entailment – a generic paradigm for applied inference – have only addressed discourse considerations via off-the-shelf coreference resolvers. In this paper we explore various discourse aspects in entailment inference, suggest initial solutions for them and investigate their impact on entailment performance. Our experiments suggest that discourse provides useful information, which significantly improves entailment inference, and should be better addressed by future entailment systems.

Shachar Mirkin, Ido Dagan and Sebastian Padó. 2010. Assessing the Role of Discourse References in Entailment Inference. Proceedings of ACL.  PDF  Bib entry
Discourse references, notably coreference and bridging, play an important role in many text understanding applications, but their impact on textual entailment is yet to be systematically understood. On the basis of an in-depth analysis of entailment instances, we argue that discourse references have the potential of substantially improving textual entailment recognition, and identify a number of research directions towards this goal.

Wilker Aziz, Marc Dymetman, Shachar Mirkin, Lucia Specia, Nicola Cancedda and Ido Dagan. 2010. Learning an Expert from Human Annotations in Statistical Machine Translation: the Case of Out-of-Vocabulary Words. Proceedings of EAMT.  PDF   Bib  
We present a general method for incorporating an “expert” model into a Statistical Machine Translation (SMT) system, in order to improve its performance on a particular “area of expertise”, and apply this method to the specific task of finding adequate replacements for Out-of-Vocabulary (OOV) words. Candidate replacements are paraphrases and entailed phrases, obtained using monolingual resources. These candidate replacements are transformed into “dynamic biphrases”, generated at decoding time based on the context of each source sentence. Standard SMT features are enhanced with a number of new features aimed at scoring translations produced by using different replacements. Active learning is used to discriminatively train the model parameters from human assessments of the quality of translations. The learning framework yields an SMT system which is able to deal with sentences containing OOV words but also guarantees that the performance is not degraded for input sentences without OOV words. Results of experiments on English-French translation show that this method outperforms previous work addressing OOV words in terms of acceptability.

Azad Abad, Luisa Bentivogli, Ido Dagan, Danilo Giampiccolo, Shachar Mirkin, Emanuele Pianta and Asher Stern. 2010. A Resource for Investigating the Impact of Anaphora and Coreference on Inference. Proceedings of LREC.  PDF   Bib  
Discourse phenomena play a major role in text processing tasks. However, so far relatively little study has been devoted to the relevance of discourse phenomena for inference. Therefore, an experimental study was carried out to assess the relevance of anaphora and coreference for Textual Entailment (TE), a prominent inference framework. First, the annotation of anaphoric and coreferential links in the RTE-5 Search data set was performed according to a specifically designed annotation scheme. As a result, a new data set was created where all anaphora and coreference instances in the entailing sentences which are relevant to the entailment judgment are solved and annotated. A by-product of the annotation is a new “augmented” data set, where all the referring expressions which need to be resolved in the entailing sentences are replaced by explicit expressions. Starting from the final output of the annotation, the actual impact of discourse phenomena on inference engines was investigated, identifying the kind of operations that the systems need to apply to address discourse phenomena and trying to find direct mappings between these operations and annotation types.

Shachar Mirkin, Roy Bar-Haim, Jonathan Berant, Ido Dagan, Eyal Shnarch, Asher Stern and Idan Szpektor. 2009. Addressing Discourse and Document Structure in the RTE Search Task. TAC (RTE-5 Search task submission report -- ranked 1st). PDF  Bib  
This paper describes Bar-Ilan University's submissions to RTE-5. This year we focused on the Search pilot, enhancing our entailment system to address two main issues introduced by this new setting: scalability and, primarily, document-level discourse. Our system achieved the highest score on the Search task amongst participating groups, and proposes first steps towards addressing this challenging setting.

Shachar Mirkin, Lucia Specia, Nicola Cancedda, Ido Dagan, Marc Dymetman and Idan Szpektor. 2009. Source-Language Entailment Modeling for Translating Unknown Terms. Proceedings of ACL-IJCNLP. PDF   Bib  
This paper addresses the task of handling unknown terms in SMT. We propose using source-language monolingual models and resources to paraphrase the source text prior to translation. We further present a conceptual extension to prior work by allowing translations of entailed texts rather than paraphrases only. A method for performing this process efficiently is presented and applied to some 2500 sentences with unknown terms. Our experiments show that the proposed approach substantially increases the number of properly translated texts.

Shachar Mirkin, Ido Dagan, Eyal Shnarch. 2009. Evaluating the Inferential Utility of Lexical-Semantic Resources. Proceedings of EACL. PDF  Bib entry   Ppt
Lexical-semantic resources are used extensively for applied semantic inference, yet a clear quantitative picture of their current utility and limitations is largely missing. We propose system- and application-independent evaluation and analysis methodologies for resources’ performance, and systematically apply them to seven prominent resources. Our findings identify the currently limited recall of available resources, and indicate the potential to improve performance by examining non-standard relation types and by distilling the output of distributional methods. Further, our results stress the need to include auxiliary information regarding the lexical and logical contexts in which a lexical inference is valid, as well as its prior validity likelihood.

Roy Bar-Haim, Jonathan Berant, Ido Dagan, Iddo Greental, Shachar Mirkin, Eyal Shnarch, Idan Szpektor. 2008. Efficient Semantic Deduction and Approximate Matching over Compact Parse Forests. Text Analysis Conference (TAC).  PDF  Bib
Semantic inference is often modeled as application of entailment rules, which specify generation of entailed sentences from a source sentence. Efficient generation and representation of entailed consequents is a fundamental problem common to such inference methods. We present a new data structure, termed compact forest, which allows efficient generation and representation of entailed consequents, each represented as a parse tree. Rule-based inference is complemented with a new approximate matching measure inspired by tree kernels, which is computed efficiently over compact forests. Our system also makes use of novel large-scale entailment rule bases, derived from Wikipedia as well as from information about predicates and their argument mapping, gathered from available lexicons and complemented by unsupervised learning.

Shachar Mirkin, Ido Dagan, Maayan Geffet. 2006. Integrating Pattern-Based and Distributional Similarity Methods for Lexical Entailment Acquisition. COLING-ACL.  PDF  Bib entry
This paper addresses the problem of acquiring lexical semantic relationships, applied to the lexical entailment relation. Our main contribution is a novel conceptual integration between the two distinct acquisition paradigms for lexical relations – the pattern-based and the distributional similarity approaches. The integrated method exploits mutual complementary information of the two approaches to obtain candidate relations and informative characterizing features. Then, a small size training set is used to construct a more accurate supervised classifier, showing significant increase in both recall and precision over the original approaches.

Shachar Mirkin. 2006. MSc Thesis. Integrating Pattern-Based and Distributional Similarity Methods for Lexical Entailment Acquisition. School of Computer Science and Engineering, the Hebrew University of Jerusalem.  PDF  Bib

Talks


Model-aware improvement of source translatability. Laboratoire d'Informatique de Grenoble (LIG). June 2013

Textual entailment inference in machine translation (with Ido Dagan). Workshop of Machine Translation and Morphologically-rich Languages. January 2011. [slides]

Incorporating Discourse Information within Textual Entailment Inference. Invited talk at the Institute of Formal and Applied Linguistics, Charles University, Prague. November 2010.

Context Models for Textual Entailment and their Application to Statistical Machine Translation. PASCAL2 Pump Priming and Thematic Programme Workshop, Bled, Slovenia. September 2009.

Evaluating the Inferential Utility of Lexical-Semantic Resources. BISFAI-09. June 2009.

Source-Language Entailment Modeling for Translating Unknown Terms. BISFAI-09. June 2009.

Introduction to Chinese NLP. The Eighth Annual Conference of Asian Studies in Israel. June 2009.

Textual Entailment. Xerox Research Centre Europe. June 2008.

Chinese Word Segmentation. The Fourth Annual Conference of Asian Studies in Israel. May 2005.


Academic Activities
  • Program Committee Member/Reviewer:
  • Coordinator of a Pascal-2 Pump Priming Project, done in collaboration with Xerox Research Centre Europe (XRCE), titled: Context Models for Textual Entailment and their Application to Statistical Machine Translation.
  • TA in the courses: Intro to CS, Data Structures and Algorithms at The Hebrew University of Jerusalem, 2003-2005.

Education
  • 2011: PhD in Computer Science -- Natural Language Processing. Bar-Ilan University. Advisor: Prof. Ido Dagan.
  • 2006: MSc in Computer Science -- Natural Language Processing. The Hebrew University of Jerusalem. Advisors: Prof. Ido Dagan & Prof. Ari Rappoport.
  • 2000: BSc (Cum Laude) in Computer Science and East Asia Studies. The Hebrew University of Jerusalem.

Where I've been working
  University of Grenoble Alps, Xerox Research, Amdocs, Intel, Interwise & ClearForest, in Israel, China and France.
  For a more organized course of events, see my LinkedIn profile.

    (Natural) Language Learning Links
    My own handy list of language learning links

  • French: Duolingo  --  Memrise  --  English-French Dictionary  --  French-English  --  Bleu Azur Radio  --  French accents
  • Japanese: Japanese dictionary  --  JapanesePod101 (Beginners)  --  Hiragana Quiz  --  Katakana Quiz  --  Japanese telenovelas  --  Japanit (Hebrew)  --  Verb conjugator
  • Italian: Radio 105  --  English-Italian dictionary
  • Chinese: ChinesePod  --  BBC Chinese  --  90.3 FM (Taiwan)  --  Chinese-English Dictionary  --  Chinese music
  • Arabic: איילון-שנער  --  صوت اسرائيل  --  BBC in Arabic  --  English-Arabic dictionary
  • English: WordWeb Online  --  Urban Dictionary  --  Etymology Dictionary  --  Common errors

  • שחר מירקין

    As of mid-2014 I am in Grenoble, France. I work mainly on machine translation, but also on other topics in natural language processing. In 2011 I completed a PhD in Computer Science at Bar-Ilan University, in the field of natural language processing, under the supervision of Prof. Ido Dagan.

    More details on the English page

    Contact
    my email
