Shachar Mirkin   -   שחר מירקין


In 2011 I completed my PhD studies in the NLP lab of the Computer Science Department, Bar-Ilan University.
I'm now at Xerox Research Centre Europe, working on Statistical Machine Translation (SMT).

My PhD research was in Natural Language Processing, supervised by Prof. Ido Dagan. I worked on applied semantic inference within the Textual Entailment framework, exploring several aspects of textual entailment, including knowledge acquisition and discourse and contextual models, as well as the use of textual entailment in NLP applications such as Statistical Machine Translation.

Contact me
my email
Publications
Sriram Venkatapathy and Shachar Mirkin. 2012. An SMT-driven Authoring Tool. COLING 2012.  PDF  Bib entry
This paper presents a tool for assisting users in composing texts in a language they do not know. While Machine Translation (MT) is quite useful for understanding texts in an unfamiliar language, current MT technology has yet to reach the stage where it can be used reliably without a post-editing step. This work attempts to make a step towards achieving this goal. We propose a tool that provides suggestions for the continuation of the text in the source language (the language that the user knows), thus creating texts that can be translated to the target language (the language that the user does not know). In terms of functionality, our tool resembles text prediction applications. However, the target language, through a Statistical Machine Translation (SMT) model, drives the composition and not only the source language. We present the user interface and describe the considerations that underlie the suggestion process. A simulation of user interaction shows that composition time can be substantially reduced and provides initial positive feedback as to the ability to generate better translations.

Shachar Mirkin. 2011. Context and Discourse in Textual Entailment Inference. PhD Thesis. Department of Computer Science, Bar-Ilan University.  PDF

Shachar Mirkin, Ido Dagan, Lili Kotlerman and Idan Szpektor. 2011. Classification-based Contextual Preferences. TextInfer 2011.  PDF  Bib entry  Ppt
This paper addresses context matching in textual inference. We formulate the task under the Contextual Preferences framework which broadly captures contextual aspects of inference. We propose a generic classification-based scheme under this framework which coherently attends to context matching in inference and may be employed in any inference-based task. As a test bed for our scheme we use the Name-based Text Categorization (TC) task. We define an integration of Contextual Preferences into the TC setting and present a concrete self-supervised model which instantiates the generic scheme and is applied to address context matching in the TC task. Experiments on standard TC datasets show that our approach outperforms the state of the art in context modeling for Name-based TC.

Asher Stern, Amnon Lotan, Shachar Mirkin, Eyal Shnarch, Lili Kotlerman, Jonathan Berant and Ido Dagan. 2011. Knowledge and Tree-Edits in Learnable Entailment Proofs. Text Analysis Conference (RTE-7).
This paper describes BIUTEE, the Bar-Ilan University Textual Entailment Engine. BIUTEE is a natural language inference system in which the hypothesis is proven by the text, based on linguistic and world-knowledge resources, as well as syntactically motivated tree transformations. The main progress in BIUTEE in the last year is a new confidence model that estimates the validity of the proof found by BIUTEE.  PDF

Asher Stern, Eyal Shnarch, Amnon Lotan, Shachar Mirkin, Lili Kotlerman, Naomi Zeichner, Jonathan Berant and Ido Dagan. 2010. Rule Chaining and Approximate Match in Textual Inference. Text Analysis Conference (RTE-6).  PDF  Bib entry
This paper describes the participation of Bar-Ilan University in the sixth RTE challenge. Our textual-entailment engine, BiuTee, was enhanced with new components that introduce chaining of lexical-entailment rules, and tackle the problem of approximately matching the text and the hypothesis after all available knowledge of entailment rules has been utilized. We have also re-engineered our system, aiming at an open-source, open architecture. BiuTee's performance is better than the median of all submissions, and significantly outperforms an IR-oriented baseline.

Shachar Mirkin, Jonathan Berant, Ido Dagan and Eyal Shnarch. 2010. Recognising Entailment within Discourse. COLING.   PDF  Bib entry
Texts are commonly interpreted based on the entire discourse in which they are situated. Discourse processing has been shown to be useful for inference-based applications; yet, most systems for textual entailment – a generic paradigm for applied inference – have only addressed discourse considerations via off-the-shelf coreference resolvers. In this paper we explore various discourse aspects in entailment inference, suggest initial solutions for them and investigate their impact on entailment performance. Our experiments suggest that discourse provides useful information, which significantly improves entailment inference, and should be better addressed by future entailment systems.

Shachar Mirkin, Ido Dagan and Sebastian Padó. 2010. Assessing the Role of Discourse References in Entailment Inference. Proceedings of ACL.  PDF  Bib entry
Discourse references, notably coreference and bridging, play an important role in many text understanding applications, but their impact on textual entailment is yet to be systematically understood. On the basis of an in-depth analysis of entailment instances, we argue that discourse references have the potential of substantially improving textual entailment recognition, and identify a number of research directions towards this goal.

Wilker Aziz, Marc Dymetman, Shachar Mirkin, Lucia Specia, Nicola Cancedda and Ido Dagan. 2010. Learning an Expert from Human Annotations in Statistical Machine Translation: the Case of Out-of-Vocabulary Words. Proceedings of EAMT.  PDF   Bib  
We present a general method for incorporating an “expert” model into a Statistical Machine Translation (SMT) system, in order to improve its performance on a particular “area of expertise”, and apply this method to the specific task of finding adequate replacements for Out-of-Vocabulary (OOV) words. Candidate replacements are paraphrases and entailed phrases, obtained using monolingual resources. These candidate replacements are transformed into “dynamic biphrases”, generated at decoding time based on the context of each source sentence. Standard SMT features are enhanced with a number of new features aimed at scoring translations produced by using different replacements. Active learning is used to discriminatively train the model parameters from human assessments of the quality of translations. The learning framework yields an SMT system which is able to deal with sentences containing OOV words but also guarantees that the performance is not degraded for input sentences without OOV words. Results of experiments on English-French translation show that this method outperforms previous work addressing OOV words in terms of acceptability.

Azad Abad, Luisa Bentivogli, Ido Dagan, Danilo Giampiccolo, Shachar Mirkin, Emanuele Pianta and Asher Stern. 2010. A Resource for Investigating the Impact of Anaphora and Coreference on Inference. Proceedings of LREC.  PDF   Bib  
Discourse phenomena play a major role in text processing tasks. However, so far relatively little study has been devoted to the relevance of discourse phenomena for inference. Therefore, an experimental study was carried out to assess the relevance of anaphora and coreference for Textual Entailment (TE), a prominent inference framework. First, the annotation of anaphoric and coreferential links in the RTE-5 Search data set was performed according to a specifically designed annotation scheme. As a result, a new data set was created where all anaphora and coreference instances in the entailing sentences which are relevant to the entailment judgment are resolved and annotated. A by-product of the annotation is a new “augmented” data set, where all the referring expressions which need to be resolved in the entailing sentences are replaced by explicit expressions. Starting from the final output of the annotation, the actual impact of discourse phenomena on inference engines was investigated, identifying the kind of operations that the systems need to apply to address discourse phenomena and trying to find direct mappings between these operations and annotation types.

Shachar Mirkin, Roy Bar-Haim, Jonathan Berant, Ido Dagan, Eyal Shnarch, Asher Stern and Idan Szpektor. 2009. Addressing Discourse and Document Structure in the RTE Search Task. TAC (RTE-5 Search task submission report -- ranked 1st). PDF  Bib  
This paper describes Bar-Ilan University's submissions to RTE-5. This year we focused on the Search pilot, enhancing our entailment system to address two main issues introduced by this new setting: scalability and, primarily, document-level discourse. Our system achieved the highest score on the Search task amongst participating groups, and proposes first steps towards addressing this challenging setting.

Shachar Mirkin, Lucia Specia, Nicola Cancedda, Ido Dagan, Marc Dymetman and Idan Szpektor. 2009. Source-Language Entailment Modeling for Translating Unknown Terms. Proceedings of ACL-IJCNLP. PDF   Bib  
This paper addresses the task of handling unknown terms in SMT. We propose using source-language monolingual models and resources to paraphrase the source text prior to translation. We further present a conceptual extension to prior work by allowing translations of entailed texts rather than paraphrases only. A method for performing this process efficiently is presented and applied to some 2500 sentences with unknown terms. Our experiments show that the proposed approach substantially increases the number of properly translated texts.

Shachar Mirkin, Ido Dagan, Eyal Shnarch. 2009. Evaluating the Inferential Utility of Lexical-Semantic Resources. Proceedings of EACL. PDF  Bib entry   Ppt
Lexical-semantic resources are used extensively for applied semantic inference, yet a clear quantitative picture of their current utility and limitations is largely missing. We propose system- and application-independent evaluation and analysis methodologies for resources’ performance, and systematically apply them to seven prominent resources. Our findings identify the currently limited recall of available resources, and indicate the potential to improve performance by examining non-standard relation types and by distilling the output of distributional methods. Further, our results stress the need to include auxiliary information regarding the lexical and logical contexts in which a lexical inference is valid, as well as its prior validity likelihood.

Roy Bar-Haim, Jonathan Berant, Ido Dagan, Iddo Greental, Shachar Mirkin, Eyal Shnarch, Idan Szpektor. 2008. Efficient Semantic Deduction and Approximate Matching over Compact Parse Forests. Text Analysis Conference (TAC).  PDF   Bib
Semantic inference is often modeled as application of entailment rules, which specify generation of entailed sentences from a source sentence. Efficient generation and representation of entailed consequents is a fundamental problem common to such inference methods. We present a new data structure, termed compact forest, which allows efficient generation and representation of entailed consequents, each represented as a parse tree. Rule-based inference is complemented with a new approximate matching measure inspired by tree kernels, which is computed efficiently over compact forests. Our system also makes use of novel large-scale entailment rule bases, derived from Wikipedia as well as from information about predicates and their argument mapping, gathered from available lexicons and complemented by unsupervised learning.

Shachar Mirkin, Ido Dagan, Maayan Geffet. 2006. Integrating Pattern-Based and Distributional Similarity Methods for Lexical Entailment Acquisition. COLING-ACL.  PDF  Bib entry
This paper addresses the problem of acquiring lexical semantic relationships, applied to the lexical entailment relation. Our main contribution is a novel conceptual integration between the two distinct acquisition paradigms for lexical relations – the pattern-based and the distributional similarity approaches. The integrated method exploits mutual complementary information of the two approaches to obtain candidate relations and informative characterizing features. Then, a small training set is used to construct a more accurate supervised classifier, showing a significant increase in both recall and precision over the original approaches.

Shachar Mirkin. 2006. MSc Thesis. Integrating Pattern-Based and Distributional Similarity Methods for Lexical Entailment Acquisition. School of Computer Science and Engineering, the Hebrew University of Jerusalem.  PDF  Bib

Talks

Textual entailment inference in machine translation (with Ido Dagan). Workshop of Machine Translation and Morphologically-rich Languages. January 2011. [slides]

Incorporating Discourse Information within Textual Entailment Inference. Invited talk at the Institute of Formal and Applied Linguistics, Charles University, Prague. November 2010.

Context Models for Textual Entailment and their Application to Statistical Machine Translation. PASCAL2 Pump Priming and Thematic Programme Workshop, Bled, Slovenia. September 2009.

Evaluating the Inferential Utility of Lexical-Semantic Resources. BISFAI-09. June 2009.

Source-Language Entailment Modeling for Translating Unknown Terms. BISFAI-09. June 2009.

Introduction to Chinese NLP. The Eighth Annual Conference of Asian Studies in Israel. June 2009.

Textual Entailment. Xerox Research Centre Europe. June 2008.

Chinese Word Segmentation. The Fourth Annual Conference of Asian Studies in Israel. May 2005.

Resources

Academic Activities
  • Program Committee Member/Reviewer:
  • Coordinator of a Pascal-2 Pump Priming Project, carried out in collaboration with Xerox Research Centre Europe (XRCE), titled: Context Models for Textual Entailment and their Application to Statistical Machine Translation.
  • TA in the courses Intro to CS and Data Structures and Algorithms at The Hebrew University of Jerusalem, 2003-2005.

  • Education
    2011: PhD in Computer Science -- Natural Language Processing. Bar-Ilan University. Advisor: Prof. Ido Dagan.
    2006: MSc in Computer Science -- Natural Language Processing. The Hebrew University of Jerusalem. Advisors: Prof. Ido Dagan & Prof. Ari Rappoport.
    2000: BSc (Cum Laude) in Computer Science and East Asia Studies, The Hebrew University of Jerusalem.
    Where I've been working
    Amdocs, Intel, Interwise & ClearForest in Jerusalem, Beijing and (greater) Tel Aviv.
    For a more organized course of events see:

    View Shachar Mirkin's profile on LinkedIn

    (Natural) Language Learning Resources
    My own handy list of language learning links

  • French Google French  --  English-French Dictionary  --  French-English  --  Virgin Radio  --  Bleu Azur Radio  --  French accents
  • Japanese Japanese dictionary  --  JapanesePod101 (Beginners)  --  Hiragana Quiz  --  Katakana Quiz  --  Japanese telenovelas  --  Japanese-Flashcards  --  Japanit (Hebrew)  --  Verb conjugator
  • Italian Radio 105  -- English-Italian dictionary
  • Chinese ChinesePod  --  BBC Chinese  --  90.3 FM (Taiwan)  --  Chinese-English Dictionary  --  CE Dictionary (2) --  Chinese music
  • Arabic Ayalon-Shinar (איילון-שנער)  --  Voice of Israel (صوت اسرائيل) (IE only)  --  BBC in Arabic  --  English-Arabic dictionary
  • English WordWeb Online  -- Urban Dictionary  -- Etymology Dictionary  -- dictionary.com  -- Common errors
  • Other
  • Our lab members: Prof. Ido Dagan, Jonathan Berant, Eyal Shnarch, Naomi Zeichner, Lili Kotlerman, Asher Stern, Meni Adler, Erel Segal
  • Inbal Ravid / ענבל רביד
  • Azriel Mirkin / עזריאל מירקין
  • Rock Climbing: Israeli Climbers' Club (מועדון המטפסים הישראלי)  -- The Climbing Encyclopedia (אנציקלופדית הטיפוס)  -- videoclimb
  • My mirror site

  • שחר מירקין

    As of early 2013, I am in the middle of a postdoc at the Xerox research centre in Grenoble, France. I work mainly on machine translation, but also on other topics in natural language processing. In 2011 I completed a PhD in Computer Science at Bar-Ilan University, in the field of natural language processing, under the supervision of Prof. Ido Dagan.

    More details on the English page

    Contact me
    my email

    View Shachar Mirkin's profile on LinkedIn