Problem: given a relatively large collection of parallel texts and a state-of-the-art SMT system, how can we incrementally and automatically mine the parallel texts available on the Web? The newly added texts should improve the current SMT system.
1) Large Scale Parallel Document Mining for Machine Translation. COLING 2010.
Crowd-sourcing has emerged as a new method for obtaining annotations for training models for machine learning. While many variants of this process exist, they largely differ in their methods of motivating subjects to contribute and the scale of their applications. To date, there has yet to be a study that helps the practitioner to decide what form an annotation application should take to best reach its objectives within the constraints of a project. To fill this gap, we provide a faceted analysis of crowdsourcing from a practitioner’s perspective, and show how our facets apply to existing published crowdsourced annotation applications. We then summarize how the major crowdsourcing genres fill different parts of this multi-dimensional space, which leads to our recommendations on the potential opportunities crowdsourcing offers to future annotation efforts.
WebAnnotator is a new tool for annotating Web pages implemented at LIMSI. Giving it a try will take you no more than 10 minutes.
WebAnnotator is implemented as a Firefox extension, allowing annotation of both offline and online pages. The HTML rendering is fully preserved, and all annotations consist of new HTML spans with specific styles.
WebAnnotator provides an easy, general-purpose framework and is available under the CeCILL free license (close to the GNU GPL), which keeps use and further contributions simple.
All parts of an HTML document can be annotated: text, images, videos, tables, menus, etc. The annotations are created by simply selecting a part of the document and clicking on the relevant type and subtypes. The annotated elements are then highlighted in a specific color. Annotation schemas can be defined by the user by creating a simple DTD representing the types and subtypes that must be highlighted. Finally, annotations can be saved (HTML with highlighted parts of documents) or exported (in a machine-readable format).
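Since annotations are stored as HTML spans inside the saved page, they can in principle be recovered with any HTML parser. The sketch below shows one way to do this in Python with the standard library. The attribute names `data-wa-type` and `data-wa-subtype` are hypothetical placeholders chosen for illustration, not WebAnnotator's actual markup; inspect a saved page to find the real attributes before adapting it.

```python
from html.parser import HTMLParser


class AnnotationExtractor(HTMLParser):
    """Collect (type, subtype, text) for every annotated <span>.

    The attribute names are assumptions for illustration only;
    they are not taken from WebAnnotator's real output format.
    """

    TYPE_ATTR = "data-wa-type"        # hypothetical attribute name
    SUBTYPE_ATTR = "data-wa-subtype"  # hypothetical attribute name

    def __init__(self):
        super().__init__()
        self.annotations = []
        # One entry per currently open <span>; None marks a span
        # that carries no annotation attribute.
        self._stack = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attr = dict(attrs)
        if self.TYPE_ATTR in attr:
            self._stack.append({
                "type": attr[self.TYPE_ATTR],
                "subtype": attr.get(self.SUBTYPE_ATTR),
                "text": [],
            })
        else:
            self._stack.append(None)

    def handle_data(self, data):
        # Text belongs to every annotated span that is currently open,
        # so nested annotations each receive their full content.
        for entry in self._stack:
            if entry is not None:
                entry["text"].append(data)

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        entry = self._stack.pop()
        if entry is not None:
            text = "".join(entry["text"]).strip()
            self.annotations.append((entry["type"], entry["subtype"], text))


def extract_annotations(html):
    """Return the annotations found in an HTML string, in closing order."""
    parser = AnnotationExtractor()
    parser.feed(html)
    parser.close()
    return parser.annotations
```

Calling `extract_annotations` on the contents of a saved page returns a list of tuples that can then be written out as CSV or JSON. Spans without the assumed type attribute are ignored, so the page's own styling spans do not show up as annotations.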
WebAnnotator will be presented at the LREC conference in May 2012.