MINERVA EC - Handbook on cultural web user interaction / Chapter 2

Handbook on cultural web user interaction
First edition (September 2008)
edited by MINERVA EC Working Group "Quality, Accessibility and Usability"

2.7 Users in the globalised world: multilingualism issues

There is an increasing awareness of the role that multilingualism plays in making the digital cultural heritage of Europe available to all users. Language is one of the most significant barriers to website access and, because of this barrier, large parts of the European digital cultural heritage cannot be found on the Internet.

The problem is complex. There are nearly 7,000 known languages spoken throughout the world6; approximately 2,200 of them also have writing systems, but just 300 have some kind of language processing tools. The European Union alone currently has 23 official languages, and many more languages are actually in everyday use in Europe. Despite this global multilinguality, the English language still tends to dominate the Internet, although to a lesser degree than in the past. It is clear that if national languages are to be preserved for the future, multilingual access points must be provided.

Trend towards multilingual web (source: http://global-rearch.biz/globstats/evol.html)

In the information society, the acquisition and dissemination of information in digital form must transcend language boundaries: if the Web is to be used for knowledge dissemination and acquisition, its content must be available in many languages. Information providers and seekers should have equal opportunities, regardless of the language which they prefer.

When we talk about access to information without language or cultural barriers we mean that certain functionality must be guaranteed: it must be possible to find information in foreign languages, to read and interpret that information and to merge it with information in other languages.

Research on Multilingual Information Access (MLIA) thus focuses on the storage, access, retrieval and presentation of information in any of the world’s languages.

There are two main areas of interest:
• multiple language access, which addresses the enabling technologies for browsing and display, such as character encoding, support for the specific requirements of particular languages and scripts, and internationalization and localization;
• cross-language information discovery and retrieval (CLIR), which addresses the problem of querying in one language a collection containing documents in many other languages, of filtering, selecting, and ranking the retrieved documents and of presenting the resulting information in an interpretable and exploitable fashion.
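
As a small illustration of why character encoding is an enabling technology for multiple language access, the following sketch (an illustrative example using only the Python standard library, not taken from the handbook) shows two common pitfalls: visually identical strings that differ at the code point level, and text that becomes garbled when its bytes are decoded with the wrong charset. The sample word is an arbitrary assumption.

```python
import unicodedata


def normalise(text: str) -> str:
    """Return the NFC form so that visually identical strings compare equal."""
    return unicodedata.normalize("NFC", text)


# "é" as one precomposed code point vs. "e" followed by a combining accent.
composed = "caf\u00e9"
decomposed = "cafe\u0301"

same_before = composed == decomposed                        # code points differ
same_after = normalise(composed) == normalise(decomposed)   # equal after NFC

# The same word maps to different byte sequences in different encodings,
# so a page has to declare the encoding it actually uses.
utf8_bytes = composed.encode("utf-8")
latin1_bytes = composed.encode("latin-1")

# Decoding UTF-8 bytes with the wrong charset yields garbled text.
garbled = utf8_bytes.decode("latin-1")
```

Normalising text before indexing and declaring the page encoding explicitly are two inexpensive measures that avoid both problems.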

The main (although certainly not the only) problem when building a CLIR system is matching the user query against the document collection. To do this, both queries and documents must be pre-processed and indexed, generally using language-dependent techniques (tokenisation, stopword removal, stemming, morphological analysis, decompounding, etc.). Various approaches are adopted, generally involving the translation of the queries, of the documents, or of both. Systems that cater for many languages may use an interlingua or pivot language. Translation resources include Machine Translation (MT), parallel or comparable corpora, bilingual dictionaries, multilingual thesauri and conceptual interlinguas. The most successful systems often combine more than one translation resource.
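
The query translation approach described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any particular system: the stopword lists, the English-to-Italian lexicon `EN_TO_IT` and the three sample documents are all hypothetical toy data, and stemming and morphological analysis are omitted for brevity.

```python
import re

# Hypothetical toy resources; real systems rely on much larger lexicons.
STOPWORDS = {
    "en": {"the", "of", "a", "in"},
    "it": {"il", "la", "di", "un", "in"},
}
EN_TO_IT = {
    "painting": ["dipinto", "pittura"],
    "renaissance": ["rinascimento"],
}


def tokenise(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def preprocess(text, lang):
    """Tokenise and drop the stopwords of the given language."""
    return [t for t in tokenise(text) if t not in STOPWORDS[lang]]


def translate_query(terms, lexicon):
    """Replace each term by all of its dictionary translations.

    Terms missing from the lexicon (e.g. proper names) are kept unchanged.
    """
    translated = []
    for term in terms:
        translated.extend(lexicon.get(term, [term]))
    return translated


def build_index(docs, lang):
    """Build an inverted index: term -> set of document identifiers."""
    index = {}
    for doc_id, text in docs.items():
        for term in preprocess(text, lang):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, terms):
    """Rank documents by the number of query terms they contain."""
    scores = {}
    for term in terms:
        for doc_id in index.get(term, ()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


italian_docs = {
    "d1": "Un dipinto del Rinascimento di Botticelli",
    "d2": "La pittura in Italia",
    "d3": "Il museo di Firenze",
}
index_it = build_index(italian_docs, "it")
query_terms = translate_query(
    preprocess("the painting of the Renaissance Botticelli", "en"), EN_TO_IT
)
results = search(index_it, query_terms)
```

Keeping every candidate translation ("dipinto" and "pittura") raises recall at the cost of precision, which is one reason why dictionary lookup is usually combined with other translation resources.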

The main CLIR difficulties involve the correct handling of language identification, morphology, proper names, terminology, multi-word concepts, phrases and idioms, ambiguity and polysemy. In particular, processing many languages simultaneously, merging results from different sources and media, and presenting the results in a fashion appropriate for the specific user are challenging issues for which satisfactory solutions are still being investigated.
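
To make the merging problem concrete, the sketch below shows one simple baseline: the raw scores of each result list are rescaled to a common range before the lists are interleaved. It is an illustrative assumption rather than a recommended solution, and the document identifiers and scores are invented for the example.

```python
def min_max_normalise(results):
    """Map the raw scores of one result list to the range [0, 1]."""
    if not results:
        return {}
    scores = list(results.values())
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores are identical, so every document gets the top value.
        return {doc: 1.0 for doc in results}
    return {doc: (s - lo) / (hi - lo) for doc, s in results.items()}


def merge(result_lists):
    """Merge several result lists after rescaling each one separately."""
    merged = {}
    for results in result_lists:
        for doc, score in min_max_normalise(results).items():
            merged[doc] = max(merged.get(doc, 0.0), score)
    # Sort by descending score; ties are broken by document identifier.
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical hits from two indexes whose raw scores are not comparable.
english_hits = {"en:12": 14.0, "en:7": 9.0, "en:3": 4.0}
italian_hits = {"it:5": 0.875, "it:8": 0.5, "it:2": 0.125}

ranking = merge([english_hits, italian_hits])
```

Without the rescaling step every English hit would outrank every Italian one simply because the two indexes score on different scales.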

Interactive CLIR systems can help users to locate and identify relevant foreign-language documents, by formulating and translating the query or by query re-formulation, browsing/navigating results and/or identifying relevant documents.

Providing multilingual retrieval for a mixed media collection is a non-trivial problem. Different media are processed in different ways and suffer from different kinds of indexing errors: spoken documents are indexed using speech recognition, handwritten documents are indexed using OCR, and image collections use feature-based indexing. Retrieval in such cases implies a complex integration of multiple technologies.

In any case, implementing Multilingual Information Access functionality is complex and involves issues at a number of levels. For multilingual portals, it is necessary to decide how many languages should be catered for, how many levels of the site should be multilingual, and how updates should be handled. For monolingual search in a multiple language context, encoding and representation issues (language identification and indexing issues, such as stop words, stemmers, morphological analysers, named entity recognition, etc.) must be addressed. For cross-language search, appropriate translation resources must be acquired, maintained and updated regularly. Finally, the results must be presented in a form which is interpretable and exploitable by users.
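
Language identification, the first of the indexing issues listed above, can be illustrated with a deliberately naive sketch that counts how many stopwords of each language occur in a text. The stopword lists are hypothetical and far too small for real use, and this counting heuristic is only an assumption made for illustration; production systems typically rely on statistical models trained on much more data.

```python
# Hypothetical, heavily truncated stopword lists.
STOPWORDS = {
    "en": {"the", "of", "and", "in", "is"},
    "it": {"il", "di", "e", "la", "che"},
    "de": {"der", "die", "und", "das", "ist"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    words = text.lower().split()
    counts = {
        lang: sum(word in stopwords for word in words)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


italian = guess_language("La storia di Roma e il suo museo")
german = guess_language("Die Kunst und das Museum")
unknown = guess_language("Botticelli")
```

Once the language is known, the document can be passed to the matching stopword list, stemmer and index; a text with no evidence at all, such as a lone proper name, is better left unclassified than guessed.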

MLIA for cultural heritage raises the same problems, but systems need fine tuning with respect to the specific terminology and media involved and to the specific user profile (see 2.4).

2.7.1 A case study: the MultiMatch project

On the web, cultural heritage (CH) content is everywhere, in traditional environments such as libraries, museums, galleries and audiovisual archives, but also in popular magazines and newspapers, in multiple languages and multiple media.

The MultiMatch Search Engine is a first attempt to provide a complete and integrated solution for searching CH content. It supports the retrieval of cultural objects through different modalities:
• Free text search. This search mode is similar to that provided by general purpose search engines, such as Google, with the difference that MultiMatch is expected to provide more precise results, since information is acquired from selected sources containing CH data, and to support multilingual searches.
• Multimedia search, based on similarity matching and on automatic information extraction techniques.
• Metadata-based search, where the user can select one of the available indexes built for a specific metadata field and can specify the value of the metadata field (e.g. the creator’s name) plus possible additional terms.
• Browsing, which allows users to navigate the MultiMatch collection using a web directory-like structure based on the MultiMatch ontology.

Concerning multilingual functionality in MultiMatch, users can formulate queries in a given language and retrieve results in one or all languages covered by the prototype (English, Italian, Spanish, Dutch, German, and Polish) according to their preferences. Six separate monolingual index files are maintained.

Cross-language searches are performed by a combination of machine translation and domain-specific dictionary components. Users can select the source and the target languages as well as the most appropriate translation among those proposed by the system.

The domain-specific lexicon has been built up by deriving CH vocabulary from appropriate multilingual corpora and in particular from Wikipedia. In addition to the separate monolingual index files, a single multilingual index file, created by translating all incoming documents into English, is maintained to facilitate multilingual searches. Incoming queries in any language can be translated into English and submitted to this index. Retrieval performance is enhanced by the use of thesaurus expansion and relevance feedback.
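
The arrangement described above, with separate monolingual indexes alongside a single English index built from translated documents, can be sketched roughly as follows. This is not the MultiMatch implementation: the `Collection` class, the `to_english` function and its tiny word table are hypothetical stand-ins for the project's indexing and machine translation components, and thesaurus expansion and relevance feedback are left out.

```python
LANGUAGES = ["en", "it", "es", "nl", "de", "pl"]

# Hypothetical stand-in for a machine translation component.
TO_ENGLISH = {
    "ritratto": "portrait",
    "retrato": "portrait",
    "portret": "portrait",
    "museo": "museum",
    "muzeum": "museum",
}


def to_english(text):
    """Translate word by word into English; unknown words are kept as-is."""
    return " ".join(TO_ENGLISH.get(w, w) for w in text.lower().split())


class Collection:
    """Six monolingual indexes plus one English pivot index."""

    def __init__(self):
        self.monolingual = {lang: {} for lang in LANGUAGES}
        self.pivot = {}

    @staticmethod
    def _add(index, doc_id, text):
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)

    @staticmethod
    def _lookup(index, query):
        hits = set()
        for word in query.lower().split():
            hits |= index.get(word, set())
        return sorted(hits)

    def add_document(self, doc_id, lang, text):
        # Index the original text, then its English translation.
        self._add(self.monolingual[lang], doc_id, text)
        self._add(self.pivot, doc_id, to_english(text))

    def search_monolingual(self, lang, query):
        return self._lookup(self.monolingual[lang], query)

    def search_all_languages(self, query):
        # Translate the query into English and search the pivot index.
        return self._lookup(self.pivot, to_english(query))


collection = Collection()
collection.add_document("it-1", "it", "ritratto di Dante")
collection.add_document("es-1", "es", "retrato de Cervantes")
collection.add_document("pl-1", "pl", "muzeum narodowe")

same_language = collection.search_monolingual("it", "ritratto")
all_languages = collection.search_all_languages("portret")
```

In this sketch a Dutch query ("portret") reaches Italian and Spanish documents through the pivot index, while the monolingual index answers queries that should stay within one language.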

1 Usability.net, ISO 13407: Human centred design processes for interactive systems, <http://www.usabilitynet.org/tools/13407stds.htm>.
2 Lorenzo Cantoni, Nicoletta Di Blas, Davide Bolchini, Comunicazione, qualità, usabilità, Milano: Apogeo, 2003, p. 33.
3 Lorenzo Cantoni, Nicoletta Di Blas, Davide Bolchini, Comunicazione, qualità, usabilità, op. cit., p. 47.
4 For the text of this directive and other relevant legislation consult the site of the European Data Protection Supervisor, <http://www.edps.europa.eu/EDPSWEB/edps/lang/en/pid/17>.
5 <http://www.w3.org/P3P/>.
6 See <http://www.multimatch.eu>. The consortium, whose coordinator is Pasquale Savino (savino@isti.cnr.it), is composed of Istituto di Scienza e Tecnologie dell’Informazione, University of Sheffield, Dublin City University, University of Amsterdam, University of Geneva, Universidad Nacional de Educación a Distancia, OCLC, WIND Telecomunicazioni S.p.A., Cultural Heritage, Fratelli Alinari Istituto Edizioni Artistiche SpA, Netherlands Institute for Sound and Vision, University of Alicante – Biblioteca Virtual Miguel de Cervantes.

Copyright Minerva Project 2008-09, last revision 2008-09-19, edited by Minerva Editorial Board. URL: www.minervaeurope.org/publications/handbookwebusers/chapter2/chapter2_7.html