The Spanish Corpus of the 21st Century

•June 18, 2008

In 1995, when the Royal Spanish Academy started the CREA and CORDE project, these were the tools that were needed. The Spanish language needed a series of corpuses on the net, for the sake of getting noticed by the rest of the world and helping the people who were learning the language. The creation of both corpuses also changed the way of working of the technical teams of the Royal Spanish Academy and of the other academies of the Spanish language. In general, it changed the way of working of any institution or group that was investigating the language.

However, the design of the two corpuses has become outdated for today’s needs. Having noticed this, the Royal Spanish Academy has proposed the creation of a new Spanish corpus, with 25 million lexical forms for each year between 2000 and 2011. The proposal was made in Medellín in March, during the Academies’ Congress. This Spanish Corpus of the 21st Century will have, in its first stage, 300 million forms (12 years at 25 million forms each), which guarantees that new editions of the DRAE will have the empirical foundation they require. The proposal was approved unanimously.

The design of the webpage, the control, acquisition and encoding of the texts, and the construction of the software tools that will allow the corpus to be searched from anywhere in the world require a considerable investment and the work of a large group of people. The creation of the Spanish Corpus of the 21st Century is going to be distributed among different teams that will report to a central team. The central team would be formed by five or eight members of the Royal Spanish Academy, who would coordinate the rest of the teams. The Academy would be responsible for the design, the selection of materials, the establishment of the encoding system, the training of the other teams, the supervision of their work, etc. The teams would encode texts from different linguistic zones.

The Spanish Corpus of the 21st Century is a project that fits perfectly with the different purposes of the Royal Spanish Academy, the Spanish Academies Association and the Santander bank. It means building a better and larger resource, centered on the knowledge of the Spanish language and based on Internet tools.

Corpus Typology

•June 16, 2008

There are different parameters for classifying corpuses. These parameters are:

1.- Language form: written, oral.
2.- The number of languages to which the works collected belong.
3.- The size and quantity of the texts that make up the corpus.
4.- Whether the corpus is open or closed, that is, whether it is for external use or not.
5.- Linguistic variety or the degree of specialization of the texts.
6.- The time period to which the texts belong.
7.- The processing applied to the texts.

In relation to language, there are different types of corpuses. We can name and describe these types as follows:

- Monolingual Corpus: Made up of texts in a single language, with no mixing of languages. The texts are collected with the aim of showing the language and its linguistic variety.

- Bilingual or Multilingual Corpus: Made up of texts in two or more languages. The texts are not necessarily translations and they don’t share selection criteria.

- Comparative Corpus (paired texts): Made up of a selection of texts in more than one language or linguistic variety that share similar characteristics and selection criteria. This type of corpus is used especially for comparative work.

- Parallel Corpus (bi-texts): This type of corpus collects texts in more than one language but, unlike the previous types, the texts are translations of one another. That is, the text is the same but in different languages. The simplest example of this corpus is just a text and its translation into another language. This type of corpus is very useful in machine translation and in bilingual or multilingual settings.

- Aligned Corpus: To simplify its use, this type of corpus shows the texts we are going to compare next to each other. This helps in extracting the differences between the texts we are analyzing.
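To make the idea of a parallel and aligned corpus a bit more concrete, here is a small Python sketch. It is only an illustration under my own assumptions: the sentences are invented, the names `AlignedPair`, `align_by_position` and `side_by_side` are hypothetical, and pairing sentences simply by position is much more naive than what real alignment tools do.

```python
from dataclasses import dataclass


@dataclass
class AlignedPair:
    """One sentence and its translation, kept together."""
    source: str
    target: str


def align_by_position(source_sentences, target_sentences):
    """Pair sentences one-to-one by their position.

    Assumption: both texts have the same number of sentences in the
    same order. Real parallel texts often break this assumption.
    """
    if len(source_sentences) != len(target_sentences):
        raise ValueError("Both texts must have the same number of sentences")
    return [AlignedPair(s, t) for s, t in zip(source_sentences, target_sentences)]


def side_by_side(pairs, width=28):
    """Format each pair on one line, source on the left, target on the right."""
    return [f"{pair.source:<{width}} | {pair.target}" for pair in pairs]


# Invented example: a tiny Spanish text and its English translation.
spanish = ["Hoy llueve mucho.", "Mañana saldrá el sol."]
english = ["It is raining a lot today.", "Tomorrow the sun will come out."]

pairs = align_by_position(spanish, english)
for line in side_by_side(pairs):
    print(line)
```

Each printed line shows one sentence next to its translation, which is the "one next to the other" layout described for aligned corpuses.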

I have obtained all the information about the different corpuses and the way in which experts classify them from the webpage of the University of León. On this webpage, we can find more parameters for recognizing different types of corpuses. I have only mentioned the types that relate to languages, but corpuses are also differentiated by the quantity, proportion and distribution of the texts in each corpus, by their representativeness, or by the processing to which the corpus is subjected.

The webpage is in Spanish.

CREA and CORDE

•June 14, 2008

CREA and CORDE are the two corpuses created by the Royal Spanish Academy. Each of them compiles a different use of the Spanish language. The first is a collection of examples of how Spanish is used in the present, while the second is a diachronic collection of how it was used in the past.

CREA was created in 1995 to help linguists and people who were interested in the evolution of the Spanish language and its use today. As I said in the previous paragraph, CREA is a collection of data on the use of Spanish nowadays. If you are looking for the use of any phrase or term, for example “Hoy día”, you can search for it in CREA by writing the phrase in the box labeled “consulta”. Then, if you want, you can choose different selection criteria such as authorship, date or medium. You can also narrow the search by choosing a theme. Then you click on the “buscar” button and there you go. Now you have the results, with the possibility of refining the selection further and picking out different examples.
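The kind of search described above (a phrase plus optional criteria such as author, date, medium or theme) can be imitated in a few lines of Python. This is only a sketch under my own assumptions: it runs on three invented documents in memory, the names `Document` and `search` are hypothetical, and it has nothing to do with how the CREA website is really built.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """A text with the metadata used as selection criteria."""
    author: str
    year: int
    medium: str  # for example "press" or "book"
    theme: str
    text: str


def search(documents, phrase, author=None, year_range=None, medium=None, theme=None):
    """Return the documents that contain the phrase and match every given criterion.

    A criterion left as None is ignored, like an empty field in a search form.
    """
    phrase_lower = phrase.lower()
    results = []
    for doc in documents:
        if phrase_lower not in doc.text.lower():
            continue
        if author is not None and doc.author != author:
            continue
        if year_range is not None and not (year_range[0] <= doc.year <= year_range[1]):
            continue
        if medium is not None and doc.medium != medium:
            continue
        if theme is not None and doc.theme != theme:
            continue
        results.append(doc)
    return results


# Invented documents, only for illustration.
documents = [
    Document("Author A", 1998, "press", "politics", "Hoy día la política cambia deprisa."),
    Document("Author B", 2001, "book", "science", "La ciencia avanza hoy día sin descanso."),
    Document("Author C", 2001, "book", "science", "La ciencia no se detiene."),
]

all_hits = search(documents, "hoy día")
book_hits = search(documents, "hoy día", medium="book")
print(len(all_hits), len(book_hits))
```

Adding a criterion can only reduce the number of results, which is the same effect as narrowing the search by medium or theme in the form.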

Then we have CORDE, which was created soon after CREA. Its main aim is to serve as a collection of the use of Spanish in the past. It is a corpus in which you can find the different uses that any phrase or term in Spanish has had throughout the history of the language. CORDE works exactly like CREA. It has the same search structure, and it gives you the same possibilities.

The creation of both corpuses has been a good idea for the sake of the Spanish language and its linguists. Both are used by thousands of people and both are really complete. I would recommend them to anyone who is interested in Spanish or who has to do any work related to Spanish and its use.

Mark Davies

•June 4, 2008

Mark Davies is the creator, sponsored by BYU and NEH, of several applications for corpuses of different languages. He has helped create online applications for the Corpus del Español, the British National Corpus, the Corpus of American English and even the Corpus do Português. He has also created the OED Corpus of Historical English and the TIME Corpus. We can say that he is a man who has done a lot for corpora on the net and who has created a lot of useful tools for linguists and people in general. But who is Mark Davies, and what exactly do his projects consist of?

First of all, we must say that Davies is a university teacher. He is Professor of Corpus Linguistics in the Department of Linguistics and English Language at Brigham Young University (BYU) in Utah, United States of America. He was previously Professor of Spanish Linguistics at Illinois State University. As we can find on his webpage at BYU, his primary research areas and activities include corpus and computational linguistics, the design and optimization of linguistic databases, historical linguistics and syntactic variation. That is, his work is always dedicated to connecting language with the Internet.

As I said previously, he has helped in the creation of several corpus applications in different languages. These are useful tools that work exactly like a normal corpus. They are less complete than the corpuses of the national institutions of each country. That is, the Corpus del Español of Davies/BYU/NEH is less complete than the two corpuses created by the Royal Spanish Academy: CREA and CORDE. But anyway, Davies’ corpora are also very useful for people in general. You can take a look at some comparisons between the British National Corpus and Davies’ application of the British National Corpus, and between CREA and Davies’ Corpus del Español, on the wiki page of our Language Resource class.

Finally, I will only say that people like Mark Davies are making the full expansion of the Internet possible. By creating databases such as corpuses of different languages, they are helping humanity. They are creating tools that can help everyone from linguists to students.

Text Corpus

•June 2, 2008

This semester we have been working with some Internet tools called corpuses (or corpora, the other plural form). There are a lot of these linguistic resources on the net, and in class we had to make a PowerPoint presentation about some of them. We could choose the corpus we were going to use, and we had to publish our presentation/work on the Wiki. It was difficult work, because, first of all, we didn’t know anything about corpuses or what they are. Now, after looking through the net, we can more or less explain what a corpus is.

Explaining it briefly, a corpus is a compilation of words together with their contexts of use. That is, it is a collection of real examples of the use of a language. Normally, the use of the language is shown through examples from literary texts or articles (written), but some corpuses also have an oral collection for some terms or phrases of the language. There are corpuses specialized in different languages, and each of them has a different number of examples of any given phrase or term in its language.
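The idea of "words with contexts of use" is easy to see with a small concordance, which lists every occurrence of a phrase together with the text on its left and on its right. The Python sketch below is only an illustration: the two sentences are invented, the function name `concordance` and the `window` parameter are my own choices, and real corpus tools work on millions of texts with much richer options.

```python
import re


def concordance(texts, phrase, window=30):
    """Find each occurrence of a phrase and keep its left and right context.

    Returns a list of (left_context, match, right_context) tuples.
    The match is case-insensitive; `window` is the number of characters
    of context kept on each side.
    """
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    hits = []
    for text in texts:
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - window):match.start()]
            right = text[match.end():match.end() + window]
            hits.append((left, match.group(0), right))
    return hits


# Invented example sentences, standing in for real corpus texts.
texts = [
    "Hoy día casi todo el mundo usa Internet.",
    "No es fácil, hoy día, vivir sin teléfono.",
]

hits = concordance(texts, "hoy día")
for left, match, right in hits:
    print(f"{left:>30} [{match}] {right}")
```

Each line of the output is one real example of the phrase in its context, which is what a corpus search gives back on a much larger scale.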

For my project, I have been working with Spanish corpora. My team and I chose two different corpuses of the Spanish language: the Real Academia Española’s CREA and Mark Davies and NEH’s application of the Spanish Corpus. These two corpuses offered us a number of examples for each phrase we chose, and the result is published on our wiki.

Finally, I only want to recommend the use of corpora if you are interested in finding the different uses of a phrase throughout the history of a certain language or in its present. Corpora are fantastic tools for a linguist, but if you only want to learn the vocabulary of a certain language, or how to say something in that language, I would recommend dictionaries instead.

European Language Resources Association (ELRA)

•February 28, 2008

ELRA is the short name of the European Language Resources Association, which was established as a non-profit organisation in Luxembourg in 1995, as the official webpage of ELRA tells us. Its main aim is “to make available the language resources for language engineering and to evaluate language engineering technologies”.

ELRA tries to achieve its goal by being active in several areas within the broad field of Language Resources. ELRA is “active in identification, distribution, collection, validation, standardisation, improvement, in promoting the production of language resources, in supporting the infrastructure to perform evaluation campaigns and in developing a scientific field of language resources and evaluation”. These activities are carried out by ELDA, the operational body of the European Language Resources Association.

The association’s mission is to promote language resources for Human Language Technologies (HLT) and to evaluate language engineering technologies. To achieve this mission, ELRA provides several services, such as the identification and distribution of language resources, as I said in the previous paragraph. But the association also publishes a newsletter and carries out market studies in the field of HLT. In doing this, ELRA is helping to promote and expand Human Language Technologies.

ELRA grew out of RELATOR, a European Commission project that brought together a large number of key players and a consortium of researchers working in the nine official languages of the European Union (EU). The main achievement of this project was the creation of ELRA.

Finally, I recommend that the interested reader visit the official webpage of the association. There, we can learn about ELRA’s and ELDA’s history, about the services they provide, etc. It is a very interesting webpage if you’re interested in the topic of Language Technologies.

Welcome to Jhedo’s WordPress

•February 26, 2008

Hello everybody!!!

I have to confess that I made this blog because the teacher told me to. I have a Livejournal address and a Fotolog, so I don’t have time to start another blog. I will probably use this WordPress account only to publish the articles for university.

Bye, bye loves!
