Concepts are extracted from source documents during linguistic analysis. One of the basic relation types that make a thesaurus hierarchical is the “is a” relation. It expresses generalization by linking a narrower concept to a wider one.
Another relation type, named “referred to”, designates a reference between terms. It can be labelled with a verb (predicate) extracted from the source sentence; this verb should describe the nature of the relation unambiguously. The label can also be a word combination consisting of a verb and a noun: “is operation”, “is an attribute”.
Fourth International Conference I.TECH 2006
To prevent circular dependencies between terms in a thesaurus, the following rules must be obeyed:
- no term can be wider than itself, either directly or indirectly (this restricts the use of the “is a” relation);
- no term can be “referred to” a term that is wider or narrower than itself, either directly or indirectly.
The structure of the thesaurus can be represented by a graph (semantic net) whose nodes correspond to terms and whose arcs correspond to relations between terms. One set of arcs forms a directed acyclic graph for the generalization relation (“is a”). Another set forms a directed graph for the relation between referred terms (“referred to”). The relation types “is a” and “referred to” thus form two subgraphs of the semantic net.
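The two rules and the graph representation can be illustrated with a minimal sketch. The class name `Thesaurus` and its methods are hypothetical and are not part of the described system. The sketch checks each rule only at the moment a relation is added; a complete implementation would also have to re-check existing “referred to” relations whenever a new “is a” relation appears.

```python
from collections import defaultdict


class Thesaurus:
    """Terms are nodes; "is a" and "referred to" relations are two edge sets."""

    def __init__(self):
        self.is_a = defaultdict(set)         # narrower term -> directly wider terms
        self.referred_to = defaultdict(set)  # term -> {(label, referred term)}

    def wider_terms(self, term):
        """Return every term reachable from `term` along "is a" edges."""
        seen, stack = set(), [term]
        while stack:
            for wider in self.is_a.get(stack.pop(), ()):
                if wider not in seen:
                    seen.add(wider)
                    stack.append(wider)
        return seen

    def add_is_a(self, narrower, wider):
        # Rule 1: a term must never become wider than itself.
        if narrower == wider or narrower in self.wider_terms(wider):
            raise ValueError(f'"is a" cycle: {narrower} -> {wider}')
        self.is_a[narrower].add(wider)

    def add_referred_to(self, source, label, target):
        # Rule 2: terms linked by generalization must not refer to each other.
        if target in self.wider_terms(source) or source in self.wider_terms(target):
            raise ValueError(f"{source} and {target} are related by generalization")
        self.referred_to[source].add((label, target))


thesaurus = Thesaurus()
thesaurus.add_is_a("university", "educational institution")
thesaurus.add_referred_to("educational institution", "grant", "certificate")
```

Keeping the two relation types in separate edge sets mirrors the two subgraphs: only the “is a” set has to stay acyclic, while the “referred to” set may contain cycles.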
Principles of IES Operation
The thesaurus of the model should be populated and refreshed automatically. This raises certain difficulties in natural language text processing. To overcome them, the approach offers a multilevel text processing scheme that uses the relaxation method to eliminate ambiguities.
Text processing begins with the syntactic analyzer (figure 2). It performs syntactic analysis and decomposes compound sentences into simple propositions consisting of a subject, a predicate and an object.
The semantics of the sentence is not taken into account while these operations are performed. During decomposition, the text of the source documents is transformed into a set of simple statements (propositions) of the three listed types (attributive proposition, proposition with relation, existential proposition), which can then easily be subjected to semantic analysis.
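One possible in-memory form for such simple propositions is sketched below. The names `Kind` and `Proposition`, and the reading of each type given in the comments, are assumptions made for illustration; the system's actual data structures are not described here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Kind(Enum):
    ATTRIBUTIVE = "attributive"  # assumed reading: the subject has an attribute
    RELATION = "relation"        # assumed reading: subject, predicate, object
    EXISTENTIAL = "existential"  # assumed reading: the subject exists


@dataclass(frozen=True)
class Proposition:
    kind: Kind
    subject: str
    predicate: Optional[str] = None
    obj: Optional[str] = None

    def __post_init__(self):
        # A relation needs a predicate and an object;
        # an attributive proposition needs the attribute as its object.
        if self.kind is Kind.RELATION and not (self.predicate and self.obj):
            raise ValueError("a relation needs a predicate and an object")
        if self.kind is Kind.ATTRIBUTIVE and not self.obj:
            raise ValueError("an attributive proposition needs an attribute")


grant = Proposition(Kind.RELATION, "educational institution", "grant", "certificate")
exists = Proposition(Kind.EXISTENTIAL, "attestation")
```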
It is important to note that relations between concepts are not necessarily conveyed syntactically in the text. They can also be conveyed by the structure of the document. Two types of structural composition are used most frequently in documents: the table structure, which determines attributive relations, and the list structure, which determines relations of various kinds between the concept located in the list heading and the concepts located in the list lines.
To ensure the completeness of the analysis, it is necessary to extract the relations set by structures of the listed types. This task is performed by the structural analyzer, whose output, like that of the syntactic analyzer, consists of simple propositions reflecting relations between concepts. The analyzer generates them from structural information extracted from the source text.
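The output of a structural analyzer can be sketched as follows. The function names, the “has attribute” label and the assumption that a table's column headers name the attributes of one concept are hypothetical; how the relation kind of a list is determined is not specified, so it is passed in as a parameter.

```python
def propositions_from_table(concept, column_headers):
    """Assumed convention: each column header is an attribute of `concept`."""
    return [(concept, "has attribute", header) for header in column_headers]


def propositions_from_list(heading_concept, items, relation="referred to"):
    """Relate the concept in the list heading to the concept in each list line."""
    return [(heading_concept, relation, item) for item in items]


table_props = propositions_from_table("certificate", ["number", "issue date"])
list_props = propositions_from_list(
    "educational institution", ["school", "college"], relation="includes"
)
```

Both functions return the same triple form (subject, predicate, object), so their results can be passed to semantic analysis together with the propositions produced by syntactic analysis.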
The semantic analyzer receives the data processed by the syntactic and structural analyzers, processes them in turn and populates the system thesaurus with the prepared data. If the semantic analyzer finds an inconsistency in the source data caused by ambiguity or uncertainty, it can return to the previous level of processing (the syntactic or structural analyzer) and request another variant of the text interpretation. This idea accords with the principles of the relaxation method. Some missing branches of concept relations may also be derived from the existing thesaurus knowledge base.
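The interaction between the levels can be approximated by a simple retry loop. This is a deliberately simplified stand-in and not the relaxation method itself: the lower level is modelled as a hypothetical generator of alternative readings, and the semantic check is a hypothetical predicate that accepts a proposition only when both of its concepts are already known.

```python
def first_consistent_variant(variants, is_consistent):
    """Request variants one at a time; return the first fully consistent one."""
    for propositions in variants:
        if all(is_consistent(p) for p in propositions):
            return propositions
    return None  # every variant was rejected


def syntactic_variants():
    """Hypothetical lower-level analyzer yielding readings, most likely first."""
    yield [("institution", "grant", "persons certificates")]
    yield [("institution", "grant", "certificate"),
           ("certificate", "granted to", "person")]


known_concepts = {"institution", "certificate", "person"}


def both_ends_known(proposition):
    subject, _, obj = proposition
    return subject in known_concepts and obj in known_concepts


chosen = first_consistent_variant(syntactic_variants(), both_ends_known)
```

Because the variants come from a generator, the next reading is computed only when the previous one has been rejected, which corresponds to the semantic level addressing the previous level on demand.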
The semantic analyzer has one more task: to eliminate insignificant data. The final model should not be formed from the entire set of concepts and relations extracted from the initial documents. First, some concepts may only slightly touch the scope of the given problem domain. Second, errors in the extraction of concepts and relations may occur because of the specifics of the text or its author's verbosity. In either case, a mechanism is required that frees the user from dealing with many insignificant details. To achieve this, the semantic analyzer uses a special self-learning filter as a part of the project thesaurus. This filter determines a list of concepts that should not be included in the thesaurus. Relations of a special type, “not relevant”, may also be established between concepts in the thesaurus to solve the problem more effectively.
The filter is trained by tracking the actions the user performs when editing a diagram. In this way, insignificant concepts and relations in the problem domain can be identified and this knowledge can be used later.
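A very small sketch of such a filter is given below. The class name, the deletion counter and the threshold of two deletions are assumptions chosen for illustration; the actual training procedure is not specified in this text.

```python
from collections import Counter


class SelfLearningFilter:
    """Excludes concepts that the user has repeatedly removed from diagrams."""

    def __init__(self, threshold=2):
        self.threshold = threshold  # assumed: deletions needed before exclusion
        self.deletions = Counter()

    def record_deletion(self, concept):
        """The user removed `concept` from a diagram."""
        self.deletions[concept] += 1

    def record_restoration(self, concept):
        """The user added `concept` back, so earlier deletions are forgotten."""
        self.deletions.pop(concept, None)

    def is_excluded(self, concept):
        return self.deletions[concept] >= self.threshold

    def apply(self, propositions):
        """Drop (subject, predicate, object) triples mentioning an excluded concept."""
        return [p for p in propositions
                if not (self.is_excluded(p[0]) or self.is_excluded(p[2]))]


concept_filter = SelfLearningFilter()
concept_filter.record_deletion("paragraph")
concept_filter.record_deletion("paragraph")
kept = concept_filter.apply([
    ("paragraph", "contains", "statement"),
    ("educational institution", "grant", "certificate"),
])
```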
Knowledge Engineering
The approach offers one more important opportunity: distributing, “on a turn-key basis”, an IS design tool intended for use in a certain problem domain. Such a tool would come with a prepared thesaurus that establishes the set of basic concepts and relations, and with a trained semantic filter focused on the scope of the target problem domain. In the architecture of the IES being developed according to the suggested approach, this thesaurus is represented by a separate component called “Problem domain thesaurus” (figure 2).
The project thesaurus directly delivers the data needed to produce model diagrams. The structure and meaning of the thesaurus content allow it to be translated into a model diagram, even though there are some minor differences between the specifications of the diagrams that could be used within the approach: UML diagrams and ER diagrams.
Diagrams are displayed in an external modelling environment, which is connected to the IES through a special buffer module of the model image. The user may of course want to correct the model diagram initially generated by the system. Even so, the system continues to cooperate with the user and helps with the work.
On the user's request, the system can generate new fragments of the model diagram if new source documents have been obtained, or expand the model diagram to restore missing relations by applying knowledge from the problem domain thesaurus and the project thesaurus.
The system also tracks the user's actions during model diagram editing. Such a feedback mechanism is necessary to implement self-training of the problem domain thesaurus and the semantic filter. In effect, while editing the model diagram the user “teaches” the system, providing it with information about the concepts and relations that are of primary interest and those that should be excluded from consideration. In this way the problem domain thesaurus, which contains the most reliable and clean information on key concepts and the typical relations between them, is built. It is populated automatically during editing of the model diagram. Thus, the resulting model diagram and the successive modifications made by the user are also a source of information for the IES.
The system tries to recognize the semantics of the model diagrams, so the diagram the user works with is no longer a meaningless set of blocks and connections to the computer. The system analyzes the names of elements, their structure, their interfaces, and so on.
The objects and relations identified in a problem domain constitute a model. After the diagram is built, they remain linked to the texts in the source documents library. This gives the user an opportunity to check the correctness of the constructed model by verifying it directly against the key information in the source documents. Reverse referencing from source documents to elements of a model diagram is also needed, because documents are not immutable. The documents library is dynamic: precepts may be cancelled or changed in some points. Direct and reverse referencing between the source texts and the model makes efficient model updating possible.
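Direct and reverse referencing can be sketched as a pair of indexes that are always updated together. The class name `TraceIndex`, its methods and the document name used in the example are hypothetical.

```python
from collections import defaultdict


class TraceIndex:
    """Keeps links between model elements and source text fragments."""

    def __init__(self):
        self._sources = defaultdict(set)   # model element -> {(document, sentence no.)}
        self._elements = defaultdict(set)  # document -> {model element}

    def link(self, element, document, sentence_no):
        """Record that `element` was derived from a sentence of `document`."""
        self._sources[element].add((document, sentence_no))
        self._elements[document].add(element)

    def sources_of(self, element):
        """Direct reference: where in the documents does this element come from?"""
        return sorted(self._sources.get(element, ()))

    def elements_affected_by(self, document):
        """Reverse reference: which elements must be reviewed if the document changes?"""
        return sorted(self._elements.get(document, ()))


index = TraceIndex()
index.link("Certificate", "education_law.txt", 12)
index.link("Attestation", "education_law.txt", 12)
to_review = index.elements_affected_by("education_law.txt")
```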
Examples
We now give an example that demonstrates some aspects of the approach.
Please note that the approach is being developed for use with the Russian language, in which the mutual dependence of concepts in a sentence is expressed at the syntax level much less ambiguously than in English.
Let us show how a certain expression is going to be analyzed by the system:
“Educational institutions with government accreditation grant certificates to persons who passed attestation”.
During syntactic analysis, the given sentence is split into several connected simple statements, which can easily be represented by the semantic network shown in fig. 3.
Figure 3. Semantic Network Representing Sample Sentence Structure
The semantic analyzer classifies propositions 1, 2 and 3 as propositions with relations. Thus the verb predicate representing the action “grant” is interpreted by the semantic filter as an operation (class method). But let us assume that this interpretation is not yet known to the semantic filter.
After semantic analysis, the resulting simple propositions, which form the marked section of the semantic network, are passed to the problem domain thesaurus.
Such propositions with relations are represented in the model as objects connected by the “referred to” relation; the connection is directed from the subject to the object and represents the predicate (see fig. 4).
Figure 4. Model Framework Created on the Sentence
The part of the model obtained by analyzing the given sentence can be automatically attached to the existing model by a set of “is a” connections revealed by semantic comparison.
Besides that, if the problem domain thesaurus contains information about other connections between these objects and ones in the problem domain, these connections will also be restored in the model.
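The transition from propositions with relations to “referred to” connections can be sketched as follows. The three triples are an assumed, simplified decomposition of the sample sentence and do not reproduce the propositions of figure 3; the function name `to_model` and the dictionary keys are also hypothetical.

```python
# Assumed decomposition of the sample sentence (not taken from figure 3).
propositions = [
    ("educational institution", "has", "government accreditation"),
    ("educational institution", "grant", "certificate"),
    ("person", "passed", "attestation"),
]


def to_model(propositions):
    """Turn (subject, predicate, object) triples into objects and connections."""
    objects = set()
    connections = []
    for subject, predicate, obj in propositions:
        objects.update((subject, obj))
        connections.append({
            "from": subject,        # the connection starts at the subject
            "to": obj,              # and points to the object
            "type": "referred to",
            "label": predicate,     # the predicate labels the connection
        })
    return sorted(objects), connections


objects, connections = to_model(propositions)
```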
Let us now return to the requirement that the action “grant” be interpreted as a method.
If this does not happen automatically, the user manually creates the method “grant” in the object “Education institution”. Then, by semantically comparing the operation name assigned by the user with the text of the source sentence, the semantic filter is trained to interpret the verb “to grant” as a method (operation) in the future.
When it subsequently analyzes a similar text, the system should automatically add the corresponding object operation to the model. The thesaurus of the model is thus populated and refreshed automatically.
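This training step can be illustrated with a minimal sketch. The class `OperationFilter` and its methods are hypothetical, and the semantic comparison is reduced here to a case-insensitive match between the method name and the predicate; a real comparison for Russian text would at least require lemmatization.

```python
class OperationFilter:
    """Remembers which verbs the user has modelled as operations."""

    def __init__(self):
        self.operation_verbs = set()

    def learn(self, class_name, method_name, propositions):
        """Called after the user adds `method_name` to `class_name` by hand."""
        for subject, predicate, _ in propositions:
            if (subject.lower() == class_name.lower()
                    and predicate.lower() == method_name.lower()):
                self.operation_verbs.add(predicate.lower())

    def interpret(self, proposition):
        """Return (kind, subject, predicate, object); kind is "operation" or "referred to"."""
        subject, predicate, obj = proposition
        known = predicate.lower() in self.operation_verbs
        return ("operation" if known else "referred to", subject, predicate, obj)


source = [("Education institution", "grant", "certificate")]
verb_filter = OperationFilter()
before = verb_filter.interpret(source[0])
verb_filter.learn("Education institution", "grant", source)
after = verb_filter.interpret(("college", "grant", "diploma"))
```

Before training, the proposition is treated as an ordinary “referred to” relation; after the user's correction, the same verb in a similar text is interpreted as an operation.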
Authors' Information
Maxim Loginov – Perm State University, Assistant of the Department of Computer Science; PSU, 15, Bukirev St., Perm, 614990, Russia; e-mail: firstname.lastname@example.org
Alexander Mikov – Institute of Computing, Director; 40/1-28, Aksaiskaya St., Krasnodar, Russia; e-mail: email@example.com

TEXT-TO-TEXT MACHINE TRANSLATION SYSTEM (TTMT SYSTEM) – A NOVEL APPROACH FOR LANGUAGE ENGINEERING
Todorka Kovacheva, Koycho Mitev, Nikolay Dimitrov
Abstract: The purpose of the current paper is to present the methodology developed for automatic machine translation. As a result, the Text-to-Text Machine Translation System (TTMT System) is designed. The TTMT System is developed as a hybrid architecture that combines different machine translation approaches. The languages included in the base version are English, Russian, German and Bulgarian. The TTMT System is designed as a universal polyglot. The architecture of the system is highly modular and allows other languages to be included easily. It uses a digital script and method for communication.
Keywords: machine translation (MT), natural language processing (NLP), language engineering (LE), text-to-text translation (TTT), text-to-text machine translation system (TTMT System), universal polyglot, digital language.