
Simple metrics for TextMining

The European research project ROBUST (Risk and Opportunity management of huge-scale BUSiness communiTy cooperation), carried out a few years ago by several universities and companies such as IBM and SAP, produced some interesting research.

I developed and wrote one of its chapters, “Simple Metrics for TextMining”.
I thought it might be useful to bring it up again.

The original paper is available online:

Complexity

Measuring the complexity of a text can be useful for several purposes. Possible scenarios are grading foreign-language texts for language learners, grading literature for children and adults, or simply measuring the readability of a given text.
The simpler a text, the more it resembles spoken language. The main characteristic of spoken language in comparison to written language is its simplicity.
Sentences tend to be shorter, words are mostly common words, the grammar is easier, there are fewer subordinate clauses, and so on.
In 1971 the Swedish scholar Carl Hugo Björnsson developed a formula to indicate the readability of a text. It relies on two assumptions: simple texts consist of short sentences and of short words.
Readability is expressed by a value called LIX:

LIX(text) = TotalWords/Sentences + (LongWords x 100)/TotalWords

Here, LongWords is the number of words with more than six letters. The outcome is usually a value between 20 and 70. The formula is language independent and therefore very useful in a multilingual environment. The only requirement is that the language is written with letters.
Since only a few values have to be calculated, and the only pre-processing required is sentence splitting and tokenisation, the metric is cheap to compute and well suited to large amounts of data.
We have evaluated LIX on a small German corpus of newspaper texts.
The corpus consists of 402 newspaper articles annotated with metadata.
The average LIX value of all articles is 19.80.

Two sub-corpora were built. The first is called ART and consists of 78 articles from the magazine ART, which is used as an example of complex language. Its average LIX value is 24.80. The second consists of 88 articles from the newspaper “Hamburger Morgenpost”, which serves as an example of simple language.
Its average LIX value is 17.16. The ordering of the two sub-corpora by LIX agrees with the gold standard classification: the complex texts score higher than the simple ones.
The LIX values are very low overall, which may result from the very basic pre-processing in the test script.
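
To make the LIX formula concrete, here is a minimal Python sketch. It is not the test script used for the evaluation above. The function name `lix`, the regex-based sentence splitting and tokenisation, and the parameter `long_word_length` are assumptions made for illustration. Long words are counted as words with more than six letters, which is the usual LIX definition.

```python
import re


def lix(text, long_word_length=6):
    """Compute the LIX readability value of a text.

    Assumption: sentences end at '.', '!' or '?', and a word is any
    run of letters. This is deliberately naive pre-processing.
    """
    # Naive sentence splitting on runs of sentence-final punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive tokenisation: runs of letters (no digits, no underscores).
    words = re.findall(r"[^\W\d_]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    # Long words are those with more than `long_word_length` letters.
    long_words = sum(1 for w in words if len(w) > long_word_length)
    return len(words) / len(sentences) + 100 * long_words / len(words)


sample = "The cat sat. The extraordinary elephant wandered away."
print(lix(sample))  # 8 words, 2 sentences, 3 long words -> 41.5
```

Splitting on punctuation alone treats abbreviations and dates as sentence boundaries, which shortens the measured sentences and lowers the score. A proper sentence splitter would be the first thing to replace in a real setup.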

Another way to evaluate complexity measurements is to compare articles from the English Wikipedia with the corresponding articles in the Simple English Wikipedia.

Measuring informativity with CFR (Content Function Ratio)

How to measure the informativity of a text depends on how informativity is defined. Here, a text is informative when it contains many pieces of information that a reader can capture without knowing the context or the author. For example, “He is my best friend” is less informative than “Max Mustermann has a best friend called Martin Muster”.

A simple metric for the informativity of a given text is the ratio of content words to non-content (function) words. Content words are nouns, proper nouns, verbs, and adjectives. Some definitions also include adverbs and some prepositions, but a test showed that including those was not useful. The content function ratio (CFR) is calculated from the part-of-speech tags like this:

CFR(text) = AmountOfContentWordTags / AmountOfFunctionWordTags
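
As an illustration, the ratio can be computed from part-of-speech tagged tokens in a few lines of Python. This is a sketch under stated assumptions, not the original implementation. The function name `cfr`, the use of Universal POS tag names, and the decision to ignore punctuation and symbol tags are all assumptions. The example sentences are tagged by hand.

```python
# Assumption: Universal POS tag names. Content words are nouns,
# proper nouns, verbs, and adjectives, as defined in the text.
CONTENT_TAGS = {"NOUN", "PROPN", "VERB", "ADJ"}
# Assumption: punctuation, symbols, and unknown tokens are not words
# and are left out of both counts.
IGNORED_TAGS = {"PUNCT", "SYM", "X"}


def cfr(tagged_tokens):
    """Content function ratio of a list of (word, tag) pairs."""
    content = sum(1 for _, tag in tagged_tokens if tag in CONTENT_TAGS)
    function = sum(
        1
        for _, tag in tagged_tokens
        if tag not in CONTENT_TAGS and tag not in IGNORED_TAGS
    )
    if function == 0:
        raise ValueError("no function words: the ratio is undefined")
    return content / function


# Hand-tagged versions of the two example sentences.
less_informative = [
    ("He", "PRON"), ("is", "AUX"), ("my", "PRON"),
    ("best", "ADJ"), ("friend", "NOUN"),
]
more_informative = [
    ("Max", "PROPN"), ("Mustermann", "PROPN"), ("has", "VERB"),
    ("a", "DET"), ("best", "ADJ"), ("friend", "NOUN"),
    ("called", "VERB"), ("Martin", "PROPN"), ("Muster", "PROPN"),
]

print(cfr(less_informative))  # 2 content / 3 function words
print(cfr(more_informative))  # 8 content / 1 function word -> 8.0
```

In practice the tags would come from a part-of-speech tagger, and the tag set has to be mapped onto the content and function categories accordingly.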

We ran a test using the same corpora as in the previous section.
In the gold standard, ART was classified as very informative and “Hamburger Morgenpost” as not very informative:

CFR(whole corpus) = 3.30
CFR(ART) = 3.26
CFR(Hamburger Morgenpost) = 2.79

The results again correspond to the gold standard classification.
We made another observation: there seems to be a correlation between the informativity of a text and its subjectivity.
Very informative texts tend to be very objective, and less informative texts tend to be very subjective. This might result from the linguistic features involved. Further tests should be done to confirm this observation. The CFR metric can be used for measuring the informativity and maybe even the subjectivity of a text.
