Projects

Imprint Extraction

The aim of this project was to extract addresses, phone numbers, e-mail addresses and names from German company websites. Imprints are a semi-structured source of information, so the framework combined dictionaries, regular expressions, a parser and heuristics to extract imprints reliably and save them in a database. The framework worked very well. Unfortunately, the startup is no longer operational.
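The regular-expression part of such a pipeline might look roughly like the sketch below. It is not the original framework: the patterns, the function name `extract_imprint_fields` and the sample imprint are assumptions made for illustration, and real imprints would also need the dictionaries and heuristics mentioned above.

```python
import re

# Illustrative patterns only; a production system would need many more variants.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# German postal code (five digits) followed by a capitalized city name on the same line.
ZIP_CITY_RE = re.compile(r"\b(\d{5})[ \t]+([A-ZÄÖÜ][\w-]+(?:[ \t][A-ZÄÖÜ][\w-]+)*)")
# Phone number introduced by a label such as "Tel.", "Telefon" or "Fon".
PHONE_RE = re.compile(r"(?:Tel(?:efon)?\.?|Fon)\s*:?\s*(\+?[\d\s()/-]{6,}\d)")


def extract_imprint_fields(text):
    """Return the first e-mail address, postal code, city and phone number found in the text."""
    email = EMAIL_RE.search(text)
    zip_city = ZIP_CITY_RE.search(text)
    phone = PHONE_RE.search(text)
    return {
        "email": email.group(0) if email else None,
        "zip": zip_city.group(1) if zip_city else None,
        "city": zip_city.group(2) if zip_city else None,
        # Collapse any run of whitespace inside the phone number to a single space.
        "phone": " ".join(phone.group(1).split()) if phone else None,
    }


# Hypothetical imprint text used only to demonstrate the patterns.
SAMPLE_IMPRINT = (
    "Musterfirma GmbH\n"
    "Musterstraße 12\n"
    "10115 Berlin\n"
    "Telefon: +49 30 1234567\n"
    "E-Mail: info@musterfirma.de"
)
fields = extract_imprint_fields(SAMPLE_IMPRINT)
```

Patterns like these only cover the easy cases; names and unusual layouts are where the dictionaries and heuristics would come in.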

Ad-Recommendation

The aim of the project was to show ads that relate to the content of newspaper articles, so that it looked as if the journalist had picked the products by hand. The input was the most important and descriptive part of a newspaper article, its title, together with an index of 30-60 million products with titles and descriptions. Only keywords, mainly nouns, were extracted from the title using a POS tagger. Those keywords were translated into queries against the large Elasticsearch index. The best-matching products were selected and displayed automatically. It worked well when the title was very descriptive and contained concrete nouns.
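The step from tagged title to product query could be sketched as follows. This is an assumption-laden illustration, not the original code: the title is assumed to be already POS-tagged, nouns are assumed to carry a tag starting with "N", and the field names `title` and `description` as well as both function names are hypothetical.

```python
def noun_keywords(tagged_title):
    """Keep tokens whose POS tag marks a noun (assumed here: tag starts with 'N')."""
    return [token for token, tag in tagged_title if tag.startswith("N")]


def build_product_query(keywords, size=3):
    """Build an Elasticsearch search body matching the keywords against product fields.

    The field names are assumptions; the title is boosted over the description.
    """
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": " ".join(keywords),
                "fields": ["title^2", "description"],
            }
        },
    }


# Hypothetical tagged article title.
tagged = [("New", "ADJ"), ("camera", "NOUN"), ("beats", "VERB"), ("smartphone", "NOUN")]
keywords = noun_keywords(tagged)
query_body = build_product_query(keywords)
```

The resulting dictionary could be sent as the body of a search request against the product index.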

simpLibri

The aim of this project was to automate the creation of easy readers that help foreign language learners understand the vocabulary of books they are interested in. Unknown words were translated and added as footnotes. Any eBook (HTML, PDF, EPUB) could be transformed into a simpLibri based on the learner's individual language level (A1-C2). The output was an enriched eBook with additional translations. The basic vocabulary was learned automatically using statistics, and the bilingual dictionary was extracted from open sources such as dict.cc and Wiktionary.
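The footnote step might be sketched as below, assuming the learner's known vocabulary and the bilingual dictionary are already available as a set and a mapping. The function name `annotate`, the footnote format and the sample data are hypothetical and not taken from the original tool.

```python
import re


def annotate(text, known_words, dictionary):
    """Mark words the learner does not know and collect their translations as footnotes."""
    footnotes = []
    numbers = {}  # lowercased word -> footnote number already assigned

    def mark(match):
        word = match.group(0)
        key = word.lower()
        # Leave words alone if the learner knows them or no translation exists.
        if key in known_words or key not in dictionary:
            return word
        if key not in numbers:
            numbers[key] = len(footnotes) + 1
            footnotes.append(f"[{numbers[key]}] {key}: {dictionary[key]}")
        return f"{word}[{numbers[key]}]"

    # [^\W\d_]+ matches runs of letters, including accented ones.
    annotated = re.sub(r"[^\W\d_]+", mark, text)
    return annotated, footnotes


# Hypothetical learner vocabulary and dictionary entries.
annotated, notes = annotate(
    "The cat chased the squirrel.",
    known_words={"the", "cat"},
    dictionary={"chased": "jagte", "squirrel": "Eichhörnchen"},
)
```

In this sketch, the language level would determine the contents of `known_words`: a lower level means a smaller set and therefore more footnotes.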

Badaparola

The aim of this project was to create a tool that finds new words in Italian and selects the most interesting ones. Every day, the tool downloaded and “read” newspaper articles from five important Italian newspapers. New word candidates were identified by checking each word against a large reference dictionary and were saved with metadata in a database. To find the most relevant and interesting words, several statistics and metrics were used to score the potential neologisms.
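A minimal sketch of the candidate search is shown below. The score used here (frequency multiplied by the number of distinct sources) is only a stand-in assumption, since the actual statistics and metrics are not described above; the function name and sample data are hypothetical as well.

```python
import re
from collections import Counter


def find_candidates(articles, reference_dictionary):
    """Return (word, score) pairs for words missing from the reference dictionary.

    `articles` is a list of (source, text) pairs. The score is a placeholder:
    total frequency multiplied by the number of distinct sources.
    """
    counts = Counter()
    sources = {}
    for source, text in articles:
        for token in re.findall(r"[^\W\d_]+", text.lower()):
            if token not in reference_dictionary:
                counts[token] += 1
                sources.setdefault(token, set()).add(source)
    scored = [(word, counts[word] * len(sources[word])) for word in counts]
    # Highest score first; ties are broken alphabetically.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Tiny hypothetical corpus and reference dictionary.
articles = [
    ("a", "il governo e il petaloso fiore"),
    ("b", "un fiore petaloso e un apericena"),
]
reference = {"il", "governo", "e", "fiore", "un"}
candidates = find_candidates(articles, reference)
```

A word that shows up in several newspapers ranks above one seen only once, which is one plausible signal among many for a real neologism.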

Newspaper Article Recommendation

The aim of the project was to find newspaper articles similar to the one the user was reading and recommend them to the user. This was implemented using Elasticsearch. For each publisher, there was an index containing all of its newspaper articles. Finding related articles is straightforward with the built-in “MoreLikeThis” query (`more_like_this`). The input to that query is the content of the current newspaper article, and the result is a list of related articles from the index, scored by relevance. Internally, keywords are extracted from the content (using TF-IDF) and translated into queries. This works well and in real time, and such a recommendation engine is quick and easy to set up.
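A request body for the “MoreLikeThis” query could be built roughly as follows. The field name `content`, the parameter values and the function name are assumptions for illustration; the original configuration is not known.

```python
def build_related_articles_query(article_text, size=5):
    """Build an Elasticsearch search body that looks for articles similar to the given text."""
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": ["content"],   # assumed name of the article body field
                "like": article_text,    # free text to compare against
                "min_term_freq": 1,      # consider terms that occur at least once in the input
                "max_query_terms": 25,   # cap on the number of extracted keywords
            }
        },
    }


related_query = build_related_articles_query("Text of the article the user is reading")
```

The dictionary would be sent as the body of a search request to the publisher's index; in practice the article currently being read would also need to be filtered out of the results.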

Search Engine for Foreign Language Learners

Foreign language learners want to read content they are interested in. To be understandable, a text needs to be easy enough for their language level (A2-C2). They therefore need a special search engine that contains good-quality text graded by language level. There are different ways to implement this. We used the LIX readability metric, which was developed by a Swedish pedagogue. In the backend, we indexed and graded books and texts from Project Gutenberg, Wikipedia, etc. The result is an enriched search engine in which language learners can find original reading material matching their language level and interests.
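LIX is the average sentence length in words plus the percentage of long words, where a long word has more than six letters. A minimal sketch of the computation is given below; the sentence splitting on ".", "!" and "?" is a simplification, and the mapping from LIX scores to language levels is not shown because it is not described here.

```python
import re


def lix(text):
    """Compute the LIX readability score: words per sentence plus percentage of long words."""
    words = re.findall(r"[^\W\d_]+", text)
    # Simplified sentence segmentation: split on runs of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)


# 10 words, 2 sentences, 2 long words ("beautiful", "wonderful").
score = lix("I am here now. We see big beautiful wonderful cats.")
```

Higher scores indicate harder text, so each indexed document could store its score and the search could filter by a score range per language level.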