Text repair based on n-gram language model

Closed job
kam
Employer

Job description

My problem is that I have some texts that contain mistakes or are missing some words. I would like a tool that repairs them automatically using an n-gram language model in ARPA format. The texts are in UTF-8, one sentence per line. The tool must run on Ubuntu Linux; I am using version 16.04.
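To illustrate the model input, here is a minimal sketch of reading an ARPA file into a Python dictionary. It assumes the usual ARPA layout (a `\data\` header, `\N-grams:` sections with lines of log10 probability, the words, and an optional back-off weight, then `\end\`). The function name `read_arpa` and the sample values are hypothetical, and a plain dictionary is only practical for small models.

```python
import io
import re

def read_arpa(lines):
    """Parse ARPA text into {ngram tuple: (log10 prob, log10 back-off)}."""
    model = {}
    order = 0  # 0 means we are outside any \N-grams: section
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        header = re.fullmatch(r"\\(\d+)-grams:", line)
        if header:
            order = int(header.group(1))
            continue
        if line.startswith("\\"):  # \data\ or \end\
            order = 0
            continue
        if order == 0:
            continue  # "ngram 1=..." count lines inside the \data\ block
        fields = line.split()
        logprob = float(fields[0])
        words = tuple(fields[1:1 + order])
        # The back-off weight is optional; a missing one counts as 0.0.
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else 0.0
        model[words] = (logprob, backoff)
    return model

# Invented sample data, only to show the expected layout.
SAMPLE = """\\data\\
ngram 1=4
ngram 2=2

\\1-grams:
-1.0\t<s>\t-0.3
-0.7\tAla\t-0.2
-0.9\tma\t-0.1
-1.2\t</s>

\\2-grams:
-0.2\t<s> Ala
-0.4\tAla ma

\\end\\
"""

model = read_arpa(io.StringIO(SAMPLE))
```

The same function accepts an open file object, for example one opened with `encoding="utf-8"`.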

The text may contain the following errors:

- a missing word or words in a sentence (marked with the unk symbol): e.g. "Ala ma unk i dwa psy." should be corrected to "Ala ma kota i dwa psy.". NOTE: there can be many unk symbols in one sentence.

- a wrong word form or inflection: e.g. "Ala kupił zielony samochód." should be "Ala kupiła zielony samochód."

- (very rare) a totally wrong word in the text: e.g. "Rąb ma pole o wymiarach." should be "Romb ma pole o wymiarach."
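For the second and third error types, the tool needs candidate replacements for each word before the language model can choose between them. How candidates are generated is not specified here; one possible approach, shown as a sketch only, is to take vocabulary words with similar spelling using Python's standard `difflib`. The function name `candidates`, the toy vocabulary, and the `cutoff=0.5` threshold are assumptions made for illustration.

```python
import difflib

def candidates(word, vocabulary, limit=5, cutoff=0.5):
    """Return `word` itself followed by vocabulary words with similar spelling."""
    close = difflib.get_close_matches(word, vocabulary, n=limit, cutoff=cutoff)
    return [word] + [w for w in close if w != word]

# Toy vocabulary; in practice it would be the unigram list of the model.
vocab = ["Ala", "kupiła", "zielony", "samochód", "Romb", "ma", "pole"]

inflection_candidates = candidates("kupił", vocab)
wrong_word_candidates = candidates("Rąb", vocab)
```

Keeping the original word as the first candidate lets the language model leave a correct word unchanged.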

All of this should be repaired using an n-gram language model (that is, a statistical model). Such a model stores a probability for each n-gram; these are usually conditional probabilities, as described here https://en.wikipedia.org/wiki/Language_model#n-gram_models and here https://www.cs.uni-potsdam.de/ml/publications/emnlp2005.pdf . In effect, the goal is to find the most probable sentence given the words that are present, work out the missing content, and insert it.
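The idea of picking the most probable sentence can be sketched as follows. This is an illustration under several assumptions, not a required design: the model is a toy bigram table with invented values, each unk is assumed to stand for exactly one word, unseen n-grams are scored with the standard ARPA back-off rule, and all fillers are tried exhaustively (which grows exponentially with the number of unk symbols, so a real tool would need a beam or Viterbi-style search). All names are hypothetical.

```python
import itertools

# Toy model: {ngram: (log10 prob, log10 back-off weight)}; values are invented.
MODEL = {
    ("<s>",): (-99.0, -0.3),
    ("</s>",): (-1.0, 0.0),
    ("Ala",): (-1.0, -0.3),
    ("ma",): (-1.0, -0.3),
    ("kota",): (-1.5, -0.3),
    ("psa",): (-1.6, -0.3),
    ("<s>", "Ala"): (-0.3, -0.2),
    ("Ala", "ma"): (-0.2, -0.2),
    ("ma", "kota"): (-0.4, -0.2),
    ("ma", "psa"): (-0.9, -0.2),
    ("kota", "</s>"): (-0.3, 0.0),
    ("psa", "</s>"): (-0.3, 0.0),
}
ORDER = 2            # highest n-gram order in the toy model
OOV_LOGPROB = -10.0  # assumed penalty for words missing from the model

def word_logprob(word, context):
    """log10 P(word | context), backing off to shorter contexts when needed."""
    ngram = context + (word,)
    if ngram in MODEL:
        return MODEL[ngram][0]
    if not context:
        return OOV_LOGPROB
    backoff = MODEL.get(context, (0.0, 0.0))[1]  # unseen context: weight 0
    return backoff + word_logprob(word, context[1:])

def sentence_logprob(words):
    """Sum of conditional log10 probabilities over the padded sentence."""
    padded = ["<s>"] + words + ["</s>"]
    total = 0.0
    for i in range(1, len(padded)):
        context = tuple(padded[max(0, i - ORDER + 1):i])
        total += word_logprob(padded[i], context)
    return total

def fill_unk(words, vocabulary):
    """Try every filler for every unk slot and keep the best-scoring sentence."""
    slots = [i for i, w in enumerate(words) if w == "unk"]
    best, best_score = None, float("-inf")
    for choice in itertools.product(vocabulary, repeat=len(slots)):
        trial = list(words)
        for i, w in zip(slots, choice):
            trial[i] = w
        score = sentence_logprob(trial)
        if score > best_score:
            best, best_score = trial, score
    return best

repaired = fill_unk(["Ala", "ma", "unk"], ["kota", "psa", "Ala", "ma"])
```

The same scoring function could rank spelling or inflection candidates: substitute each candidate in turn and keep the sentence with the highest total.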

I will use 3-gram, 4-gram, 5-gram, 6-gram and 7-gram models. The tool should work with every one of these models and base its results on all the n-grams available in the given model.

Here is a sample n-gram model that you can use for testing: https://mega.nz/#!th83lRII!ZS4Mr3NOxrMst941yQXpDuobHcA6yUgJJKMNu9DUBJE

This is a very small LM. The tool should be fast and should work in parallel, because the input files can be hundreds of GB in size.
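Because the input is one sentence per line, one possible way to parallelise (a sketch, not a requirement) is to cut the file into byte ranges that end on line breaks and give each range to a separate worker process. The names `split_offsets`, `process_range`, `repair_line` and `run_parallel`, and the `.out` file naming, are hypothetical; `repair_line` is a placeholder that returns the line unchanged.

```python
import os
from multiprocessing import Pool

def split_offsets(path, parts):
    """Return (start, end) byte ranges that cover the file and end on line breaks."""
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as f:
        for k in range(1, parts):
            f.seek(size * k // parts)
            f.readline()  # move forward to the start of the next full line
            bounds.append(min(f.tell(), size))
    bounds.append(size)
    bounds = sorted(set(bounds))  # drop duplicates produced by very long lines
    return list(zip(bounds[:-1], bounds[1:]))

def repair_line(line):
    """Placeholder for the real repair step; returns the line unchanged."""
    return line

def process_range(args):
    """Repair all lines in one byte range and write them to a separate file."""
    path, start, end = args
    out_path = f"{path}.{start}.out"
    with open(path, "rb") as src, \
            open(out_path, "w", encoding="utf-8", newline="") as dst:
        src.seek(start)
        while src.tell() < end:
            line = src.readline().decode("utf-8")
            dst.write(repair_line(line))
    return out_path

def run_parallel(path, workers):
    """Process the ranges in worker processes; returns output paths in order."""
    ranges = [(path, s, e) for s, e in split_offsets(path, workers)]
    with Pool(workers) as pool:
        return pool.map(process_range, ranges)
```

Calling `run_parallel` with an input path and a worker count returns the partial output files in input order, so concatenating them gives the repaired text. Each worker reads only its own range, so memory use does not depend on the file size.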