Natural Language Processing with Probabilistic Model(Word probabilities, Dynamic programming, Minimum edit distance, Autocorrect)

sabankara
7 min readApr 17, 2023

--

Autocorrect

Autocorrect in natural language processing (NLP) refers to a feature or algorithm that automatically corrects or suggests corrections for errors or mistakes in text input. Autocorrect is commonly used in applications such as messaging, word processing, and search engines to help users correct spelling errors, typos, and grammatical mistakes.

Autocorrect in NLP typically involves several steps, including:

  1. Error detection: The NLP algorithm identifies potential errors in the input text, such as misspelled words, repeated letters, or other common mistakes.
  2. Candidate generation: The algorithm generates a list of possible corrections for each error by considering nearby words, context, and common language patterns.
  3. Candidate ranking: The algorithm ranks the generated corrections based on factors such as frequency of occurrence, similarity to the input word, and context.
  4. Correction suggestion: The top-ranked correction for each error is presented to the user as a suggestion, which can be automatically applied or manually selected by the user.
  5. User feedback: The autocorrect system may also collect feedback from users, such as when a correction is accepted or rejected, to improve its accuracy over time.

Autocorrect in NLP is typically trained on large datasets of text data, including correct and incorrect text pairs, to learn patterns and make accurate corrections. It may also incorporate machine learning techniques, such as rule-based systems, statistical models, or deep learning algorithms, to improve its accuracy and adapt to different languages, dialects, and writing styles.

Building the model

1. Identify the misspelled word

When identifying the misspelled word, it can be checked whether it is in the vocabulary. If it’s not found, it’s probably a typo.

2. Find strings n edit distance away

3. Filter candidates

In this step, you want to take all the words created above and then store only the actual words that are meaningful and that you can find in the vocabulary.

4. Calculating word probabilities

Note that you store the word count and then you can use that to build probabilities. If you want to create a slightly more complex autocorrect, you can follow two words going side by side instead. You can use the previous word to decide later. For example, which combination is more likely, friend or their friend?

Minimum edit distance

Minimum edit distance allows you to:

  • Evaluate similarity between two strings
  • Find the minimum number of edits between two strings
  • Implement spelling correction, document similarity, machine translation, DNA sequencing, and more

Remember that the edits include:

  • Insert (add a letter) ‘to’: ‘top’, ‘two’ …
  • Delete (remove a letter) ‘hat’: ‘ha’, ‘at’, ‘ht’
  • Replace (change 1 letter to another) ‘jaw’: ‘jar’, ‘paw’, …

Here is a concrete example where we calculate the cost (i.e. edit distance) between two strings.

The larger your strings, the more difficult it becomes to calculate the minimum edit distance. Therefore, the minimum editing distance algorithm must now be used!

Minimum edit distance algorithm

The Minimum Edit Distance (MED) algorithm, also known as Levenshtein distance or editing distance, is a technique commonly used in natural language processing (NLP) to calculate the similarity or difference between two sequences. It measures the number of single-character edits (insertions, deletions, or substitutions) needed to convert one string to another.

The MED algorithm is widely used in various NLP tasks such as spell checking, DNA sequence alignment, and text similarity measurement. It can be implemented using dynamic programming, which is an efficient way of calculating the MED between two sequences.

When computing the minimum edit distance, you would start with a source word and transform it into the target word. Let’s look at the following example:

To go from #→# you need a cost of 0. From p→# you get 1, because that is the cost of a delete. ps is 2 because that is the minimum cost one could use to get from p to s. You can keep going this way by populating one element at a time, but it turns out there is a faster way to do this!

There are three equations:

  • D[i,j] = D[i-1, j] + del_cost: this indicates you want to populate the current cell (i,j) by using the cost in the cell found directly above.
  • D[i,j] = D[i, j-1] + ins_cost: this indicates you want to populate the current cell (i,j) by using the cost in the cell found directly to its left.
  • D[i,j] = D[i-1, j-1] + rep_cost: the rep cost can be 2 or 0 depending if you are going to actually replace it or not.

At every time step you check the three possible paths where you can come from and you select the least expensive one. Once you are done, you get the following:

To summarize, you have seen the levenshtein distance which specifies the cost per operation. If you need to reconstruct the path of how you got from one string to the other, you can use a backtrace. You should keep a simple pointer in each cell letting you know where you came from to get there. So you know the path taken across the table from the top left corner, to the bottom right corner. You can then reconstruct it.

This method for computation instead of brute force is a technique known as dynamic programming. You first solve the smallest subproblem first and then reusing that result you solve the next biggest subproblem, saving that result, reusing it again, and so on. This is exactly what you did by populating each cell from the top right to the bottom left. It’s a well-known technique in computer science!

Here’s a basic pseudocode for the Minimum Edit Distance algorithm:

function minimum_edit_distance(str1, str2):
m = length of str1
n = length of str2
Initialize a 2D array D with dimensions (m+1) x (n+1) and set D[0][0] = 0

for i from 1 to m:
D[i][0] = i

for j from 1 to n:
D[0][j] = j

for i from 1 to m:
for j from 1 to n:
if str1[i-1] == str2[j-1]:
cost = 0
else:
cost = 1

D[i][j] = minimum of (D[i-1][j] + 1), (D[i][j-1] + 1), (D[i-1][j-1] + cost)

return D[m][n]

In this pseudocode, str1 and str2 are the input strings for which the MED needs to be calculated. The algorithm uses a 2D array D to store the intermediate results. The outer loop iterates over the characters of str1, and the inner loop iterates over the characters of str2. The algorithm calculates the cost of each edit operation (insertion, deletion, or substitution) and stores the minimum cost in D[i][j]. Finally, the value at D[m][n] represents the minimum edit distance between str1 and str2.

The MED algorithm has a time complexity of O(mn), where m and n are the lengths of the input strings, and a space complexity of O(mn) as well, as it requires storing the intermediate results in a 2D array. However, there are optimized versions of the algorithm that can reduce the space complexity to O(min(m, n)) if memory usage is a concern.

Referance

Younes Bensouda Mourri, Natural Language Processing with Classification and Vector Spaces

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute these slides for commercial purposes. You may make copies of these slides and use or distribute them for educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see

--

--

sabankara
sabankara

Written by sabankara

I am pushing the limits of technology by using my imagination in the field of NLP. I am constantly improving myself by producing new ideas.

No responses yet