I am working on a story generator, and I wanted a quick way to ensure that the names generated for characters are not too close, to avoid potential confusion.

I thought the best way to do this was to use n-grams: contiguous sequences of n items from a given sample of text or speech.
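To make the definition concrete, here is a minimal sketch in Python that lists the character n-grams of a name. The function name char_ngrams is only an illustration and is not taken from the SimiliStrings code.

```python
def char_ngrams(text, n):
    """Return every contiguous run of n characters in text, left to right."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A text shorter than n has no n-grams, so the range below is empty.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("anna", 2))
```

For the name "anna", the 2-grams are "an", "nn" and "na".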

My algorithm works in the following manner:

  • Create a temporary array to allocate word scores (call it wordScores or something similar)
  • Iterate over all the letters of the first word in a pair of words
  • Create another temporary array for character scores (characScores)
  • Using the index of this letter in the first word, get the letters surrounding that index in the second word
  • If a surrounding letter is equal to the letter of the first word, append a distance-adjusted score to characScores. Otherwise, append a 0 or continue.
  • Once all the surrounding letters are checked, get the max value from characScores and append it to wordScores
  • Empty characScores and repeat for every letter
  • Once all loops have ended, return the sum of wordScores divided by its length; this will be your similarity score

This will give you a similarity value bounded between 0 and 1 inclusive, provided the scores you assign within the window are themselves between 0 and 1.
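The steps above can be sketched roughly as follows in Python. This is not the SimiliStrings code itself: the function name name_similarity, the window_scores argument and its default values (1.0 for a match at the same index, 0.5 for a match one position away) are assumptions made for the example.

```python
def name_similarity(first, second, window_scores=(1.0, 0.5)):
    """Score how closely `second` matches `first`, letter by letter.

    window_scores[d] is the (assumed) score for a matching letter found
    d positions away from the current index; values should be in [0, 1].
    """
    if not first:
        return 0.0  # assumption: an empty first word matches nothing

    word_scores = []
    for i, letter in enumerate(first):
        # Start with 0 so a letter with no match in the window scores 0.
        charac_scores = [0.0]
        for distance, score in enumerate(window_scores):
            # Look on both sides of index i (the set collapses distance 0).
            for j in {i - distance, i + distance}:
                if 0 <= j < len(second) and second[j] == letter:
                    charac_scores.append(score)
        # Keep only the best match found for this letter.
        word_scores.append(max(charac_scores))

    # Average over the letters of the first word.
    return sum(word_scores) / len(word_scores)


if __name__ == "__main__":
    for pair in [("anna", "anna"), ("ab", "ba"), ("abc", "xyz")]:
        print(pair, name_similarity(*pair))
```

With these default scores, identical names score 1.0, "ab" against "ba" scores 0.5, and names with no shared letters inside the window score 0.0. Because the average is taken over the letters of the first word, the sketch is not symmetric: swapping the two names can change the result.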

I implemented a Parameters class that lets you specify the scores for your window and makes it easy to change the window size. Depending on your intended purpose, you might have to change the window's size and the default scores.
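As an illustration only, a parameters object of this kind could look something like the sketch below. The class name WindowParameters, its scores field, the window_size property and the linear helper are all hypothetical and are not the actual Parameters class from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowParameters:
    """Hypothetical container: scores[d] is the score at distance d."""

    scores: tuple = (1.0, 0.5)

    def __post_init__(self):
        if not self.scores:
            raise ValueError("at least one score is required")
        if any(not 0.0 <= s <= 1.0 for s in self.scores):
            raise ValueError("scores must lie between 0 and 1")

    @property
    def window_size(self):
        # The window covers the centre letter plus each side.
        return 2 * len(self.scores) - 1

    @classmethod
    def linear(cls, reach):
        """Build scores that fall off linearly up to `reach` letters away."""
        return cls(tuple(1 - d / (reach + 1) for d in range(reach + 1)))


if __name__ == "__main__":
    params = WindowParameters.linear(3)
    print(params.scores, params.window_size)
```

A wider window tolerates letters that have shifted further between two names, at the cost of flagging more pairs as similar, so the right values depend on how strict the check needs to be.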

The code is available on my GitHub: SimiliStrings.