Here is a short function that strips diacritics and lowercases the result, but keeps non-Latin characters. Most cases (e.g., "à" -> "a") are handled by unicodedata (standard library), but several (e.g., "æ" -> "ae") have no Unicode decomposition and rely on the given parallel strings.
Code
from unicodedata import combining, normalize

LATIN = "ä  æ  ǽ  đ ð ƒ ħ ı ł ø ǿ ö  œ  ß  ŧ ü "
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue"

def remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):
    return "".join(c for c in normalize("NFD", s.lower().translate(outliers)) if not combining(c))

NB. The default argument outliers is evaluated once and not meant to be provided by the caller.
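To see why some letters need the translation table, it can help to look at what NFD normalization actually produces. The sketch below uses a hypothetical helper, decompose(), which is not part of the function above; it only lists each code point of the decomposed form with its canonical combining class.

```python
from unicodedata import combining, name, normalize

def decompose(ch):
    """Return (character name, combining class) pairs for the NFD form of ch."""
    return [(name(c), combining(c)) for c in normalize("NFD", ch)]

# "à" (U+00E0) splits into a base letter plus a combining mark (class 230).
print(decompose("\u00e0"))
# [('LATIN SMALL LETTER A', 0), ('COMBINING GRAVE ACCENT', 230)]

# "æ" (U+00E6) has no canonical decomposition, so nothing can be filtered out.
print(decompose("\u00e6"))
# [('LATIN SMALL LETTER AE', 0)]
```

Characters whose combining class is nonzero are the ones dropped by the `if not combining(c)` filter; letters such as "æ" or "ø" come through unchanged unless the table rewrites them first.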
Intended usage
As a key to sort a list of strings in a more “natural” order:
sorted(['cote', 'coteau', "crottez", 'crotté', 'côte', 'côté'], key=remove_diacritics)
Output:
['cote', 'côte', 'côté', 'coteau', 'crotté', 'crottez']
If your strings mix texts and numbers, you may be interested in composing remove_diacritics() with the function string_to_pairs() I give elsewhere.
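Since string_to_pairs() is not shown here, the following is only a rough sketch of what such a composition might look like. Both fold() (a simplified stand-in for remove_diacritics(), without the translation table) and natural_key() are hypothetical names introduced for illustration, and the splitting strategy is an assumption, not the behavior of string_to_pairs().

```python
import re
from unicodedata import combining, normalize

def fold(s):
    # Simplified stand-in for remove_diacritics(): lowercase, decompose, drop marks.
    return "".join(c for c in normalize("NFD", s.lower()) if not combining(c))

def natural_key(s):
    # Hypothetical helper: re.split() with a capturing group alternates
    # text, digits, text, ... so odd positions hold the digit runs,
    # which are converted to int and compared numerically.
    parts = re.split(r"(\d+)", fold(s))
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

titles = ["Épisode 10", "episode 9", "Épisode 2"]
print(sorted(titles, key=natural_key))
# ['Épisode 2', 'episode 9', 'Épisode 10']
```

With a plain string key, "Épisode 10" would sort before "Épisode 2"; comparing the digit runs as integers avoids that.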
Tests
To make sure the behavior meets your needs, take a look at the pangrams below:
examples = [
    ("hello, world", "hello, world"),
    ("42", "42"),
    ("你好,世界", "你好,世界"),
    (
        "Dès Noël, où un zéphyr haï me vêt de glaçons würmiens, je dîne d’exquis rôtis de bœuf au kir, à l’aÿ d’âge mûr, &cætera.",
        "des noel, ou un zephyr hai me vet de glacons wuermiens, je dine d’exquis rotis de boeuf au kir, a l’ay d’age mur, &caetera.",
    ),
    (
        "Falsches Üben von Xylophonmusik quält jeden größeren Zwerg.",
        "falsches ueben von xylophonmusik quaelt jeden groesseren zwerg.",
    ),
    (
        "Љубазни фењерџија чађавог лица хоће да ми покаже штос.",
        "љубазни фењерџија чађавог лица хоће да ми покаже штос.",
    ),
    (
        "Ljubazni fenjerdžija čađavog lica hoće da mi pokaže štos.",
        "ljubazni fenjerdzija cadavog lica hoce da mi pokaze stos.",
    ),
    (
        "Quizdeltagerne spiste jordbær med fløde, mens cirkusklovnen Walther spillede på xylofon.",
        "quizdeltagerne spiste jordbaer med flode, mens cirkusklovnen walther spillede pa xylofon.",
    ),
    (
        "Kæmi ný öxi hér ykist þjófum nú bæði víl og ádrepa.",
        "kaemi ny oexi her ykist þjofum nu baedi vil og adrepa.",
    ),
    (
        "Glāžšķūņa rūķīši dzērumā čiepj Baha koncertflīģeļu vākus.",
        "glazskuna rukisi dzeruma ciepj baha koncertfligelu vakus.",
    )
]

for (given, expected) in examples:
    assert remove_diacritics(given) == expected

Case-preserving variant
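from unicodedata import combining, normalize

# Every key must be a single character, so the capital sharp s is "ẞ" (U+1E9E),
# not the two-letter string "SS" (str.maketrans() rejects multi-character keys).
LATIN = "ä  æ  ǽ  đ ð ƒ ħ ı ł ø ǿ ö  œ  ß  ŧ ü  Ä  Æ  Ǽ  Đ Ð Ƒ Ħ I Ł Ø Ǿ Ö  Œ  ẞ  Ŧ Ü "
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue AE AE AE D D F H I L O O OE OE SS T UE"

def remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):
    return "".join(c for c in normalize("NFD", s.translate(outliers)) if not combining(c))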
Accented characters are letters that carry diacritical marks, such as č, í or ä, and they are common in text from many languages. In this article, we are going to remove accents from a string.
Examples:
Input: orčpžsíáýd
Output: orcpzsiayd
Input: stävänger
Output: stavanger
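We can remove accents from a string by using a third-party Python module called Unidecode (installable with pip install Unidecode). Its unidecode() function takes a Unicode string and returns an ASCII transliteration of it, which removes the accents and also replaces other non-ASCII characters with ASCII approximations.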
Syntax:
output_string = unidecode.unidecode(target_string)
Below are some examples which show how to remove accents from a string:
Example 1:
Python3
import unidecode
string = "orčpžsíáýd"
print('\nOriginal String:', string)
outputString = unidecode.unidecode(string)
print('\nNew String:', outputString)
Output:
Original String: orčpžsíáýd

New String: orcpzsiayd
Example 2:
Python3
import unidecode
string = "stävänger"
print('\nOriginal String:', string)
outputString = unidecode.unidecode(string)
print('\nNew String:', outputString)
Output:
Original String: stävänger

New String: stavanger
Example 3:
Python3
import unidecode
stringList = ["hell°", "tromsø", "stävänger", "ölut"]
print('\nOriginal List of Strings:\n', stringList)
for i in range(len(stringList)):
    stringList[i] = unidecode.unidecode(stringList[i])
print('\nNew List of Strings:\n', stringList)
Output:
Original List of Strings:
 ['hell°', 'tromsø', 'stävänger', 'ölut']

New List of Strings:
 ['helldeg', 'tromso', 'stavanger', 'olut']
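For comparison, here is a minimal sketch that uses only the standard library instead of Unidecode. The helper name strip_marks() is an assumption made for this illustration. It removes combining marks after NFD normalization, so it handles the accented letters in the examples above, but it does not transliterate characters that have no decomposition.

```python
from unicodedata import combining, normalize

def strip_marks(text):
    """Decompose text (NFD) and drop the combining marks; keep everything else."""
    return "".join(c for c in normalize("NFD", text) if not combining(c))

for word in ["orčpžsíáýd", "stävänger", "tromsø", "hell°"]:
    print(word, "->", strip_marks(word))
```

Here "ø" and "°" come back unchanged, whereas Unidecode rewrites them as "o" and "deg". Which behavior is preferable depends on whether a strict ASCII result is required.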