character encoding - Using Java Normalizer to convert accented characters to unaccented ASCII while excluding some symbols


I have a set of data that contains accented characters. I want to convert the accented characters to plain English letters. I achieve that with the following code:

    import java.text.Normalizer;
    import java.util.regex.Pattern;

    public String deAccent(String str) {
        // Decompose each accented character into a base letter plus combining marks
        String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
        // Remove the combining marks, leaving only the base letters
        Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
        return pattern.matcher(nfdNormalizedString).replaceAll("");
    }

But this code has no way to exclude characters, and I don't know how to exclude certain characters from the conversion. For example, I want to exclude the letter "ü" in the word Düsseldorf so that it doesn't turn into Dusseldorf when I convert. Is there a way to pass an exclude list to the method or to the matcher so that certain accented characters are not converted?

Do not use normalization to remove accents!

For example, the following letters are not asciified by this method:

  • ł

  • đ

  • ħ
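The reason is that these letters have no canonical decomposition, so NFD never produces a combining mark for the regex to strip. A minimal sketch of that behaviour, reusing the approach from the question (the class name `NormalizerLimits` is only for illustration, and Unicode escapes are used so the source compiles regardless of file encoding):

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class NormalizerLimits {
    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Same technique as in the question: decompose to NFD, then drop combining marks.
    static String deAccent(String str) {
        String nfd = Normalizer.normalize(str, Normalizer.Form.NFD);
        return MARKS.matcher(nfd).replaceAll("");
    }

    // u-umlaut (U+00FC) decomposes into 'u' + U+0308, so the mark is removed.
    static boolean umlautIsStripped() {
        return deAccent("D\u00fcsseldorf").equals("Dusseldorf");
    }

    // l-stroke (U+0142), d-stroke (U+0111) and h-stroke (U+0127) have no
    // canonical decomposition, so they pass through unchanged.
    static boolean strokesSurvive() {
        String strokes = "\u0142\u0111\u0127";
        return deAccent(strokes).equals(strokes);
    }
}
```

Both helper methods should return `true`: the umlaut is removed, while the stroke letters are left exactly as they were.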

You may also want to split ligatures such as œ into separate letters (i.e. oe).

Try this:

    // Replacement table for U+00C0..U+017F, one character per code point
    private static final String TAB_00C0 = "" +
            "AAAAAAACEEEEIIII" +
            "DNOOOOO×OUUUÜYTs" + // <-- note the accented letter you wanted
                                  //     and the preserved multiplication sign
            "aaaaaaaceeeeiiii" +
            "dnooooo÷ouuuüyty" + // <-- note the accented letter and the preserved division sign
            "AaAaAaCcCcCcCcDd" +
            "DdEeEeEeEeEeGgGg" +
            "GgGgHhHhIiIiIiIi" +
            "IiJjJjKkkLlLlLlL" +
            "lLlNnNnNnnNnOoOo" +
            "OoOoRrRrRrSsSsSs" +
            "SsTtTtTtUuUuUuUu" +
            "UuUuWwYyYZzZzZzs";

    public static String toPlain(String source) {
        StringBuilder sb = new StringBuilder(source.length());
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            switch (c) {
                case 'ß':
                    sb.append("ss");
                    break;
                case 'Œ':
                    sb.append("OE");
                    break;
                case 'œ':
                    sb.append("oe");
                    break;
                // insert more ligatures you want to support
                // or other letters you want to convert in a non-standard way here
                // I recommend taking a look at: æ þ ð ﬂ ﬁ
                default:
                    if (c >= 0xc0 && c <= 0x17f) {
                        c = TAB_00C0.charAt(c - 0xc0);
                    }
                    sb.append(c);
            }
        }
        return sb.toString();
    }
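The table above hard-codes ü and Ü as the letters to preserve. If the exclude list needs to vary at run time, as the question asks, one option is to fold every letter in the table and pass the characters to keep as a parameter. The following is only a sketch under those assumptions: the class name `PlainText` and the `keep` parameter are hypothetical, `Set.of` assumes Java 9 or later, and Unicode escapes are used so the source compiles regardless of file encoding.

```java
import java.util.Set;

final class PlainText {
    // Replacement table for U+00C0..U+017F, one character per code point.
    // Unlike the table above, U+00DC and U+00FC are folded to U and u here;
    // the multiplication (U+00D7) and division (U+00F7) signs are still kept.
    private static final String TAB_00C0 =
            "AAAAAAACEEEEIIII" +
            "DNOOOOO\u00d7OUUUUYTs" +
            "aaaaaaaceeeeiiii" +
            "dnooooo\u00f7ouuuuyty" +
            "AaAaAaCcCcCcCcDd" +
            "DdEeEeEeEeEeGgGg" +
            "GgGgHhHhIiIiIiIi" +
            "IiJjJjKkkLlLlLlL" +
            "lLlNnNnNnnNnOoOo" +
            "OoOoRrRrRrSsSsSs" +
            "SsTtTtTtUuUuUuUu" +
            "UuUuWwYyYZzZzZzs";

    // Converts accented letters to plain ones, except those listed in 'keep',
    // which are copied through unchanged.
    static String toPlain(String source, Set<Character> keep) {
        StringBuilder sb = new StringBuilder(source.length());
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (keep.contains(c)) {
                sb.append(c);                           // caller asked to preserve it
            } else if (c == '\u00df') {                 // sharp s
                sb.append("ss");
            } else if (c == '\u0152') {                 // capital OE ligature
                sb.append("OE");
            } else if (c == '\u0153') {                 // small oe ligature
                sb.append("oe");
            } else if (c >= 0xc0 && c <= 0x17f) {
                sb.append(TAB_00C0.charAt(c - 0xc0));   // table lookup
            } else {
                sb.append(c);                           // outside the table: unchanged
            }
        }
        return sb.toString();
    }
}
```

With this sketch, `PlainText.toPlain("Düsseldorf", Set.of('ü'))` leaves the word as it is, while passing an empty set turns it into "Dusseldorf". The exclusion is checked before anything else, so putting ß or œ in `keep` also stops them from being expanded.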
