Guide to Character Encoding

1. Overview

In this tutorial, we'll discuss the basics of character encoding and how we can tackle it in Java.

2. The Importance of Character Encoding

We often deal with text belonging to multiple languages with diverse writing scripts like Latin or Arabic. Every character in every language needs to somehow be mapped to a set of ones and zeros. Really, it's a wonder that computers can process all of our languages correctly.

To do this properly, we need to think about character encoding. Not doing so can often lead to data loss and even security vulnerabilities.

To understand this better, let's define a method to decode a text in Java:

String decodeText(String input, String encoding) throws IOException {
    return new BufferedReader(
      new InputStreamReader(
        new ByteArrayInputStream(input.getBytes()),
        Charset.forName(encoding)))
      .readLine();
}

Note that the input text we feed in here uses the default platform encoding.

If we run this method with the input as "The façade pattern is a software design pattern." and the encoding as "US-ASCII", it'll output:

The fa��ade pattern is a software design pattern.

Well, not quite what we expected.

What could have gone wrong? We'll try to understand and fix this in the rest of this tutorial.

3. The Basics

Before digging deeper, let's quickly review three terms: encoding, charset, and code point.

3.1. Encoding

Computers can only understand binary representations like 1 and 0. Processing anything else requires some kind of mapping from real-world text to its binary representation. This mapping is what we know as character encoding or simply just encoding.

For example, the first letter in our message, "T", in US-ASCII encodes to "01010100".

3.2. Charset

Mappings of characters to their binary representations can vary greatly in terms of the characters they include. The number of characters included in a mapping can range from only a few to all the characters in practical use. The set of characters included in a mapping definition is formally called a charset.

For example, ASCII has a charset of 128 characters.
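
As a quick illustration, we can check in Java whether a character belongs to a given charset; a minimal sketch using CharsetEncoder.canEncode, with JUnit-style assertions as in the tests later in this article:

// 'T' is part of the 128-character US-ASCII charset, the accented 'ç' is not
assertTrue(StandardCharsets.US_ASCII.newEncoder().canEncode('T'));
assertFalse(StandardCharsets.US_ASCII.newEncoder().canEncode('ç'));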

3.3. Code Point

A code point is an abstraction that separates a character from its actual encoding. A code point is an integer reference to a particular character.

We can represent the integer itself in plain decimal or in alternate bases like hexadecimal or octal. We use alternate bases for the ease of referring to large numbers.

For example, the first letter in our message, T, in Unicode has the code point "U+0054" (or 84 in decimal).
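
In Java, we can read the code point of a character straight off a String; a small sketch:

// the code point of 'T' is 84 in decimal, conventionally written as U+0054
assertEquals(84, "T".codePointAt(0));
assertEquals("U+0054", String.format("U+%04X", "T".codePointAt(0)));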

4. Understanding Encoding Schemes

Character encoding can take various forms depending on the number of characters it encodes.

The number of characters encoded has a direct relationship to the length of each representation, typically measured as the number of bytes. Having more characters to encode essentially means needing longer binary representations.

Let's go through some of the popular encoding schemes in practice today.

4.1. Single-Byte Encoding

One of the earliest encoding schemes, called ASCII (American Standard Code for Information Exchange), uses a single-byte encoding scheme. This essentially means that each character in ASCII is represented with a seven-bit binary number. This still leaves one bit spare in every byte!

ASCII's 128-character set covers the English alphabet in lower and upper cases, digits, and some special and control characters.

Let's define a simple method in Java to display the binary representation of a character under a particular encoding scheme:

String convertToBinary(String input, String encoding)
  throws UnsupportedEncodingException {
    byte[] encoded_input = Charset.forName(encoding)
      .encode(input)
      .array();
    return IntStream.range(0, encoded_input.length)
      .map(i -> encoded_input[i])
      .mapToObj(e -> Integer.toBinaryString(e & 255)) // mask the byte to its unsigned value
      .map(e -> String.format("%1$" + Byte.SIZE + "s", e).replace(" ", "0"))
      .collect(Collectors.joining(" "));
}

Now, the character 'T' has a code point of 84 in US-ASCII (ASCII is referred to as US-ASCII in Java).

And if we use our utility method, we can see its binary representation:

assertEquals(convertToBinary("T", "US-ASCII"), "01010100");

This, as we expected, is a seven-bit binary representation for the character 'T'.

The original ASCII left the most significant bit of every byte unused. At the same time, ASCII left quite a lot of characters unrepresented, especially for non-English languages.

This led to an effort to utilize that unused bit and include an additional 128 characters.

There were several variations of the ASCII encoding scheme proposed and adopted over time. These loosely came to be referred to as "ASCII extensions".

Many of the ASCII extensions had different levels of success, but obviously, this was not good enough for wider adoption, as many characters were still not represented.

One of the more popular ASCII extensions was ISO-8859-1, also referred to as "ISO Latin 1".
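
For instance, the accented character 'ç' (code point U+00E7, or 231 in decimal) is missing from ASCII but present in ISO-8859-1, where it fits in that eighth bit. Using the convertToBinary method from above, we'd expect:

// 'ç' is 0xE7 in ISO-8859-1 - a single byte with the most significant bit set
assertEquals("11100111", convertToBinary("ç", "ISO-8859-1"));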

4.2. Multi-Byte Encoding

As the need to accommodate more and more characters grew, single-byte encoding schemes like ASCII were not sustainable.

This gave rise to multi-byte encoding schemes, which have much better capacity, albeit at the cost of increased space requirements.

BIG5 and SHIFT-JIS are examples of multi-byte character encoding schemes which started to use one as well as two bytes to represent wider charsets. Most of these were created to represent Chinese and similar scripts, which have a significantly higher number of characters.

Let's now call the method convertToBinary with the input as '語', a Chinese character, and the encoding as "Big5":

assertEquals(convertToBinary("語", "Big5"), "10111011 01111001");

The output above shows that the Big5 encoding uses two bytes to represent the character '語'.

A comprehensive list of character encodings, along with their aliases, is maintained by the Internet Assigned Numbers Authority (IANA).

5. Unicode

It's not difficult to understand that while encoding is important, decoding is equally vital to make sense of the representations. This is only possible in practice if a consistent or compatible encoding scheme is used widely.

Different encoding schemes developed in isolation and practiced in local geographies started to become challenging.

This challenge gave rise to a single encoding standard called Unicode, which has the capacity for every possible character in the world. This includes the characters which are in use and even those which are defunct!

Well, that must require several bytes to store each character? Honestly yes, but Unicode has an ingenious solution.

Unicode as a standard defines code points for every possible character in the world. The code point for the character ‘T' in Unicode is 84 in decimal. We generally refer to this as “U+0054” in Unicode, which is nothing but U+ followed by the hexadecimal number.

We use hexadecimal as the base for code points in Unicode as there are 1,114,112 points, which is a pretty large number to communicate conveniently in decimal!

How these code points are encoded into bits is left to specific encoding schemes within Unicode. We will cover some of these encoding schemes in the sub-sections below.

5.1. UTF-32

UTF-32 is an encoding scheme for Unicode that employs four bytes to represent every code point defined by Unicode. Obviously, it is space inefficient to use four bytes for every character.

Let's see how a simple character like ‘T' is represented in UTF-32. We will use the method convertToBinary introduced earlier:

assertEquals(convertToBinary("T", "UTF-32"), "00000000 00000000 00000000 01010100");

The output above shows the usage of four bytes to represent the character ‘T' where the first three bytes are just wasted space.

5.2. UTF-8

UTF-8 is another encoding scheme for Unicode which employs a variable length of bytes to encode. While it uses a single byte to encode characters generally, it can use a higher number of bytes if needed, thus saving space.

Let's again call the method convertToBinary with input as ‘T' and encoding as “UTF-8”:

assertEquals(convertToBinary("T", "UTF-8"), "01010100");

The output is identical to ASCII, using just a single byte. In fact, UTF-8 is completely backward compatible with ASCII.

Let's again call the method convertToBinary with input as ‘語' and encoding as “UTF-8”:

assertEquals(convertToBinary("語", "UTF-8"), "11101000 10101010 10011110");

As we can see here, UTF-8 uses three bytes to represent the character ‘語'. This is known as variable-width encoding.

UTF-8, due to its space efficiency, is the most common encoding used on the web.
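
We can observe the variable width directly by counting the bytes each character encodes to; a quick sketch using String.getBytes:

// UTF-8 uses a single byte for ASCII characters and more bytes for everything else
assertEquals(1, "T".getBytes(StandardCharsets.UTF_8).length);
assertEquals(2, "ç".getBytes(StandardCharsets.UTF_8).length);
assertEquals(3, "語".getBytes(StandardCharsets.UTF_8).length);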

6. Encoding Support in Java

Java supports a wide array of encodings and their conversions to each other. The class Charset defines a set of standard encodings which every implementation of the Java platform is mandated to support.

This includes US-ASCII, ISO-8859-1, UTF-8, and UTF-16 to name a few. A particular implementation of Java may optionally support additional encodings.
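
The guaranteed charsets are also exposed as constants on java.nio.charset.StandardCharsets, which avoids both the lookup and the risk of misspelling a name in Charset.forName; for example:

// compile-time checked references to the guaranteed charsets
Charset ascii = StandardCharsets.US_ASCII;
Charset utf8 = StandardCharsets.UTF_8; // equivalent to Charset.forName("UTF-8")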

There are some subtleties in the way Java picks up a charset to work with. Let's go through them in more detail.

6.1. Default Charset

The Java platform depends heavily on a property called the default charset. The Java Virtual Machine (JVM) determines the default charset during start-up.

This depends on the locale and the charset of the underlying operating system on which the JVM is running. For example, on macOS, the default charset is UTF-8.

Let's see how we can determine the default charset:

Charset.defaultCharset().displayName();

If we run this code snippet on a Windows machine, the output we get is:

windows-1252

Now, “windows-1252” is the default charset of the Windows platform in English, which in this case has determined the default charset of the JVM running on Windows.

6.2. Who Uses the Default Charset?

Many of the Java APIs make use of the default charset as determined by the JVM. To name a few:

  • InputStreamReader and FileReader
  • OutputStreamWriter and FileWriter
  • Formatter and Scanner
  • URLEncoder and URLDecoder

So, this means that if we were to run our example without specifying the charset:

new BufferedReader(new InputStreamReader(new ByteArrayInputStream(input.getBytes()))).readLine();

then it would use the default charset to decode it.

And there are several APIs that make this same choice by default.

Hence, the default charset assumes an importance that we cannot safely ignore.

6.3. Problems With the Default Charset

As we have seen, the default charset in Java is determined dynamically when the JVM starts. This makes the platform less reliable and more error-prone when used across different operating systems.

For example, if we run

new BufferedReader(new InputStreamReader(new ByteArrayInputStream(input.getBytes()))).readLine();

on macOS, it will use UTF-8.

If we try the same snippet on Windows, it will use Windows-1252 to decode the same text.

Or, imagine writing a file on macOS and then reading that same file on Windows.

It's not difficult to understand that because of different encoding schemes, this may lead to data loss or corruption.
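
The usual remedy is to pin the charset explicitly at both ends; a minimal sketch, assuming path is a java.nio.file.Path that both programs can access:

// write the bytes with an explicit charset on one machine...
Files.write(path, "The façade pattern is a software design pattern."
  .getBytes(StandardCharsets.UTF_8));

// ...and decode them with the same explicit charset on the other
String text = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);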

6.4. Can We Override the Default Charset?

The determination of the default charset in Java leads to two system properties:

  • file.encoding: The value of this system property is the name of the default charset
  • sun.jnu.encoding: The value of this system property is the name of the charset used when encoding/decoding file paths

Now, it's intuitive to override these system properties through command line arguments:

-Dfile.encoding="UTF-8" -Dsun.jnu.encoding="UTF-8"

However, it is important to note that these properties are read-only in Java. Using them as above is not documented, and overriding these system properties may not have the desired or predictable behavior.

Hence, we should avoid overriding the default charset in Java.

6.5. Why Is Java Not Solving This?

There is a JDK Enhancement Proposal (JEP) which prescribes using “UTF-8” as the default charset in Java instead of basing it on the locale and the operating system charset.

This JEP is in a draft state as of now, and when it (hopefully!) goes through, it will solve most of the issues we discussed earlier.

Note that newer APIs, like those in java.nio.file.Files, do not use the default charset. The methods in these APIs read or write character streams using UTF-8 rather than the default charset.
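
For instance, Files.newBufferedReader decodes as UTF-8 when no charset is given; a short sketch, where path stands in for a java.nio.file.Path:

// decodes the file as UTF-8, regardless of the platform default charset
BufferedReader utf8Reader = Files.newBufferedReader(path);

// an explicit charset can still be passed when needed
BufferedReader asciiReader = Files.newBufferedReader(path, StandardCharsets.US_ASCII);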

6.6. Solving This Problem in Our Programs

We should normally choose to specify a charset when dealing with text instead of relying on the default settings. We can explicitly declare the encoding we want to use in classes which deal with character-to-byte conversions.

Luckily, our example is already specifying the charset. We just need to select the right one and let Java do the rest.

We should realize by now that accented characters like ‘ç' are not present in the ASCII charset, so we need an encoding that includes them. Perhaps UTF-8?

Let's try that. We'll now run the method decodeText with the same input but with the encoding as “UTF-8”:
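
decodeText("The façade pattern is a software design pattern.", "UTF-8");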

The façade pattern is a software design pattern.

Bingo! We can see the output we were hoping to see now.

Here, we have set the encoding we think best suits our needs in the constructor of InputStreamReader. This is usually the safest way of dealing with character and byte conversions in Java.

Similarly, OutputStreamWriter and many other APIs support setting an encoding scheme through their constructors.
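
For example, a writer pinned to UTF-8 could look like this (a sketch; "facade.txt" is just a stand-in file name):

// every character written through this writer is encoded as UTF-8
try (Writer writer = new OutputStreamWriter(
  new FileOutputStream("facade.txt"), StandardCharsets.UTF_8)) {
    writer.write("The façade pattern is a software design pattern.");
}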

6.7. MalformedInputException

When we decode a byte sequence, there are cases in which it's not legal for the given Charset, or it's not legal sixteen-bit Unicode. In other words, the given byte sequence has no mapping in the specified Charset.

There are three predefined strategies (or CodingErrorAction) when the input sequence has malformed input:

  • IGNORE will ignore malformed characters and resume coding operation
  • REPLACE will replace the malformed characters in the output buffer and resume the coding operation
  • REPORT will throw a MalformedInputException

The default malformedInputAction for the CharsetDecoder is REPORT, and the default malformedInputAction of the default decoder in InputStreamReader is REPLACE.
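
We can verify the decoder default directly; a quick check using CharsetDecoder.malformedInputAction, which returns the currently configured action:

// a freshly created decoder reports malformed input by default
assertEquals(CodingErrorAction.REPORT,
  StandardCharsets.US_ASCII.newDecoder().malformedInputAction());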

Let's define a decoding function that receives a specified Charset, a CodingErrorAction type, and a string to be decoded:

String decodeText(String input, Charset charset,
  CodingErrorAction codingErrorAction) throws IOException {
    CharsetDecoder charsetDecoder = charset.newDecoder();
    charsetDecoder.onMalformedInput(codingErrorAction);
    return new BufferedReader(
      new InputStreamReader(
        new ByteArrayInputStream(input.getBytes()), charsetDecoder)).readLine();
}

So, if we decode “The façade pattern is a software design pattern.” with US_ASCII, the output for each strategy would be different. First, we use CodingErrorAction.IGNORE which skips illegal characters:

Assertions.assertEquals("The faade pattern is a software design pattern.",
  CharacterEncodingExamples.decodeText(
    "The façade pattern is a software design pattern.",
    StandardCharsets.US_ASCII,
    CodingErrorAction.IGNORE));

For the second test, we use CodingErrorAction.REPLACE that puts � instead of the illegal characters:

Assertions.assertEquals("The fa��ade pattern is a software design pattern.",
  CharacterEncodingExamples.decodeText(
    "The façade pattern is a software design pattern.",
    StandardCharsets.US_ASCII,
    CodingErrorAction.REPLACE));

For the third test, we use CodingErrorAction.REPORT which leads to throwing MalformedInputException:

Assertions.assertThrows(
  MalformedInputException.class,
  () -> CharacterEncodingExamples.decodeText(
    "The façade pattern is a software design pattern.",
    StandardCharsets.US_ASCII,
    CodingErrorAction.REPORT));

7. Other Places Where Encoding Is Important

We don't just need to consider character encoding while programming. Text can go irreversibly wrong in many other places.

The most common cause of problems in these cases is the conversion of text from one encoding scheme to another, thereby possibly introducing data loss.

Let's quickly go through a few places where we may encounter issues when encoding or decoding text.

7.1. Text Editors

In most cases, a text editor is where text originates. There are numerous popular text editors, including vi, Notepad, and MS Word. Most of these text editors allow us to select the encoding scheme. Hence, we should always make sure they are appropriate for the text we are handling.

7.2. File System

After we create text in an editor, we need to store it in some file system. The file system depends on the operating system it runs on. Most operating systems have inherent support for multiple encoding schemes. However, there may still be cases where an encoding conversion leads to data loss.

7.3. Network

Text transferred over a network using a protocol like File Transfer Protocol (FTP) also involves conversion between character encodings. For anything encoded in Unicode, it's safest to transfer it as binary to minimize the risk of loss in conversion. However, transferring text over a network is one of the less frequent causes of data corruption.

7.4. Databases

Most popular databases, like Oracle and MySQL, support choosing a character encoding scheme when installing or creating a database. We must choose this in accordance with the text we expect to store in the database. This is one of the more frequent places where corruption of text data happens due to encoding conversions.

7.5. Browsers

Finally, in most web applications, we create text and pass it through different layers with the intention of viewing it in a user interface, like a browser. Here as well, it is imperative for us to choose the right character encoding, which can display the characters properly. Most popular browsers, like Chrome and Edge, allow choosing the character encoding through their settings.

8. Conclusion

In this article, we discussed how encoding can be an issue while programming.

We further discussed the fundamentals including encoding and charsets. Moreover, we went through different encoding schemes and their uses.

We also picked up an example of incorrect character encoding usage in Java and saw how to get that right. Finally, we discussed some other common error scenarios related to character encoding.

As always, the code for the examples is available over on GitHub.