Reading UTF-8 and GBK-family text in C++: methods and principles
1. How UTF-8 encoded text is read
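First, consider how UTF-8 encodes characters. UTF-8 is a variable-length encoding: in its original design a character occupies 1 to 6 bytes, and the number of bytes is given by the number of leading 1 bits in the character's first byte. (RFC 3629 later restricted UTF-8 to at most 4 bytes, covering code points up to U+10FFFF, so the 5- and 6-byte forms no longer occur in valid text.) The encoding scheme is as follows: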
U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
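So, for each byte: if its leading bit is "0", the character occupies 1 byte.
If its leading bits are "10", the byte is a continuation byte, not the first byte of a character.
If it begins with n "1" bits followed by one "0" bit, the character occupies n bytes, where 2 ≤ n ≤ 6.
Thus, to read UTF-8 we only need to count the leading 1 bits in the first byte of each character to determine the length of that character.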
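The rule above can be sketched as a small standalone function. This is a minimal illustration rather than the article's own implementation: the names `utf8_len_from_lead` and `split_utf8` are hypothetical, and the extra check that every following byte has the form 10xxxxxx is an addition that the rule above does not strictly require.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: number of bytes announced by a UTF-8 lead byte,
// or 0 if the byte cannot start a character.
std::size_t utf8_len_from_lead(unsigned char lead)
{
    if (lead < 0x80)           return 1;  // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    if ((lead & 0xFC) == 0xF8) return 5;  // 111110xx (original scheme only)
    if ((lead & 0xFE) == 0xFC) return 6;  // 1111110x (original scheme only)
    return 0;                             // 10xxxxxx, 0xFE or 0xFF
}

// Hypothetical helper: split a byte string into UTF-8 characters.
// Malformed or truncated sequences are emitted one byte at a time.
std::vector<std::string> split_utf8(const std::string &bytes)
{
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < bytes.size())
    {
        std::size_t len = utf8_len_from_lead(static_cast<unsigned char>(bytes[i]));
        if (len == 0 || i + len > bytes.size())
        {
            len = 1;  // invalid lead byte or truncated character
        }
        else
        {
            // Every following byte must be a continuation byte (10xxxxxx).
            for (std::size_t k = 1; k < len; ++k)
            {
                if ((static_cast<unsigned char>(bytes[i + k]) & 0xC0) != 0x80)
                {
                    len = 1;
                    break;
                }
            }
        }
        out.push_back(bytes.substr(i, len));
        i += len;
    }
    return out;
}
```

For example, the bytes 0x61 0xE4 0xB8 0xAD 0x62 (the text "a中b") split into three pieces of 1, 3, and 1 bytes, because the lead byte 0xE4 is 11100100 and so announces a three-byte character.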
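2. How GBK-family text is read
ASCII, GB2312, GBK, and GB18030 form a backward-compatible family: a character that exists in an earlier standard keeps the same encoding in the later ones, and each later standard supports more characters.
In these encodings, English and Chinese text can be handled uniformly. A Chinese character is recognized by the most significant bit of its lead (high) byte being 1.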
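So we only need to handle GB18030 to cover every encoding it is compatible with. GB18030 is a variable-length encoding; its one-byte and two-byte forms are the ones handled here. (GB18030 also defines four-byte sequences, whose second byte lies in 0x30~0x39; the code in this article does not treat them specially.)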
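The single-byte part, 0x00~0x7F, is compatible with ASCII. In the two-byte part, the lead byte ranges over 0x81~0xFE and the trail byte over 0x40~0x7E and 0x80~0xFE, which is largely compatible with the GBK standard.
Therefore, checking whether the lead byte is less than 0x81 is enough to decide whether a character is single-byte or double-byte.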
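The lead-byte test can be sketched as follows. This is an illustration under stated assumptions, not the article's code: `gb_char_len` and `split_gb` are hypothetical names, and the four-byte branch (second byte in 0x30~0x39, as defined by GB18030) goes beyond the single/double-byte rule described above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: length in bytes of the character that starts at pos.
// Assumes pos < bytes.size().
std::size_t gb_char_len(const std::string &bytes, std::size_t pos)
{
    const unsigned char lead = static_cast<unsigned char>(bytes[pos]);
    if (lead < 0x81 || lead == 0xFF)
        return 1;                          // ASCII, or not a valid lead byte
    if (pos + 1 >= bytes.size())
        return 1;                          // truncated: lead byte at end of input
    const unsigned char second = static_cast<unsigned char>(bytes[pos + 1]);
    if (second >= 0x30 && second <= 0x39)  // GB18030 four-byte form (assumption)
        return (pos + 4 <= bytes.size()) ? 4 : bytes.size() - pos;
    return 2;                              // GBK two-byte form
}

// Hypothetical helper: split a byte string into GBK/GB18030 characters.
std::vector<std::string> split_gb(const std::string &bytes)
{
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < bytes.size())
    {
        const std::size_t len = gb_char_len(bytes, i);
        out.push_back(bytes.substr(i, len));
        i += len;
    }
    return out;
}
```

With the bytes 0x41 0xD6 0xD0 0x42 ("A中B" in GBK), the lead byte 0xD6 is at least 0x81 and its second byte is outside 0x30~0x39, so the split yields pieces of 1, 2, and 1 bytes.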
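3. C++ implementation
For a language-processing system, reading text in different encodings is a basic requirement. The encoding should be transparent to the rest of the system: a caller fetches one character at a time and does not need to know how the text is encoded. We therefore define an abstract class Text with the interface ReadOneChar, and derive two text classes, GbkText and UtfText, from it. When the system needs to read files in another encoding, it is enough to define a new class that inherits from the abstract class; the code that calls it does not need to change, which makes the design easier to extend.
A better approach is to use the simple factory pattern, which makes the different text encodings completely transparent to the calling code. For a detailed explanation of the simple factory pattern, see: Implementing Design Patterns in C++: The Simple Factory Pattern
The abstract class Text is defined as follows: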
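Before the file-based classes, the design can be sketched in a compact, in-memory form. This is a variation for illustration only and assumes C++11: the names `CharReader`, `Utf8Reader`, `GbkReader`, `MakeReader`, and `CountChars` are hypothetical, the readers work on a string instead of a file, and the shared reading loop lives in the base class with only the length rule virtual, which differs from the pure virtual ReadOneChar used in this article.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical in-memory counterpart of the abstract class.
class CharReader
{
public:
    explicit CharReader(const std::string &bytes) : m_bytes(bytes), m_index(0) {}
    virtual ~CharReader() {}

    // Returns false at end of input; otherwise stores the next character.
    bool ReadOneChar(std::string &oneChar)
    {
        if (m_index >= m_bytes.size())
            return false;
        std::size_t len = CharLen(static_cast<unsigned char>(m_bytes[m_index]));
        const std::size_t remaining = m_bytes.size() - m_index;
        if (len > remaining)
            len = remaining;  // truncated character at end of input
        oneChar = m_bytes.substr(m_index, len);
        m_index += len;
        return true;
    }

protected:
    // Encoding-specific rule: character length implied by the lead byte.
    virtual std::size_t CharLen(unsigned char lead) const = 0;

private:
    std::string m_bytes;
    std::size_t m_index;
};

class Utf8Reader : public CharReader
{
public:
    explicit Utf8Reader(const std::string &bytes) : CharReader(bytes) {}
protected:
    std::size_t CharLen(unsigned char lead) const override
    {
        std::size_t n = 0;  // count the leading 1 bits
        for (unsigned char mask = 0x80; mask != 0 && (lead & mask); mask >>= 1)
            ++n;
        return (n < 2 || n > 6) ? 1 : n;  // ASCII or invalid lead: one byte
    }
};

class GbkReader : public CharReader
{
public:
    explicit GbkReader(const std::string &bytes) : CharReader(bytes) {}
protected:
    std::size_t CharLen(unsigned char lead) const override
    {
        return lead < 0x81 ? 1 : 2;
    }
};

// Simple factory: the caller names an encoding and sees only CharReader.
std::unique_ptr<CharReader> MakeReader(const std::string &encoding,
                                       const std::string &bytes)
{
    if (encoding == "utf-8")
        return std::unique_ptr<CharReader>(new Utf8Reader(bytes));
    if (encoding == "gbk")
        return std::unique_ptr<CharReader>(new GbkReader(bytes));
    return nullptr;  // unknown encoding name
}

// Caller-side code that never mentions a concrete encoding class.
std::size_t CountChars(CharReader &reader)
{
    std::size_t n = 0;
    std::string ch;
    while (reader.ReadOneChar(ch))
        ++n;
    return n;
}
```

Returning `std::unique_ptr` is a design choice of this sketch: the caller cannot forget to release the reader, whereas a raw pointer has to be deleted by hand.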
    #ifndef TEXT_H
    #define TEXT_H
    #include <cstddef>
    #include <string>
    using namespace std;

    // Abstract base class: holds the raw bytes of a file and hands them out
    // one character at a time, whatever the encoding.
    class Text
    {
        protected:
            char * m_binaryStr;  // raw file contents
            size_t m_length;     // number of bytes in m_binaryStr
            size_t m_index;      // offset of the next byte to read
        public:
            Text(string path);
            void SetIndex(size_t index);
            // Returns false at end of input; otherwise stores the next character.
            virtual bool ReadOneChar(string &oneChar) = 0;
            size_t Size();
            virtual ~Text();
    };
    #endif
The abstract class Text is implemented as follows:
    #include "Text.h"
    #include <fstream>
    #include <iostream>
    using namespace std;

    Text::Text(string path) : m_binaryStr(nullptr), m_length(0), m_index(0)
    {
        // Open in binary mode so that no bytes are translated
        ifstream filestr(path.c_str(), ios::binary);
        if (!filestr)
        {
            // Leave the object empty; ReadOneChar will report end of input
            cerr << path << " Load text error." << endl;
            return;
        }
        // Get the stream buffer and use it to find the file size
        filebuf *pbuf = filestr.rdbuf();
        const streamoff size = pbuf->pubseekoff(0, ios::end, ios::in);
        pbuf->pubseekpos(0, ios::in);
        if (size < 0)
            return;
        m_length = static_cast<size_t>(size);
        // Allocate memory (one extra byte for a terminating '\0')
        m_binaryStr = new char[m_length + 1];
        // Read the file contents
        pbuf->sgetn(m_binaryStr, m_length);
        m_binaryStr[m_length] = '\0';
        // Close the file
        filestr.close();
    }

    void Text::SetIndex(size_t index)
    {
        m_index = index;
    }

    size_t Text::Size()
    {
        return m_length;
    }

    Text::~Text()
    {
        delete [] m_binaryStr;
    }
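As an alternative to a manually managed char buffer, the whole file can be read into a std::string, which releases its memory automatically and can be copied safely. The following is a minimal sketch, not the article's approach; `ReadAllBytes` is a hypothetical helper name.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: read a whole file as raw bytes.
// Returns false if the file cannot be opened; out is left unchanged then.
bool ReadAllBytes(const std::string &path, std::string &out)
{
    // Binary mode keeps every byte as stored on disk.
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in)
        return false;
    // Copy from the stream buffer until end of file.
    out.assign(std::istreambuf_iterator<char>(in),
               std::istreambuf_iterator<char>());
    return true;
}
```

A class built on this helper would keep a std::string member in place of the pointer and length pair, and would not need a hand-written destructor.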
The GbkText class is defined as follows:
    #ifndef GBKTEXT_H
    #define GBKTEXT_H
    #include <string>
    #include "Text.h"
    using namespace std;
    class GbkText:public Text
    {
    public:
        GbkText(string path);
        ~GbkText(void);
        bool ReadOneChar(string & oneChar);
    };
    #endif
The GbkText class is implemented as follows:
    #include "GbkText.h"
    GbkText::GbkText(string path):Text(path){}
    GbkText::~GbkText(void) {}
    bool GbkText::ReadOneChar(string & oneChar)
    {
        // return true: a character was read
        // return false: the end of the input has been reached
        if(m_index >= m_length)
            return false;
        if((unsigned char)m_binaryStr[m_index] < 0x81)
        {
            // Single byte (ASCII range)
            oneChar = m_binaryStr[m_index];
            m_index++;
        }
        else
        {
            // Double byte: lead byte plus trail byte, taken from the current
            // offset and clamped so a truncated file cannot overrun the buffer
            size_t len = (m_length - m_index >= 2) ? 2 : 1;
            oneChar = string(m_binaryStr + m_index, len);
            m_index += len;
        }
        return true;
    }
The UtfText class is defined as follows:
    #ifndef UTFTEXT_H
    #define UTFTEXT_H
    #include <string>
    #include "Text.h"
    using namespace std;
    class UtfText:public Text
    {
    public:
        UtfText(string path);
        ~UtfText(void);
        bool ReadOneChar(string & oneChar);
    private:
        size_t get_utf8_char_len(const char & byte);
    };
    #endif
The UtfText class is implemented as follows:
    #include "UtfText.h"
    UtfText::UtfText(string path):Text(path){}
    UtfText::~UtfText(void) {}
    bool UtfText::ReadOneChar(string & oneChar)
    {
        // return true: a character was read
        // return false: the end of the input has been reached
        if(m_index >= m_length)
            return false;
        size_t utf8_char_len = get_utf8_char_len(m_binaryStr[m_index]);
        if( 0 == utf8_char_len )
        {
            // Invalid lead byte: skip it and report an empty character
            oneChar = "";
            m_index++;
            return true;
        }
        size_t next_idx = m_index + utf8_char_len;
        if( m_length < next_idx )
        {
            // Truncated character at the end of the input: take what is left
            next_idx = m_length;
        }
        // Output one UTF-8 character
        oneChar = string(m_binaryStr + m_index, next_idx - m_index);
        // Advance the offset
        m_index = next_idx;
        return true;
    }

    size_t UtfText::get_utf8_char_len(const char & byte)
    {
        // return 0: error (more than six leading 1 bits)
        // return 1-6: length of the character in bytes
        // never returns any other value
        //
        // The length is the number of leading 1 bits in the lead byte:
        //     0xxxxxxx -> 1, 110xxxxx -> 2, 1110xxxx -> 3,
        //     11110xxx -> 4, 111110xx -> 5, 1111110x -> 6
        // A stray continuation byte (10xxxxxx) yields 1 and is therefore
        // passed through as a single byte.
        const unsigned char b = static_cast<unsigned char>(byte);
        size_t len = 0;
        unsigned char mask = 0x80;
        while( b & mask )
        {
            len++;
            if( len > 6 )
            {
                return 0;
            }
            mask >>= 1;
        }
        if( 0 == len)
        {
            return 1;
        }
        return len;
    }
The factory class TextFactory is defined as follows:
    #ifndef TEXTFACTORY_H
    #define TEXTFACTORY_H
    #include <string>
    #include "Text.h"
    #include "UtfText.h"
    #include "GbkText.h"
    using namespace std;
    class TextFactory
    {
        public:
            static Text * CreateText(string textCode, string path);
    };
    #endif
The factory class is implemented as follows:
    #include "TextFactory.h"
    #include "Text.h"
    // Maps an encoding name to a reader. Names are compared exactly
    // (case-sensitive). Encodings other than UTF-8/ASCII and the GBK family
    // are routed to one of the two readers as an approximation, so their
    // characters may not be split exactly.
    // The caller owns the returned object and must delete it;
    // a null pointer means the name is not recognized.
    Text * TextFactory::CreateText(string textCode, string path)
    {
        if( (textCode == "utf-8")
                    || (textCode == "UTF-8")
                    || (textCode == "ISO-8859-2")
                    || (textCode == "ascii")
                    || (textCode == "ASCII")
                    || (textCode == "TIS-620")
                    || (textCode == "ISO-8859-5")
                    || (textCode == "ISO-8859-7") )
        {
            return new UtfText(path);
        }
        else if((textCode == "windows-1252")
                    || (textCode == "Big5")
                    || (textCode == "EUC-KR")
                    || (textCode == "GB2312")
                    || (textCode == "ISO-2022-CN")
                    || (textCode == "HZ-GB-2312")
                    || (textCode == "gb18030"))
        {
            return new GbkText(path);
        }
        return nullptr;
    }
The main function used for testing is as follows:
    #include <iostream>
    #include <string>
    #include "Text.h"
    #include "TextFactory.h"
    // #include "CodeDetector.h"  // from the author's project; not shown in this article and not used below
    using namespace std;
    int main(int argc, char *argv[])
    {
        string path ="日文";
        string code ="utf-8";
        Text * t = TextFactory::CreateText(code, path);
        if (t == nullptr)
        {
            // The factory does not recognize this encoding name
            cerr << "Unsupported encoding: " << code << endl;
            return 1;
        }
        string s;
        // Print the file one character at a time
        while(t->ReadOneChar(s))
        {
            cout<<s;
        }
        delete t;
        return 0;
    }
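After compiling and running, the text is printed to the console correctly, provided the console uses the same encoding as the file.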