mb_detect_encoding

(PHP 4 >= 4.0.6, PHP 5, PHP 7, PHP 8)

mb_detect_encoding - Detect character encoding

Description

mb_detect_encoding ( string $str , mixed $encoding_list = mb_detect_order() , bool $strict = false ) : string

Detects the character encoding of the string str.

Parameters

str

The string being inspected.

encoding_list

encoding_list is a list of character encodings. The encoding order may be specified as an array or as a comma-separated list string.

If encoding_list is omitted, detect_order is used.

strict

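strict specifies whether to use strict encoding detection. The default is false.

Return Values

The detected character encoding, or false if the encoding of the given string cannot be detected.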

Examples

Example #1 mb_detect_encoding() example


<?php
/* Detect the character encoding with the current detect_order */
echo mb_detect_encoding($str);

/* "auto" is expanded according to mbstring.language */
echo mb_detect_encoding($str, "auto");

/* Specify encoding_list as a comma-separated list */
echo mb_detect_encoding($str, "JIS, eucjp-win, sjis-win");

/* Specify encoding_list as an array */
$ary[] = "ASCII";
$ary[] = "JIS";
$ary[] = "EUC-JP";
echo mb_detect_encoding($str, $ary);
?>
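
Because the function returns false when no candidate encoding matches, the result should be compared with === before it is used as an encoding name. A minimal sketch, assuming strict mode and a hypothetical helper named detect_or_default() (it is not part of mbstring):

```php
<?php
// Hypothetical helper: returns the detected encoding name,
// or $default when mb_detect_encoding() reports false.
function detect_or_default(string $bytes, array $candidates, string $default = 'unknown'): string
{
    $detected = mb_detect_encoding($bytes, $candidates, true);
    return $detected === false ? $default : $detected;
}

echo detect_or_default("plain text", ['ASCII', 'UTF-8']), "\n";        // ASCII
echo detect_or_default("\xC3\xA9t\xC3\xA9", ['ASCII', 'UTF-8']), "\n"; // UTF-8
echo detect_or_default("\xE9t\xE9", ['ASCII', 'UTF-8']), "\n";         // unknown
```

The third string is ISO-8859-1 bytes, which are neither ASCII nor valid UTF-8, so strict detection against this candidate list fails.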

See Also

mb_detect_order() - Set/get the character encoding detection order

User Contributed Notes

recentUser at example dot com 28-Mar-2018 07:17


In my environment (PHP 7.1.12), "mb_detect_encoding()" doesn't work where "mb_detect_order()" is not set appropriately.

To enable "mb_detect_encoding()" to work in such a case, simply put "mb_detect_order('...')" before "mb_detect_encoding()" in your script file.

Both "ini_set('mbstring.language', '...');" and "ini_set('mbstring.detect_order', '...');" DON'T work in script files for this purpose, whereas setting them in the PHP.INI file may work.

lotushzy at gmail dot com 04-Jan-2018 12:18


The manual page for mb_detect_encoding (http://php.net/manual/zh/function.mb-detect-encoding.php) contains this example:

mb_detect_encoding('áéóú', 'UTF-8', true); // false

but the result I get now is not false. Can anyone explain why? Thanks!

garbage at iglou dot eu 30-Mar-2017 01:11


To detect UTF-8, you can use:



if (preg_match('!!u', $str)) { echo 'utf-8'; }



- Norihiori

lexonight at yahoo dot com 06-Nov-2016 12:54


<?php

$file = file_get_contents("somefile.txt");

$encodings = implode(',', mb_list_encodings());

echo mb_detect_encoding($file, $encodings, true);

?>

seems to work

emoebel at web dot de 25-Dec-2013 09:29


If the function mb_detect_encoding does not exist, try:



<?php 

// ---------------------------------------------------- 

if ( !function_exists('mb_detect_encoding') ) { 



// ---------------------------------------------------------------- 

function mb_detect_encoding ($string, $enc=null, $ret=null) { 

       

        static $enclist = array( 

            'UTF-8', 'ASCII', 

            'ISO-8859-1', 'ISO-8859-2', 'ISO-8859-3', 'ISO-8859-4', 'ISO-8859-5', 

            'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9', 'ISO-8859-10', 

            'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15', 'ISO-8859-16', 

            'Windows-1251', 'Windows-1252', 'Windows-1254', 

            );

        

        $result = false; 

        

        foreach ($enclist as $item) { 

            $sample = iconv($item, $item, $string); 

            if (md5($sample) == md5($string)) { 

                if ($ret === NULL) { $result = $item; } else { $result = true; } 

                break; 

            }

        }

        

    return $result; 

} 

// ---------------------------------------------------------------- 



} 

// ---------------------------------------------------- 

?>



example / usage of: mb_detect_encoding() 



<?php 

// ------------------------------------------------------ 

function str_to_utf8 ($str) { 

    

    if (mb_detect_encoding($str, 'UTF-8', true) === false) { 

    $str = utf8_encode($str); 

    }



    return $str;

}

// ------------------------------------------------------ 

?>



$txtstr = str_to_utf8($txtstr);

Anonymous 08-Oct-2013 09:17


// ----------------------------------------------------------- 



if(!function_exists('mb_detect_encoding')) {



function mb_detect_encoding($string, $enc=null, $ret=true) {

    $out=$enc; 

    static $list = array('utf-8', 'iso-8859-1', 'iso-8859-15', 'windows-1251');

        foreach ($list as $item) {

            $sample = iconv($item, $item, $string);

            if (md5($sample) == md5($string)) { $out = ($ret !== false) ? true : $item; } 

        } 

    return $out;

}



}



// -----------------------------------------------------------

eyecatchup at gmail dot com 11-Jun-2013 10:41


Just a note: Instead of using the often recommended (rather complex) regular expression by W3C (http://www.w3.org/International/questions/qa-forms-utf-8.en.php), you can simply use the 'u' modifier to test a string for UTF-8 validity:



<?php

  if (preg_match("//u", $string)) {

      // $string is valid UTF-8

  }

?>
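
The same check can be wrapped in a function that returns a strict boolean. This is only a sketch: the name is_valid_utf8() is hypothetical, and it relies on preg_match() returning false (rather than 0 or 1) when the subject is not valid UTF-8 under the 'u' modifier.

```php
<?php
// Hypothetical helper name. With the 'u' modifier, preg_match() returns
// false for a subject that is not valid UTF-8, and 1 when the empty
// pattern matches a valid subject.
function is_valid_utf8(string $string): bool
{
    return preg_match('//u', $string) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // bool(true)
var_dump(is_valid_utf8("caf\xE9"));     // bool(false)
```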

bmrkbyet at web dot de 24-Mar-2013 02:04


a) if the FUNCTION mb_detect_encoding is not available: 



### mb_detect_encoding ... iconv ###



<?php

// -------------------------------------------



if(!function_exists('mb_detect_encoding')) { 

function mb_detect_encoding($string, $enc=null) { 

    

    static $list = array('utf-8', 'iso-8859-1', 'windows-1251');

    

    foreach ($list as $item) {

        $sample = iconv($item, $item, $string);

        if (md5($sample) == md5($string)) { 

            if ($enc == $item) { return true; }    else { return $item; } 

        }

    }

    return null;

}

}



// -------------------------------------------

?>



b) if the FUNCTION mb_convert_encoding is not available: 



### mb_convert_encoding ... iconv ###



<?php

// -------------------------------------------



if(!function_exists('mb_convert_encoding')) { 

function mb_convert_encoding($string, $target_encoding, $source_encoding) { 

    $string = iconv($source_encoding, $target_encoding, $string); 

    return $string; 

}

}



// -------------------------------------------

?>

Gerg Tisza 18-Feb-2011 03:43


If you use mb_detect_encoding to check whether a string is valid UTF-8, use strict mode; it is pretty worthless otherwise.



<?php

    $str = 'áéóú'; // ISO-8859-1 (assumes this source file is saved as ISO-8859-1)

    mb_detect_encoding($str, 'UTF-8'); // 'UTF-8'

    mb_detect_encoding($str, 'UTF-8', true); // false

?>
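
The result of this example depends on how the script file itself is encoded, because the literal 'áéóú' holds whatever bytes the file contains. A sketch with the bytes written out explicitly, so that the file encoding no longer matters (the variable names are illustrative):

```php
<?php
$latin1 = "\xE1\xE9\xF3\xFA";                 // 'áéóú' as ISO-8859-1 bytes
$utf8   = "\xC3\xA1\xC3\xA9\xC3\xB3\xC3\xBA"; // 'áéóú' as UTF-8 bytes

var_dump(mb_detect_encoding($latin1, 'UTF-8', true)); // bool(false)
var_dump(mb_detect_encoding($utf8, 'UTF-8', true));   // string(5) "UTF-8"
```

If the script is saved as UTF-8, the literal behaves like $utf8, and the strict call returns "UTF-8" rather than false.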

nat3738 at gmail dot com 22-May-2009 03:58


A simple way to detect the UTF-8/16/32 encoding of a file from its BOM (does not work for strings or files without a BOM):



<?php

// The Unicode BOM is U+FEFF; once encoded, it looks like this.

define ('UTF32_BIG_ENDIAN_BOM'   , chr(0x00) . chr(0x00) . chr(0xFE) . chr(0xFF));

define ('UTF32_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE) . chr(0x00) . chr(0x00));

define ('UTF16_BIG_ENDIAN_BOM'   , chr(0xFE) . chr(0xFF));

define ('UTF16_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE));

define ('UTF8_BOM'               , chr(0xEF) . chr(0xBB) . chr(0xBF));



function detect_utf_encoding($filename) {



    $text = file_get_contents($filename);

    $first2 = substr($text, 0, 2);

    $first3 = substr($text, 0, 3);

    $first4 = substr($text, 0, 4);

    

    if ($first3 == UTF8_BOM) return 'UTF-8';

    elseif ($first4 == UTF32_BIG_ENDIAN_BOM) return 'UTF-32BE';

    elseif ($first4 == UTF32_LITTLE_ENDIAN_BOM) return 'UTF-32LE';

    elseif ($first2 == UTF16_BIG_ENDIAN_BOM) return 'UTF-16BE';

    elseif ($first2 == UTF16_LITTLE_ENDIAN_BOM) return 'UTF-16LE';

}

?>
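
The same idea also works on a string that is already in memory. The following is a sketch only; detect_bom() is a hypothetical name, and the table lists the longer BOMs first because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM.

```php
<?php
// Hypothetical helper: map a leading byte-order mark to an encoding name,
// or return null when the string does not start with a known BOM.
function detect_bom(string $bytes): ?string
{
    // Longer BOMs first: "\xFF\xFE\x00\x00" (UTF-32LE) starts with "\xFF\xFE" (UTF-16LE).
    $boms = [
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFE\xFF"         => 'UTF-16BE',
        "\xFF\xFE"         => 'UTF-16LE',
    ];
    foreach ($boms as $bom => $name) {
        // strncmp() is binary safe and compares only the BOM-length prefix.
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $name;
        }
    }
    return null;
}
```

For a file, only the first four bytes are needed, for example detect_bom(file_get_contents($path, false, null, 0, 4)), where $path is a placeholder for your own file name.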

prgss at bk dot ru 30-Mar-2009 02:16


Another lightweight way to detect character encoding:

<?php

function detect_encoding($string) {  

  static $list = array('utf-8', 'windows-1251');

  

  foreach ($list as $item) {

    $sample = iconv($item, $item, $string);

    if (md5($sample) == md5($string))

      return $item;

  }

  return null;

}

?>
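
The round-trip test at the heart of this approach can be isolated as below. This is a sketch under assumptions: survives_roundtrip() is a hypothetical name, the strings are compared directly instead of through md5(), and the exact handling of invalid input depends on the iconv implementation PHP was built against.

```php
<?php
// Hypothetical helper: true when iconv() converts $bytes from $encoding
// to the same $encoding without failing and without changing the bytes.
function survives_roundtrip(string $bytes, string $encoding): bool
{
    // @ suppresses the notice iconv() raises on an illegal sequence;
    // in that case iconv() returns false.
    $sample = @iconv($encoding, $encoding, $bytes);
    return $sample !== false && $sample === $bytes;
}

var_dump(survives_roundtrip("caf\xC3\xA9 au lait", 'UTF-8'));
var_dump(survives_roundtrip("caf\xE9 au lait", 'UTF-8'));
```

Note that a single-byte encoding in which every byte value is defined accepts any input, so the order of the candidate list decides the result for such encodings.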

matthijs at ischen dot nl 28-Mar-2009 10:33


I seriously underestimated the importance of setlocale...

<?php

$strings = array(

    "mais coisas a pensar sobre diário ou dois!",

    "plus de choses à penser à journalier ou à deux !",

    "¡más cosas a pensar en diario o dos!",

    "più cose da pensare circa giornaliere o due!",

    "flere ting å tenke på hver dag eller to!",

    "Další věcí, přemýšlet o každý den nebo dva!",

    "mehr über Spaß spät schönen",

    "më vonë gjatë fun bukur",

    "több mint szórakozás késő csodálatos kenyér"

);



$convert = array();

setlocale(LC_CTYPE, 'de_DE.UTF-8');

foreach( $strings as $string )

        $convert[] = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);

?>



Produces the following: 



Array

(

    [0] => mais coisas a pensar sobre diario ou dois!

    [1] => plus de choses a penser a journalier ou a deux !

    [2] => ?mas cosas a pensar en diario o dos!

    [3] => piu cose da pensare circa giornaliere o due!

    [4] => flere ting aa tenke paa hver dag eller to!

    [5] => Dalsi veci, premyslet o kazdy den nebo dva!

    [6] => mehr ueber Spass spaet schoenen

    [7] => me vone gjate fun bukur

    [8] => toebb mint szorakozas keso csodalatos kenyer

)



whereas 



<?php

$convert = array();

setlocale(LC_CTYPE, 'nl_NL.UTF-8');

foreach( $strings as $string )

        $convert[] = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);

?>



produces:

Array

(

    [0] => mais coisas a pensar sobre di?rio ou dois!

    [1] => plus de choses ? penser ? journalier ou ? deux !

    [2] => ?m?s cosas a pensar en diario o dos!

    [3] => pi? cose da pensare circa giornaliere o due!

    [4] => flere ting ? tenke p? hver dag eller to!

    [5] => Dal?? v?c?, p?em??let o ka?d? den nebo dva!

    [6] => mehr ?ber Spass sp?t sch?nen

    [7] => m? von? gjat? fun bukur

    [8] => t?bb mint sz?rakoz?s k?s? csod?latos keny?r

)



This might be of interest when converting UTF-8 strings into ASCII suitable for URLs and the like. It was never obvious to me because I had only used the US and NL locales.

dennis at nikolaenko dot ru 06-Oct-2008 09:18


Beware of a bug in detecting Russian encodings:

http://bugs.php.net/bug.php?id=38138

hmdker at gmail dot com 23-Aug-2008 09:58


A function to detect UTF-8; it may be useful when mb_detect_encoding is not available.



<?php

function is_utf8($str) {

    $c=0; $b=0;

    $bits=0;

    $len=strlen($str);

    for($i=0; $i<$len; $i++){

        $c=ord($str[$i]);

        if($c >= 128){ // 128 (0x80) is a continuation byte, not ASCII

            if(($c >= 254)) return false;

            elseif($c >= 252) $bits=6;

            elseif($c >= 248) $bits=5;

            elseif($c >= 240) $bits=4;

            elseif($c >= 224) $bits=3;

            elseif($c >= 192) $bits=2;

            else return false;

            if(($i+$bits) > $len) return false;

            while($bits > 1){

                $i++;

                $b=ord($str[$i]);

                if($b < 128 || $b > 191) return false;

                $bits--;

            }

        }

    }

    return true;

}

?>

yaqy at qq dot com 20-Jul-2008 10:14


<?php
/*
 * QQ: 290359552
 * Convert $str to UTF-8 if it is not detected as 'UTF-8'
 */
function convToUtf8($str)
{
    if( mb_detect_encoding($str,"UTF-8, ISO-8859-1, GBK")!="UTF-8" )
    {
        return iconv("gbk","utf-8",$str);
    }
    else
    {
        return $str;
    }
}
?>

rl at itfigures dot nl 04-Sep-2007 02:00


I used Chris's function "detectUTF8" to detect the need for conversion from utf8 to 8859-1, which works fine. I did have a problem with the following iconv-conversion.



The problem is that the iconv-conversion to 8859-1 (with //TRANSLIT) replaces the euro-sign with EUR, although it is common practice to use \x80 as the euro-sign in text labelled as 8859-1 (strictly speaking, \x80 is the euro-sign in Windows-1252, not in ISO-8859-1).



I could not use 8859-15 since that mangled some other characters, so I added 2 str_replace's:



if(detectUTF8($str)){

  $str=str_replace("\xE2\x82\xAC","&euro;",$str); 

  $str=iconv("UTF-8","ISO-8859-1//TRANSLIT",$str);

  $str=str_replace("&euro;","\x80",$str); 

}



If HTML output is needed, the last str_replace is not necessary (and even unwanted).

sunggsun 15-Aug-2006 12:26


from PHPDIG



    function isUTF8($str) {

        if ($str === mb_convert_encoding(mb_convert_encoding($str, "UTF-32", "UTF-8"), "UTF-8", "UTF-32")) {

            return true;

        } else {

            return false;

        }

    }
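
When mbstring is available anyway, the validity check can also be written with mb_check_encoding(), which avoids the two conversions. A minimal sketch; the wrapper name is_utf8_checked() is hypothetical:

```php
<?php
// Hypothetical wrapper: true when $str is valid in UTF-8
// according to mbstring's mb_check_encoding().
function is_utf8_checked(string $str): bool
{
    return mb_check_encoding($str, 'UTF-8');
}

var_dump(is_utf8_checked("na\xC3\xAFve")); // bool(true)
var_dump(is_utf8_checked("na\xEFve"));     // bool(false)
```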

chris AT w3style.co DOT uk 03-Aug-2006 02:22


Based upon the snippet below that uses preg_match(), I needed something faster and less specific. That function works and is brilliant, but it scans the entire string and checks that it conforms to UTF-8. I wanted something purely to check whether a string contains UTF-8 characters, so that I could switch the character encoding from iso-8859-1 to utf-8.

I modified the pattern to look only for non-ASCII multibyte sequences in the UTF-8 range and to stop once it finds at least one multibyte sequence. This is quite a lot faster.



<?php



function detectUTF8($string)

{

        return preg_match('%(?:

        [\xC2-\xDF][\x80-\xBF]        # non-overlong 2-byte

        |\xE0[\xA0-\xBF][\x80-\xBF]               # excluding overlongs

        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}      # straight 3-byte

        |\xED[\x80-\x9F][\x80-\xBF]               # excluding surrogates

        |\xF0[\x90-\xBF][\x80-\xBF]{2}    # planes 1-3

        |[\xF1-\xF3][\x80-\xBF]{3}                  # planes 4-15

        |\xF4[\x80-\x8F][\x80-\xBF]{2}    # plane 16

        )+%xs', $string);

}



?>

telemach 27-Jul-2005 06:48


Beware: even if you only need to distinguish between UTF-8 and ISO-8859-1, and you use the following detection order (as Chrigu suggests),

mb_detect_encoding('accentuée' , 'UTF-8, ISO-8859-1')

returns ISO-8859-1, while

mb_detect_encoding('accentué' , 'UTF-8, ISO-8859-1')

returns UTF-8.

Bottom line: an ending 'é' (and probably other accented chars) misleads mb_detect_encoding.
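
Passing true as the third argument is one way around this. The sketch below writes the two strings from this note as explicit ISO-8859-1 bytes, assuming the accented character is é stored as the single byte \xE9, and assuming the misdetection comes from non-strict mode tolerating a truncated multibyte sequence at the end of the string.

```php
<?php
$middle = "accentu\xE9e"; // 'accentuée' as ISO-8859-1 bytes
$ending = "accentu\xE9";  // 'accentué' as ISO-8859-1 bytes

// In strict mode an incomplete UTF-8 sequence disqualifies UTF-8.
var_dump(mb_detect_encoding($middle, 'UTF-8, ISO-8859-1', true)); // string(10) "ISO-8859-1"
var_dump(mb_detect_encoding($ending, 'UTF-8, ISO-8859-1', true)); // string(10) "ISO-8859-1"
```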

Chrigu 29-Mar-2005 07:32


If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding_list:

mb_detect_encoding($string, 'UTF-8, ISO-8859-1');



if you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1.

php-note-2005 at ryandesign dot com 17-Feb-2005 07:57


Much simpler UTF-8-ness checker using a regular expression created by the W3C:



<?php



// Returns true if $string is valid UTF-8 and false otherwise.

function is_utf8($string) {

    

    // From http://w3.org/International/questions/qa-forms-utf-8.html

    return preg_match('%^(?:

          [\x09\x0A\x0D\x20-\x7E]            # ASCII

        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte

        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs

        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte

        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates

        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3

        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15

        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16

    )*$%xs', $string);

    

} // function is_utf8



?>

jaaks at playtech dot com 14-Jan-2005 12:27


The last example for verifying UTF-8 has one little bug. If a 10xxxxxx byte occurs alone, i.e. not as part of a multibyte char, it is accepted although that is against UTF-8 rules. Make the following replacement to repair it.



Replace

         } // goto next char

with

         } else {

           return false; // 10xxxxxx occurring alone

         } // goto next char

maarten 12-Jan-2005 03:55


Sometimes mb_detect_encoding is not what you need. When using pdflib, for example, you want to VERIFY the correctness of UTF-8. mb_detect_encoding reports some ISO-8859-1 encoded text as UTF-8.

To verify UTF-8, use the following:



//

//    utf8 encoding validation developed based on Wikipedia entry at:

//    http://en.wikipedia.org/wiki/UTF-8

//

//    Implemented as a recursive descent parser based on a simple state machine

//    copyright 2005 Maarten Meijer

//

//    This cries out for a C-implementation to be included in PHP core

//

    function valid_1byte($char) {

        if(!is_int($char)) return false;

        return ($char & 0x80) == 0x00;

    }

    

    function valid_2byte($char) {

        if(!is_int($char)) return false;

        return ($char & 0xE0) == 0xC0;

    }



    function valid_3byte($char) {

        if(!is_int($char)) return false;

        return ($char & 0xF0) == 0xE0;

    }



    function valid_4byte($char) {

        if(!is_int($char)) return false;

        return ($char & 0xF8) == 0xF0;

    }

    

    function valid_nextbyte($char) {

        if(!is_int($char)) return false;

        return ($char & 0xC0) == 0x80;

    }

    

    function valid_utf8($string) {

        $len = strlen($string);

        $i = 0;    

        while( $i < $len ) {

            $char = ord(substr($string, $i++, 1));

            if(valid_1byte($char)) {    // continue

                continue;

            } else if(valid_2byte($char)) { // check 1 byte

                if(!valid_nextbyte(ord(substr($string, $i++, 1))))

                    return false;

            } else if(valid_3byte($char)) { // check 2 bytes

                if(!valid_nextbyte(ord(substr($string, $i++, 1))))

                    return false;

                if(!valid_nextbyte(ord(substr($string, $i++, 1))))

                    return false;

            } else if(valid_4byte($char)) { // check 3 bytes

                if(!valid_nextbyte(ord(substr($string, $i++, 1))))

                    return false;

                if(!valid_nextbyte(ord(substr($string, $i++, 1))))

                    return false;

                if(!valid_nextbyte(ord(substr($string, $i++, 1))))

                    return false;

            } // goto next char

        }

        return true; // done

    }



For a drawing of the state machine, see: http://www.xs4all.nl/~mjmeijer/unicode.png and http://www.xs4all.nl/~mjmeijer/unicode2.png