
Remove non-text chars (like emoticons) from string

How can I remove characters like 🎧🎬 from a string? Sometimes a YouTube video title contains characters like these. I don’t want to remove characters like !@#$%^&*().

I am currently using preg_replace('/[^A-Za-z0-9-]/', '', $VideoTitle);

Sample array:

$VideoTitles[]='Sia 2017 Cheap Thrills 2017 live 🎧🎬';

$VideoTitles[]='TAYLOR SWIFT - SHAKE IT OFF 🎧🎬 #1989';

Expected Output:

Sia 2017 Cheap Thrills 2017 live 
TAYLOR SWIFT - SHAKE IT OFF #1989

Answer

Code with sample input: Demo

$VideoTitles=[
    'Kilian à Dijon #4 • Vlog #2 • Primark again !? ???? - YouTube',
    'Funfesty ???? ???? on Twitter: "Je commence à avoir mal à la tête à force',
    'Sia 2017 Cheap Thrills 2017 live 🎧🎬'
];

$VideoTitles=preg_replace('/[^ -\x{2122}]\s+|\s*[^ -\x{2122}]/u','',$VideoTitles);  // remove out-of-range characters, plus the whitespace on one side only

var_export($VideoTitles);

Output:

array (
  0 => 'Kilian à Dijon #4 • Vlog #2 • Primark again !? - YouTube',
  1 => 'Funfesty on Twitter: "Je commence à avoir mal à la tête à force',
  2 => 'Sia 2017 Cheap Thrills 2017 live',
)

The above regex pattern uses a character range from \x20 to \x{2122} (space to trade mark sign). I selected this range because it should cover the vast majority of word-related characters, including letters with accents and non-English characters. (Admittedly, it also includes many non-word-related characters. You may like to use two separate ranges for greater specificity, like /[^\x{20}-\x{60}\x{7B}-\x{FF}]/ui, which case-insensitively searches two ranges: space to grave accent, and left curly bracket to latin small letter y with diaeresis.)
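As a rough sketch of the two-range variant (the $title and $clean names are only for illustration, and the file is assumed to be saved as UTF-8), note that this narrower pattern removes the bullet (U+2022) as well as the emoji, and that it does not tidy up the whitespace left behind:

```php
<?php
// Illustrative input; the emoji is written as an escape (PHP 7+) to avoid encoding surprises.
$title = "Kilian à Dijon #4 • Vlog \u{1F3AC}";

// Keep U+0020..U+0060 and U+007B..U+00FF. The i flag makes the class
// case-insensitive, so a-z (the gap between the ranges) is kept via A-Z.
$clean = preg_replace('/[^\x{20}-\x{60}\x{7B}-\x{FF}]/ui', '', $title);

var_export($clean); // 'Kilian à Dijon #4  Vlog ' (double space where the bullet was)
```

If the leftover double spaces matter, a follow-up pass such as collapsing runs of whitespace would be needed; that step is not part of the original answer.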

If you find that this range is unnecessarily generous or takes too long to process, you can make your own decision about the appropriate character range.

For instance, you might like the much lighter but less generous /[^\x20-\x7E]/u (from space to tilde). However, if you apply it to either of my French $VideoTitles above, you will mangle the text by removing legitimate letters.
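To make the trade-off concrete, here is a small comparison on a French phrase taken from the sample titles. The variable names are illustrative, and the source file is assumed to be UTF-8:

```php
<?php
$title = 'Je commence à avoir mal à la tête';

// ASCII-only range: every accented letter falls outside \x20-\x7E and is removed.
$asciiOnly = preg_replace('/[^\x20-\x7E]/u', '', $title);

// Wider range (space to trade mark sign): the accented letters survive.
$wide = preg_replace('/[^ -\x{2122}]/u', '', $title);

var_export($asciiOnly); // 'Je commence  avoir mal  la tte'
echo PHP_EOL;
var_export($wide);      // 'Je commence à avoir mal à la tête'
```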

Here is a menu of characters and their unicode numbers to help you understand what is inside the aforementioned ranges and beyond.
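If you would rather inspect a title directly, a small helper can print the code point of each character so you can see which ones fall outside your chosen range. This is only a sketch: codePoints() is a hypothetical name, not something from the answer above, and it assumes the mbstring extension (mb_ord needs PHP 7.2 or later) and UTF-8 input.

```php
<?php
// Hypothetical helper: return each character's code point in U+XXXX form.
function codePoints(string $text): array
{
    $points = [];
    // Splitting on the empty pattern in UTF-8 mode yields whole characters, not bytes.
    foreach (preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $points[] = sprintf('U+%04X', mb_ord($char, 'UTF-8'));
    }
    return $points;
}

// a, trade mark sign, headphone emoji
echo implode(' ', codePoints("a\u{2122}\u{1F3A7}")), PHP_EOL; // U+0061 U+2122 U+1F3A7
```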

*And remember to include a unicode flag u after your closing delimiter.


For completeness, I should say the literal/narrow solution for removing the two emojis would be:

$VideoTitle=preg_replace('/[\x{1F3A7}\x{1F3AC}]/u','',$VideoTitle);  // omit 2 emojis

These emojis are called “clapper board (U+1F3AC)” and “headphone (U+1F3A7)”.

User contributions licensed under: CC BY-SA