
Remove non-text chars (like emoticons) from string

How can I remove chars like 🎬🎧 from a string? Sometimes a YouTube video title contains characters like these. I don’t want to remove characters like !@#$%^&*().

I am currently using preg_replace('/[^A-Za-z0-9-]/', '', $VideoTitle);

Samples Array:

JavaScript

Expected Output:

JavaScript


Answer

Code with sample input: Demo

JavaScript

Output:

JavaScript

The above regex pattern uses a character range from \x{20} to \x{2122} (space to trade mark sign). I selected this range because it should cover the vast majority of word-related characters, including accented letters and non-English characters. (Admittedly, it also includes many non-word-related characters. For greater specificity you may prefer two separate ranges, such as /[^\x{20}-\x{60}\x{7B}-\x{FF}]/ui, which case-insensitively searches two ranges: space to grave accent, and left curly bracket to Latin small letter y with diaeresis.)
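As a rough sketch of how these two range-based patterns behave, assuming a UTF-8 source file and a made-up title (the $VideoTitle value below is hypothetical, not one of the original samples):

```php
<?php
// Hypothetical sample title; not taken from the original question.
$VideoTitle = "Les Misérables 🎬 Bande-annonce ™ 🎧";

// Wide range: keep everything from space (U+0020) through trade mark sign (U+2122).
// The u flag makes PCRE treat both pattern and subject as UTF-8.
$wide = preg_replace('/[^\x{20}-\x{2122}]/u', '', $VideoTitle);

// Two narrower ranges; the i flag lets the A-Z part of the first range
// also cover a-z, so lowercase letters are kept as well.
$narrower = preg_replace('/[^\x{20}-\x{60}\x{7B}-\x{FF}]/ui', '', $VideoTitle);

echo $wide, PHP_EOL;
echo $narrower, PHP_EOL;
```

Both calls drop the emojis and keep the accented "é"; only the wide range keeps the trade mark sign. Neither collapses the spaces left behind, so a follow-up trim or whitespace cleanup may be wanted.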

If you find that this range is unnecessarily generous or takes too long to process, you can make your own decision about the appropriate character range.

For instance, you might like the much lighter but less generous /[^\x20-\x7E]/u (from space to tilde). However, if you apply it to either of my above French $VideoTitles then you will mangle the text by removing legitimate letters.
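To illustrate the mangling, here is a minimal sketch using a hypothetical French title (not one of the original samples), assuming a UTF-8 source file:

```php
<?php
// Hypothetical sample title; not taken from the original question.
$VideoTitle = "Les Misérables 🎬";

// Printable-ASCII-only range: space (U+0020) through tilde (U+007E).
// Everything else is removed, including accented letters.
$asciiOnly = preg_replace('/[^\x20-\x7E]/u', '', $VideoTitle);

echo $asciiOnly, PHP_EOL;
```

The emoji is gone, but so is the "é", leaving "Misrables".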

Here is a menu of characters and their unicode numbers to help you understand what is inside the aforementioned ranges and beyond.

*And remember to include a unicode flag u after your closing delimiter.


For completeness, I should say the literal/narrow solution for removing the two emojis would be:

JavaScript

These emojis are called “clapper board (U+1F3AC)” and “headphone (U+1F3A7)”.
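The original snippet for this narrow approach is not reproduced here, so the following is only a sketch of what it could look like, built from the two code points named above. The title is hypothetical and uses PHP 7+ "\u{...}" escapes so the example does not depend on file encoding:

```php
<?php
// Hypothetical title: clapper board, headphone, plus a grinning face (U+1F600)
// to show that other emojis are left untouched.
$VideoTitle = "\u{1F3AC} Trailer \u{1F3A7} Mix \u{1F600}";

// Character class listing only the two emojis to remove.
// The u flag is required for \x{...} values above FF.
$clean = preg_replace('/[\x{1F3AC}\x{1F3A7}]/u', '', $VideoTitle);

echo $clean, PHP_EOL;
```

Because the class lists exact code points, any emoji not listed survives, which is why the range-based patterns are usually the more practical choice.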

User contributions licensed under: CC BY-SA