CypherNLP

From Traxel Wiki

Text Cleanup

From GPT

Removing or replacing emojis, non-ASCII characters, and other non-standard text elements simplifies text for processing tasks where these elements add no meaningful content or introduce noise. Here's how you can handle them:

1. Removing Emojis
  • Emojis are Unicode characters, so they have specific code points. You can use a regular expression to filter out these code points.
import re

def remove_emojis(text):
    """Remove characters in common emoji blocks; other scripts are left intact."""
    # Avoid catch-all ranges such as U+24C2-U+1F251: they span most of the
    # Basic Multilingual Plane and would also delete CJK, Hangul, and more.
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F700-\U0001F77F"  # alchemical symbols
        "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
        "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
        "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
        "\U0001FA00-\U0001FA6F"  # Chess Symbols
        "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
        "\U0001F1E6-\U0001F1FF"  # regional indicators (flag pairs)
        "\U00002600-\U000026FF"  # Miscellaneous Symbols
        "\U00002702-\U000027B0"  # Dingbats
        "]+"
    )
    return emoji_pattern.sub("", text)
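
With any range-based emoji pattern, it is worth checking that ordinary non-emoji text, including non-Latin scripts and accented letters, passes through unchanged. The sketch below uses a deliberately narrow pattern (an assumption made for illustration, not full emoji coverage) and a hypothetical helper name, strip_emojis, so that it runs on its own:

```python
import re

# Deliberately narrow pattern for illustration (an assumption, not full emoji coverage).
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "]+"
)

def strip_emojis(text):
    """Hypothetical helper: delete every run of characters matched by EMOJI_RE."""
    return EMOJI_RE.sub("", text)

samples = ["Great job 👍🎉", "東京 is fun 🚀", "naïve café"]
for s in samples:
    # Emojis disappear; CJK and accented Latin letters are untouched.
    print(repr(strip_emojis(s)))
```

Removing an emoji can leave a trailing or doubled space behind, so a whitespace cleanup pass afterwards is often useful.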
2. Handling Unicode Characters
  • If you want to keep only ASCII characters (and remove all other Unicode characters):
def keep_ascii_only(text):
    return text.encode('ascii', 'ignore').decode('ascii')
  • If you want to replace non-ASCII characters with a placeholder or space:
def replace_non_ascii(text, replacement=' '):
    return ''.join([char if ord(char) < 128 else replacement for char in text])
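
The two approaches differ in what they leave behind: dropping characters can fuse or shorten words, while replacing them with spaces tends to produce runs of whitespace. A minimal sketch comparing the two, with both functions restated so the block is self-contained; collapse_whitespace is a hypothetical helper added here for illustration:

```python
import re

def keep_ascii_only(text):
    # Drop every character that cannot be encoded as ASCII.
    return text.encode("ascii", "ignore").decode("ascii")

def replace_non_ascii(text, replacement=" "):
    # Substitute each non-ASCII character with the replacement string.
    return "".join(ch if ord(ch) < 128 else replacement for ch in text)

def collapse_whitespace(text):
    # Hypothetical helper: squeeze whitespace runs to one space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

# "café ☕ time", written with escapes so the code points are unambiguous.
sample = "caf\u00e9 \u2615 time"

print(repr(keep_ascii_only(sample)))                         # 'caf  time'
print(repr(replace_non_ascii(sample)))                       # 'caf    time'
print(repr(collapse_whitespace(replace_non_ascii(sample))))  # 'caf time'
```

Note that both variants lose the "é" entirely rather than mapping it to "e".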
3. Using Libraries
  • Libraries like unidecode can be used to convert Unicode text into its closest ASCII representation. This is especially useful for transliterating accented letters and characters from other scripts into plain Latin (ASCII) characters.
from unidecode import unidecode

text = "Mëtàl"
print(unidecode(text))  # Output: "Metal"
  • Installation: pip install unidecode
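
If adding a third-party dependency is not an option, the standard library can approximate this for accented Latin letters by decomposing characters with NFKD normalization and then dropping whatever is not ASCII. This is a rough sketch of a swapped-in technique, not a replacement for unidecode; the function name ascii_fold is hypothetical:

```python
import unicodedata

def ascii_fold(text):
    """Hypothetical helper: strip accents via NFKD, then drop remaining non-ASCII."""
    # NFKD splits "ë" into "e" plus a combining diaeresis; the combining
    # mark is non-ASCII and is discarded by the 'ignore' error handler.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(ascii_fold("M\u00ebt\u00e0l"))  # Metal
print(ascii_fold("Stra\u00dfe"))      # Strae  ("ß" has no decomposition, so it is lost)
```

The second line shows the limitation: characters without a Unicode decomposition are silently dropped, whereas a transliteration library can map them to a sensible ASCII spelling.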
4. Stripping or Replacing Specific Unicode Categories
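  • Sometimes, you might want to remove or replace specific categories of Unicode characters, such as mathematical symbols (Sm) or space separators (Zs, which covers the non-breaking space but also the ordinary space). Python's unicodedata module reports each character's two-letter general category via unicodedata.category.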
import unicodedata

def strip_unicode_categories(text, categories):
    return ''.join([char for char in text if unicodedata.category(char) not in categories])

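Then, to remove, for instance, characters in the "Letter, other" category (Lo, which covers caseless letters such as CJK ideographs and Hebrew or Arabic letters) and mathematical symbols (Sm), you can call: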

cleaned_text = strip_unicode_categories(text, ["Lo", "Sm"])
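
Before choosing categories to strip, it helps to inspect which category each character actually belongs to, since the codes are not always intuitive. A small sketch, with the function restated so the block is self-contained and with sample strings chosen only for illustration:

```python
import unicodedata

def strip_unicode_categories(text, categories):
    # Keep only characters whose general category is not in `categories`.
    return "".join(ch for ch in text if unicodedata.category(ch) not in categories)

# Inspect the general category of a few sample characters.
for ch in ["a", " ", "\u00a0", "+", "\u2264", "\u4e2d", "\u2122"]:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)}")
# U+0061 Ll, U+0020 Zs, U+00A0 Zs, U+002B Sm, U+2264 Sm, U+4E2D Lo, U+2122 So

print(repr(strip_unicode_categories("a \u2264 b", {"Sm"})))        # 'a  b'
print(repr(strip_unicode_categories("abc \u4e2d\u6587", {"Lo"})))  # 'abc '
```

Plain ASCII characters such as "+" and "=" are also in Sm, and the ordinary space shares Zs with the non-breaking space, so stripping a whole category can remove more than intended.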

Remember to consider the context and purpose of your text processing to determine which simplifications are necessary and how aggressive you want to be.