Removing Invalid Characters

Does anyone have a generic method (V16) that removes invalid/non-printing characters from an input object when a user pastes text that contains these types of characters?

We occasionally get users who basically copy data from a mainframe system and it brings along a number of non-printable characters which need to be filtered out.

Thanks!

Steve

Most things are possible with regex, but you first have to define which characters are considered illegal.

Also, since Unicode has well over 100,000 characters, the list of acceptable characters is generally the shorter one.
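
To illustrate the allow-list idea, here is a minimal sketch that only detects the first character outside an accepted set. The accepted set (printable ASCII plus TAB, CR and LF), the sample text and the variable names are assumptions for illustration, not something from the original question:

<code 4D>
C_TEXT($input;$notAllowedPattern)
C_LONGINT($pos;$len)
$input:="abc"+Char(7)+"def"  // sample text containing a BEL control character
  // Assumption: only printable ASCII plus TAB, CR and LF are acceptable
  // The negated class matches any character that is NOT on the allow-list
$notAllowedPattern:="[^\\x20-\\x7E\\t\\r\\n]"
If (Match regex($notAllowedPattern;$input;1;$pos;$len))
	ALERT("First unacceptable character at position "+String($pos))
End if 
</code 4D>

Mainframe data often contains accented letters as well, so a real allow-list would probably need to be wider than plain ASCII.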

Yes, first clarify exactly what you mean by “invalid/non-printing characters”.
Once you know which characters are invalid (or dangerous) for your use case,
you can replace them in a loop in your method.

No, I have no ready-to-go solution for this, such as a filter wizard you call
and get back a text that is valid for every use case.

When you replace characters, it depends on the situation whether you
replace them with an empty string or, for example, with a space character.
Replacing with an empty string changes the length of the original text,
which is sometimes not wanted. Sometimes you also want to preserve a separation,
for example by replacing TAB with SPACE (and not with an empty string).
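
A small sketch of that difference, using the standard Replace string command on a made-up sample value (the variable names are only for illustration):

<code 4D>
C_TEXT($srcTxt;$shorter;$sameLength)
$srcTxt:="A"+Char(Tab)+"B"
  // Replacing with an empty string removes the separation and shortens the text
$shorter:=Replace string($srcTxt;Char(Tab);"")  // "AB", length 2 instead of 3
  // Replacing with a space keeps both the length and the separation
$sameLength:=Replace string($srcTxt;Char(Tab);" ")  // "A B", length 3
</code 4D>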

Do replacements based on regex with a REPEAT loop (example only, not ready to use):
<code 4D>
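$resultTxt:=""
$startPos:=1
Repeat 
	$result:=Match regex($regexPattern;$srcTxt;$startPos;$posFound;$lengthFound)
	If ($result & ($lengthFound>0))
		  // Copy the text in front of the match, then the replacement
		$resultTxt:=$resultTxt+Substring($srcTxt;$startPos;$posFound-$startPos)+$replTxt
		$startPos:=$posFound+$lengthFound  // continue behind the match
	Else 
		$result:=False  // no match (or an empty match): stop
	End if 
Until (Not($result) | ($startPos>Length($srcTxt)))
$resultTxt:=$resultTxt+Substring($srcTxt;$startPos)  // append the remaining text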
</code 4D>

Do replacements without regex by looping over every character of the source text with FOR (example only, not ready to use):
<code 4D>
For ($charpos;Length($srcTxt);1;-1)
	$charcode:=Character code($srcTxt[[$charpos]])
	If ($charcode<32)
		Case of 
			: ($charcode=NUL ASCII code)  // Warning: the character codes 0, 65534 (FFFE) and 65535 (FFFF) are reserved for Unicode in 4D v11 and must never be inserted into a text!
				$charcode_OK:=False
			: ($charcode=Tab key)  // OK
				$charcode_OK:=True
			: ($charcode=Line feed)  // tolerable, but a user should not really enter it in a 4D text field
				$charcode_OK:=False
			: ($charcode=Carriage return)  // OK
				$charcode_OK:=True
			Else 
				$charcode_OK:=False
		End case 
	Else 
		Case of 
			: ($charcode=DEL ASCII code)
				$charcode_OK:=False
			: ($charcode=8232)  // LINE SEPARATOR (U+2028)
				$charcode_OK:=False
			: ($charcode=8233)  // PARAGRAPH SEPARATOR (U+2029)
				$charcode_OK:=False
			: (($charcode>55295) & ($charcode<57344))
				  // Character codes between 55296 and 57343 may be dangerous on Windows and Mac
				  // -------------------------------------------------
				  // HighSurrogates:         55296(D800) - 56191(DB7F)
				  // HighPrivUseSurrogates:  56192(DB80) - 56319(DBFF)
				  // -------------------------------------------------
				  // LowSurrogates:          56320(DC00) - 57343(DFFF)
				  // -------------------------------------------------
				  // A high surrogate (55296-56319) as the last char (LastByte) crashed 4D with: CONVERT FROM TEXT("UTF-8"), Quick Report (generate text file), SEND PACKET(text), etc. (Windows event 1000, application error, faulting module ntdll.dll)
				  // A high surrogate (55296-56319) or an orphaned low surrogate (56320-57343) as the last char makes the display of output forms go crazy (Windows only): memory remnants suddenly appear somewhere on screen. Hard to describe, better to see it for yourself.
				$charcode_OK:=False
			: ($charcode=65534)  // reserved for Unicode, see the warning at NUL above
				$charcode_OK:=False
			: ($charcode=65535)  // reserved for Unicode, see the warning at NUL above
				$charcode_OK:=False
			: ($charcode>65535)  // below 0 or above 65535 it is not a character code
				$charcode_OK:=False
			Else   // OK
				$charcode_OK:=True
		End case 
	End if 
	If (Not($charcode_OK))
		  // Example action: drop the character (or overwrite it with a space to keep the length).
		  // Looping backwards keeps the remaining positions valid.
		$srcTxt:=Delete string($srcTxt;$charpos;1)
	End if 
End for 
</code 4D>

A few RegEx pattern examples (please build your own, because only you know which character codes you want to filter):
<code 4D>
Case of
: ($chooseKey="digits")  // Match all digits
	$regExPattern:="\\d"

: ($chooseKey="nonDigits")  // Match all non digits
	$regExPattern:="\\D"
	
: ($chooseKey="nulchar")  // Match a NUL character
	$regExPattern:="\\0"
	
: ($chooseKey="ctrlCodes")  // Match all control codes
	$regExPattern:="[[:cntrl:]]"
	
: ($chooseKey="ctrlCodes2")  // Match all control codes (2.Alternative)
	$regExPattern:="\\p{C}"
	
: ($chooseKey="anyCharWithoutForbiddenCtrls")  // Matches all chars(non ctrlCodes) and only CR+LF+TAB+verticalTab (without any other ctrlCode)
	$regExPattern:="[^\\p{C}]|[\\r\\n\\t\\v]"
	
: ($chooseKey="nonASCII")  // Match all non ascii
	$regExPattern:="[^\\x00-\\x7F]+\\ *(?:[^\\x00-\\x7F]| )*"		
	
: ($chooseKey="myTestSet")  // Match all (caseINSENSITIVE) non[wordchar|€|.|:|,|-|+|\n|\r|\t]
	$regExPattern:="(?i)[^\\w€.:,+\\-\\n\\r\\t]"  // "-" is escaped so that ",-+" is not read as a range

End case
</code 4D>
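
As a possible starting point for a generic method, the regex loop could be wrapped in a project method that takes the pattern as a parameter. This is only a sketch: the method name TEXT_ReplaceByRegex and its parameter order are made up for this example, and it has not been tested against real mainframe data.

<code 4D>
  // Project method: TEXT_ReplaceByRegex (hypothetical name)
  // $1 = source text ; $2 = regex pattern ; $3 = replacement text
  // $0 = source text with every match of the pattern replaced
C_TEXT($0;$1;$2;$3)
C_LONGINT($start;$pos;$len)
C_BOOLEAN($found)
$0:=""
$start:=1
$found:=(Length($1)>0)  // nothing to search in an empty text
While ($found)
	$found:=Match regex($2;$1;$start;$pos;$len)
	If ($found & ($len>0))
		  // Keep the text in front of the match, then add the replacement
		$0:=$0+Substring($1;$start;$pos-$start)+$3
		$start:=$pos+$len
		$found:=($start<=Length($1))  // stop when the end of the text is reached
	Else 
		$found:=False  // no match, or an empty match that would loop forever
	End if 
End while 
$0:=$0+Substring($1;$start)  // append whatever follows the last match
</code 4D>

A call could then look like this, here removing the ASCII control codes except TAB, LF and CR, plus DEL ($dirty and $clean are placeholder variables):

<code 4D>
$clean:=TEXT_ReplaceByRegex($dirty;"[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F]";"")
</code 4D>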

You can also have a look at the documentation (all about strings and converting text to and from character sets):
https://doc.4d.com/4Dv17R5/4D/17-R5/String.201-4127142.en.html
https://doc.4d.com/4Dv17R5/4D/17-R5/CONVERT-FROM-TEXT.301-4128456.en.html
https://doc.4d.com/4Dv17R5/4D/17-R5/Convert-to-text.301-4128457.en.html

Lutz,

Thank you for the response and code, very much appreciated!

Best,

Steve