Text manipulation with GAMAP
Revision as of 14:03, 16 April 2008
String basics
Creating text and numeric strings
We may form a string of text characters in IDL in the following ways:
- by placing text between single or double quotes
- by converting a number with IDL's STRING function
For example:
 ; Create a text string
 IDL> str1 = 'hello world'
 IDL> help, str1
 STR1 STRING = 'hello world'

 ; Create a numeric string
 IDL> num2 = 3.14159
 IDL> str2 = string( num2 )
 IDL> help, str2
 STR2 STRING = ' 3.14159'

 ; Strip leading and trailing white space
 IDL> str2 = strtrim( str2, 2 )
 IDL> help, str2
 STR2 STRING = '3.14159'
In the last example, we used IDL's STRTRIM function to strip the leading and trailing whitespace.
Equivalence of strings and byte arrays
In IDL, a string of text characters is equivalent to an array of byte values. A byte is a collection of 8 bits and may express values from 0 to 255. The original ASCII table defines 128 values (0-127); extended 8-bit character sets add the values 128-255 for special characters, giving 256 values in total. One byte represents a single ASCII text character.
This means that it is easy to convert between strings and bytes in IDL. If you have an array of bytes, you can use any of the IDL string routines on them, for example:
 IDL> byte_array = [ 72B, 69B, 76B, 76B, 79B ]
 IDL> help, byte_array
 BYTE_ARRAY BYTE = Array[5]
 IDL> print, strtrim( byte_array, 2 )
 HELLO
GAMAP comes with a very useful routine called STR2BYTE. This allows you to take a text string and to convert it into the equivalent array of bytes.
 IDL> str = 'IDL is neat!'
 IDL> byte_array = str2byte( str, strlen( str ) )
 IDL> help, byte_array
 BYTE_ARRAY BYTE = Array[12]
 IDL> print, byte_array
 73 68 76 32 105 115 32 110 101 97 116 33
Note that we used IDL's STRLEN function to return the length of the string.
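Where GAMAP is not installed, the same round trip can be sketched with IDL's built-in BYTE and STRING functions. This is only an illustration: the variable names are ours, and the statements are written as a main-level program (saved to a file and executed with .run), not taken from GAMAP.

```idl
; Convert a string to its byte values with the built-in BYTE function
str = 'IDL'
b   = byte( str )
print, b                ; prints 73 68 76

; Convert the bytes back to a string with the built-in STRING function
print, string( b )      ; prints IDL
end
```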
Representing special characters
We must specify some special non-printing ASCII characters by their byte value. For example, the horizontal tab character has ASCII value 9, so we may specify it as:
 IDL> tab = 9B
 IDL> help, tab
 TAB BYTE = 9
 IDL> str = 'hello' + string(tab) + 'world'
 IDL> print, str
 hello   world
For more information about IDL's string functions, please see http://idlastro.gsfc.nasa.gov/idl_html_help/Strings.html.
Locating text within a string
The following routines can be used to locate text within a string variable:
- STRPOS
- IDL routine to test for the existence of a substring within a string
- STRWHERE
- GAMAP routine that returns the locations of a single character within a string
- STRRIGHT
- GAMAP routine that returns the last N characters from a string
IDL's STRPOS routine is an easy way to test whether a given substring is located within a larger string. It returns the position of the substring, counting from 0:
 IDL> print, strpos( 'She sells seashells by the seashore', 'sea' )
 10
Note that even though the substring "sea" occurs twice in the above string, STRPOS will only return the location of the first occurrence.
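To find the later occurrences as well, one option is to call STRPOS repeatedly, using its optional third argument (the position at which the search starts). The loop below is a minimal sketch with illustrative variable names, written as a main-level program to be executed with .run:

```idl
; Find every occurrence of a substring by restarting STRPOS
; one character past the previous match
str = 'She sells seashells by the seashore'
sub = 'sea'
pos = strpos( str, sub )

; STRPOS returns -1 when no further match exists
while ( pos ge 0 ) do begin
   print, pos                          ; prints 10, then 27
   pos = strpos( str, sub, pos + 1 )
endwhile
end
```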
GAMAP's STRWHERE function returns the location of a single character in a larger string.
 IDL> print, strwhere( 'anthony aardvark asked about auditory access', 'a' )
 0 8 9 13 17 23 29 38
GAMAP's STRRIGHT function returns the last N characters from a string.
 IDL> print, strright( 'anthony aardvark asked about auditory access', 6 )
 access
Replacing characters in a string
The following routines can be used to replace text within a string variable:
- STRPUT
- IDL routine to insert text into a string
- REPLACE_TOKEN
- GAMAP routine that replaces occurrences of tokens with text. Can also be used to expand wildcards with a name list.
- STRREPL
- GAMAP routine that replaces all occurrences of one character in a string with another character.
IDL's STRPUT procedure is one way to insert characters into a string of text. It overwrites the existing characters in place, starting at the given position:
 IDL> str1 = 'Now is the winter of our discontent'
 IDL> strput, str1, 'summer', 11
 IDL> print, str1
 Now is the summer of our discontent
However, this requires that you provide the location in the string where the text replacement will take place. In the above example, we insert the text at character 11 (the 1st character in a string is always character 0).
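If the position is not known in advance, it can be looked up with STRPOS before calling STRPUT. The following is a sketch using only built-in IDL routines, written as a main-level program to be executed with .run. Because STRPUT overwrites characters and does not change the string length, it only works cleanly here because 'winter' and 'summer' have the same number of characters.

```idl
; Look up the position with STRPOS instead of hard-coding it
str1 = 'Now is the winter of our discontent'
pos  = strpos( str1, 'winter' )

; STRPOS returns -1 when the substring is absent, so guard the call
if ( pos ge 0 ) then strput, str1, 'summer', pos
print, str1             ; prints Now is the summer of our discontent
end
```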
The above task is much more easily accomplished with GAMAP's REPLACE_TOKEN function:
 IDL> str1 = 'Now is the winter of our discontent'
 IDL> str2 = replace_token( str1, 'winter', 'summer', delim='' )
 IDL> print, str2
 Now is the summer of our discontent
With REPLACE_TOKEN you do not need to know the position in the string where the replacement text will be inserted.
GAMAP also has another function called STRREPL that allows you to replace multiple instances of a single character in a string. For example:
 IDL> print, strrepl( 'Mississippi', 'i', 'a' )
 Massassappa
However, if you need to replace an entire word rather than single characters, it is better to use REPLACE_TOKEN.
Splitting strings into substrings
You can split a string into individual substrings with GAMAP's STRBREAK function.
 ; Use STRBREAK to split the line by spaces
 IDL> result = strbreak( 'The sunshine of our li..ii..ii..ii..ife', ' ' )
 IDL> for j = 0, n_elements( result )-1 do print, result[j]
 The
 sunshine
 of
 our
 li..ii..ii..ii..ife

 ; Use STRBREAK to split the line by commas
 IDL> result = strbreak( 'Parsley,Sage,Rosemary,and Thyme', ',' )
 IDL> for j = 0, n_elements( result )-1 do print, result[j]
 Parsley
 Sage
 Rosemary
 and Thyme
We recommend that you use GAMAP's STRBREAK rather than IDL's STRSPLIT or STR_SEP routines. STR_SEP was the standard routine to separate strings until IDL 5.2. In IDL 5.3 and higher, STR_SEP was obsoleted and replaced with the new STRSPLIT routine.
- If you are using IDL 5.2 or lower, then STRBREAK will call STR_SEP to break the string.
- If you are using IDL 5.3 or higher, then STRBREAK will call STRSPLIT to break the string.
Therefore, STRBREAK will work properly regardless of which version of IDL you are using.
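For reference, a direct call to the built-in STRSPLIT in IDL 5.3 or higher might look like the sketch below. It illustrates the underlying IDL routine only; it is not a copy of how STRBREAK calls it. It is written as a main-level program to be executed with .run.

```idl
; Split a comma-separated line with the built-in STRSPLIT (IDL 5.3+)
; /EXTRACT returns the substrings rather than their start positions
result = strsplit( 'Parsley,Sage,Rosemary,and Thyme', ',', /extract )

; Print one substring per line: Parsley, Sage, Rosemary, and Thyme
for j = 0, n_elements( result )-1 do print, result[j]
end
```

By default STRSPLIT treats each character of its second argument as a separate delimiter, so a multi-character separator needs extra care.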
GAMAP's string inquiry functions
GAMAP ships with the following string inquiry functions:
- ISALGEBRAIC
- Locates the positions of algebraic characters in a string (i.e. characters that are digits, '.', or +/- signs).
- ISALNUM
- Locates the position of alphanumeric characters ( A...Z, a...z, 0..9 ) in a string.
- ISALPHA
- Locates the positions of alphabetic characters ( A...Z, a...z ) in a string.
- ISDIGIT
- Locates the positions of numeric characters ( '0' ... '9') in a string.
- ISGRAPH
- Locates the positions of graphics characters (i.e. printable characters excluding SPACE) in a string.
- ISLOWER
- Locates the positions of lowercase alphabetic characters in a string.
- ISPRINT
- Locates the positions of all printable characters (including SPACE) in a string.
- ISSPACE
- Locates the positions of all white space characters in a string.
- ISUPPER
- Locates the positions of all uppercase alphabetic characters in a string.
Each of the above routines returns a vector with one element per character in the string: 1 where the character satisfies the given criterion and 0 where it does not.
Some examples:
 IDL> str = '#99# Bottles of *Beer* on the Wall!'
 IDL> print, isalgebraic( str ), format='(35i1)'
 01100000000000000000000000000000000
 IDL> print, isalnum( str ), format='(35i1)'
 01100111111101100111100110111011110
 IDL> print, isalpha( str ), format='(35i1)'
 00000111111101100111100110111011110
 IDL> print, isdigit( str ), format='(35i1)'
 01100000000000000000000000000000000
 IDL> print, isgraph( str ), format='(35i1)'
 11110111111101101111110110111011111
 IDL> print, islower( str ), format='(35i1)'
 00000011111101100011100110111001110
 IDL> print, isprint( str ), format='(35i1)'
 11111111111111111111111111111111111
 IDL> print, isspace( str ), format='(35i1)'
 00001000000010010000001001000100000
 IDL> print, isupper( str ), format='(35i1)'
 00000100000000000100000000000010000
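A 0/1 vector of this kind is usually passed to IDL's WHERE function to obtain the character positions themselves. The sketch below builds a digit mask from the byte values using only built-in routines, so it does not depend on GAMAP. It is meant to illustrate the idea and is not the ISDIGIT source code. It is written as a main-level program to be executed with .run.

```idl
; Build a 0/1 mask for digits from the byte values ('0' = 48, '9' = 57)
str  = '#99# Bottles of *Beer* on the Wall!'
b    = byte( str )
mask = ( b ge 48B ) and ( b le 57B )

; WHERE converts the mask into character positions
ind = where( mask, count )
if ( count gt 0 ) then print, ind      ; prints 1 2

; STRMID extracts the digits; this shortcut is only valid here
; because the digits form one contiguous run
if ( count gt 0 ) then print, strmid( str, ind[0], count )   ; prints 99
end
```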
GAMAP's string formatting functions
GAMAP ships with the following string formatting functions:
- STRSCI
- Converts a number to a string in scientific notation format ( e.g. A x 10^B )
- STRCHEM
- Superscripts or subscripts numbers and special characters ('x', 'y') found in strings containing names of chemical species.
STRSCI can be used to convert a number into a string in scientific notation, for plotting purposes:
 IDL> str = STRSCI( 2000000, format='(i1)' )
 IDL> print, str
 2 x 10!u6!n
STRCHEM can be used to create strings with superscripts and subscripts (e.g. H2O, 222Rn) for plotting purposes:
 IDL> print, strchem( 'NOx', /sub )
 NO!lx!n
 IDL> print, strchem( '222Rn', /sup )
 !u2!n!u2!n!u2!nRn
--Bmy 09:58, 16 April 2008 (EDT)