ronaldduncan


Text File formats – ASCII Delimited Text – Not CSV or TAB delimited text

In ASCII, development, File Formats, software, technology on October 31, 2009 at 3:09 pm

Unfortunately, a quick Google search for “ASCII Delimited Text” shows that IBM and Oracle failed to read the ASCII specification: both define ASCII Delimited Text as a CSV format. ASCII Delimited Text should use the separator characters defined as ASCII 28-31.

The most common formats are CSV (Comma Separated Values) and tab delimited text. Tab delimited text breaks whenever a field contains a tab or a new line, and CSV breaks, depending on the implementation, on quotes, commas and new lines. Sadly, quotes, commas and tab characters are very common in text, which makes these formats extremely bad for exporting and importing data. There are other formats, such as pipe (|) delimited text. These are better in that | is used less frequently, but the delimiter is still a printable character that people type into text. Worst of all, when people look at a file format and see the delimiter, they think it is a good idea to break things up within fields using that same delimiter.
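To make the failure concrete, here is a minimal sketch of a naive tab delimited writer and reader. The function names `write_tsv` and `read_tsv` and the sample rows are made up for illustration; real exporters vary, but any that does not escape its delimiters has the same problem.

```python
# Hypothetical naive tab delimited writer/reader, for illustration only.

def write_tsv(rows):
    # Join fields with tabs and records with new lines, with no escaping.
    return "\n".join("\t".join(row) for row in rows)

def read_tsv(text):
    # Split on new lines, then on tabs.
    return [line.split("\t") for line in text.split("\n")]

rows = [
    ["1", "a note with a\ttab in it"],
    ["2", "a note with two\nlines"],
]

round_tripped = read_tsv(write_tsv(rows))

# The embedded tab adds a field to the first record, and the embedded
# new line splits the second record in two.
print(round_tripped == rows)
print(len(rows), "records written,", len(round_tripped), "records read")
```

Two records go in and three come out, and the first one has gained a field.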

The most annoying thing about the whole problem is that it was solved by design in the ASCII character set.

If you use ASCII 31 as your field separator instead of comma or tab, and ASCII 30 as your record separator instead of new line, then you have a text file format that is trivial to write out and read in, with no restrictions on the printable text in fields and no need to escape characters.
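A minimal sketch of such a writer and reader in Python follows. The names `dumps` and `loads` are my own, and the sketch assumes that field text never contains ASCII 30 or 31 itself and that every record is terminated (not just separated) by ASCII 30.

```python
US = "\x1f"  # ASCII 31, Unit Separator: goes between fields
RS = "\x1e"  # ASCII 30, Record Separator: ends each record

def dumps(rows):
    # Assumption: no field contains US or RS, so nothing needs escaping.
    return "".join(US.join(row) + RS for row in rows)

def loads(text):
    records = text.split(RS)
    # Each record is terminated by RS, so the final split piece is empty.
    if records and records[-1] == "":
        records.pop()
    return [record.split(US) for record in records]

rows = [
    ["1", 'He said "hello, world"', "tab\there"],
    ["2", "two\nlines", "pipe | and comma ,"],
]

# Quotes, commas, tabs, pipes and new lines all survive unchanged.
print(loads(dumps(rows)) == rows)
```

Note that `str.split` is used rather than `str.splitlines`, because Python's `splitlines` treats ASCII 28-30 as line boundaries.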

It is even part of the design of the character encoding. The ASCII standard calls these characters

  • 31 Unit Separator
  • 30 Record Separator

And ASCII has two more levels, with Group and File Separators

  • 29 Group Separator
  • 28 File Separator
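The four levels nest naturally: units inside records, records inside groups, groups inside files. As a rough sketch (the `encode` and `decode` names and the nested-list layout are assumptions of mine, not part of any standard), each level can be joined and split with its own separator. This version places the separators between items and assumes every level holds at least one item and that no field contains ASCII 28-31.

```python
FS = "\x1c"  # ASCII 28, File Separator
GS = "\x1d"  # ASCII 29, Group Separator
RS = "\x1e"  # ASCII 30, Record Separator
US = "\x1f"  # ASCII 31, Unit Separator

def encode(files):
    # files -> groups -> records -> fields, each level joined by its separator.
    return FS.join(
        GS.join(
            RS.join(US.join(record) for record in group)
            for group in file
        )
        for file in files
    )

def decode(text):
    # Reverse of encode: split from the outermost level inwards.
    return [
        [
            [record.split(US) for record in group.split(RS)]
            for group in file.split(GS)
        ]
        for file in text.split(FS)
    ]

data = [
    [  # first file
        [["id", "name"], ["1", "Smith, John"]],   # group of records
        [["id", "note"], ["1", "likes\ttabs"]],   # another group
    ],
    [  # second file
        [["k", "v"], ["x", "multi\nline"]],
    ],
]

print(decode(encode(data)) == data)
```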

See http://en.wikipedia.org/wiki/Unit_separator and
http://en.wikipedia.org/wiki/Delimiter#ASCII_Delimited_Text

In summary, ASCII Delimited Text means using the last four control characters (28-31) for their intended purpose as field and record delimiters, not using CSV (Comma Separated Values).
