ronaldduncan

Text File formats – ASCII Delimited Text – Not CSV or TAB delimited text

In ASCII, development, File Formats, software, technology on October 31, 2009 at 3:09 pm

Unfortunately a quick Google search on “ASCII Delimited Text” shows that IBM and Oracle failed to read the ASCII specification and both define ASCII Delimited Text as a CSV format. ASCII Delimited Text should use the separator characters defined at ASCII 28-31.

The most common formats are CSV (Comma Separated Values) and tab delimited text.  Tab delimited text breaks whenever a field contains a tab or a new line, and CSV breaks, depending on the implementation, on quotes, commas and new lines. Sadly quotes, commas and tab characters are very common in text, which makes these formats extremely bad for exporting and importing data.  There are some other formats such as pipe (|) delimited text, and whilst they are better in that | is used less frequently, they still suffer from being printable characters that get typed into text. Worst of all, people who look at a file format and see the delimiter think it is a good idea to break things up within fields using the same delimiter as the file format.
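
As a quick Python illustration (the row below is invented), a single tab typed into a field is enough to break a naive tab delimited round trip:

    # Illustrative sketch: the row data is made up; the point is that an embedded
    # tab makes a naive tab delimited record read back with the wrong field count.
    row = ["Acme Ltd", "Notes:\tcall back Monday", "London"]   # middle field contains a tab

    line = "\t".join(row)         # write one tab delimited record
    fields = line.split("\t")     # read it back in

    print(len(row), len(fields))  # 3 4 -- the record no longer round-trips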

The most annoying thing about the whole problem is that it was solved by design in the ASCII character set.

If you use ASCII 31 as your field separator instead of comma or tab, and ASCII 30 as your record separator instead of new line, then you have a text file format that is trivial to write out and read in, with no restrictions on the text in fields and no need to escape characters.

It is even part of the design of the character encoding itself.  The ASCII standard names these characters

  • 31 Unit Separator
  • 30 Record Separator

And ASCII has two more levels, with Group and File Separators

  • 29 Group Separator
  • 28 File Separator

See http://en.wikipedia.org/wiki/Unit_separator and
http://en.wikipedia.org/wiki/Delimiter#ASCII_Delimited_Text

In summary, ASCII Delimited Text uses the last four control characters (28-31) for their intended purpose as field and record delimiters, rather than using CSV (Comma Separated Values).
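
As a rough sketch in Python (the records below are invented), writing and reading ASCII Delimited Text needs nothing more than a join and a split:

    # Sketch of ASCII Delimited Text using the separators described above.
    # The records are invented; the point is that no quoting or escaping is needed.
    US = "\x1f"  # ASCII 31, Unit Separator   (between fields)
    RS = "\x1e"  # ASCII 30, Record Separator (between records)

    records = [
        ["id", "name", "notes"],
        ["1", "Smith, John", 'Said "hello"\nthen left'],  # commas, quotes and new lines are all fine
    ]

    # Write out: join fields with US and records with RS.
    data = RS.join(US.join(fields) for fields in records)

    # Read back in: split on RS and then on US.
    parsed = [record.split(US) for record in data.split(RS)]

    assert parsed == records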

  1. I’ve *always* wondered why nobody ever used these characters for non-hierarchical information, and I personally think XML is overkill in most applications.

    CSVs suffer from a standardization problem. Since this pretty much obviates the need for escaping except perhaps with BLOBs, I don’t know why nobody ever thinks to do it.

    Furthermore, terminals could print them if they really wanted to with a switch to allow easy reading. UTF-8 should continue to let us use them verbatim, so I don’t see that as a problem, either.

    I think the file and group chars were used on tapes; otherwise this has all been forgotten. Too bad, it would make life a lot easier.

  2. Thanks for sharing that.

    cheers

  3. Good point.

    So, if somebody decided to implement an ADT format, should the ‘official’ format name remain ‘ASCII Delimited Text’, and what extension should these files carry?

    With the acceptance of the ASCII sub-set into UTF-8 I personally think just ‘Delimited Text’ would be more appropriate. The term ‘ASCII’ itself just feels anachronistic these days.

    I have actually implemented a complete (ie RFC 4180) CSV parser in JavaScript and it was a complete pain in the ass. The RFC creates a decent initial baseline but it’s still missing 3 rules (involving handling null values) and it’s very challenging to implement.

    By far the worst case is where new-line characters appear inside values (delimited by quotes). In that case, no variation of split([newlines]) will work without breaking. Getting around that takes nothing less than an FSM (Finite State Machine).
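
    To illustrate with made-up data in Python: any split on new lines cuts a quoted value in half, and it takes a stateful parser (such as Python’s csv module) to get it right:

      import csv, io

      # Invented CSV with a quoted value that contains a new line.
      data = 'id,comment\n1,"first line\nsecond line"\n'

      # Splitting on new lines sees three lines and cuts the quoted value in half.
      print(data.splitlines())   # ['id,comment', '1,"first line', 'second line"']

      # A stateful parser keeps the quoted new line inside the field.
      rows = list(csv.reader(io.StringIO(data)))
      print(rows)                # [['id', 'comment'], ['1', 'first line\nsecond line']]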

    Just for a sanity check I’m seriously debating whether I want to release a library that makes use of ASCII Delimited Text. I don’t see how any other text-based format could be more efficient and cross-platform compatible. It’s about time somebody steps up and creates a working implementation.

    If you have any interest, feel free to contact me.

    • Thanks Evan,
      My company @UK uses it for all our internal ETL activities since it is a simple, clean way of transferring data. I have been looking at it as a way of transferring XML data in a more efficient format. The problem is that XML has both elements and attributes, which is a duplication that makes it more difficult to represent in a cleaner structure without having to have a start/end tag.

      XML also has the maddest escape logic I have come across: the only character that should need escaping is <, yet it feels like lots and lots of arbitrary characters are escaped.

      Regards Ronald

  4. Interesting.

    I’m not sure how your back-end works but if you’re exposing a REST service via an HTTP server you could always use intermediate gzip compression. If you use a data format where the attribute keys are made up of the column headers (ie there’s a lot of repetition) it shouldn’t be too hard to compress away most of the overhead that XML adds.
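
    As a rough illustration in Python (the payload is invented), that kind of repetitive XML compresses down to a small fraction of its original size:

      import gzip

      # Invented payload: the same element and attribute names repeated per record,
      # which is exactly the kind of redundancy gzip removes.
      xml = "<rows>" + "".join(
          f'<row id="{i}" name="item" price="9.99"/>' for i in range(1000)
      ) + "</rows>"

      raw = xml.encode("utf-8")
      compressed = gzip.compress(raw)

      print(len(raw), len(compressed))  # compressed size is a small fraction of the raw size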

    XML is really intended to be used as a human-readable storage format. If you’re just exchanging internal data, I would personally use JSON. It’s much more compact, is optimized to be machine-readable, and you could find a (de)serializer implementation in just about any language.

    I have a working implementation of a data-specific XML micro format (serializable by Python) for a proof-of-concept but haven’t publicly documented any details about it yet. Actually, I have a complete REST API service layer written for GAE (Google App Engine) that I eventually plan to document. An ASCII delimited format will probably be added at some point in the future.

    If you’re interested in hearing about it, feel free to email me or add me on G+:
    https://plus.google.com/112882755236529658404

    Here’s the link to the jquery-csv project:
    http://code.google.com/p/jquery-csv/
