Advanced String Splitting

March 4th, 2008 – 1:08 pm

I recently had a project where I had to run some fancy regex and recursion on strings written as Excel-style formulas. It was an enjoyable project, as I don't often get the opportunity to use regex or recursion in Flex. However, one step in the process required what I thought should have been a relatively simple task: take a string of values and split it on a comma delimiter.

Under normal circumstances, the solution should be as simple as:

Actionscript:
  var arr:Array = str.split(",");

However, this fails to account for nested delimiters. For example, A,B,(C,D) would split into the four pieces "A", "B", "(C" and "D)", instead of the desired three: "A", "B" and "(C,D)". Standard string splitting is indiscriminate: it cuts at every comma, regardless of parentheses or any other grouping operators.
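To make the failure concrete, here is a tiny sketch. It's written in TypeScript rather than ActionScript purely so it's easy to run; the split call behaves the same way for this input, and the function name naiveSplit is made up for illustration.

```typescript
// Naive split: every comma is treated as a delimiter,
// even the one inside the parentheses.
function naiveSplit(str: string): string[] {
  return str.split(",");
}

const naiveParts: string[] = naiveSplit("A,B,(C,D)");
console.log(naiveParts.length); // 4 pieces instead of the 3 we wanted
console.log(naiveParts);        // the group is torn into "(C" and "D)"
```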

My next attempt was to solve this using regex. This worked for a little while, until I ran into yet another challenge: a greedy pattern fails in the presence of sibling groups (it matches from the first open parenthesis all the way to the last close parenthesis), while a lazy pattern fails in the presence of nested groups (it stops at the first close parenthesis it finds). What we need here is something far more specific: we need to count parentheses. This is the solution I came up with:
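As a rough illustration of the greedy/lazy problem, the sketch below uses two deliberately simple patterns. These are stand-ins I'm assuming for the sake of the example, not the actual expressions from the project, and it's TypeScript rather than ActionScript so it can be run as-is.

```typescript
// Greedy: ".*" runs to the LAST ")", so two sibling groups
// (and everything between them) come back as one match.
function firstGroupGreedy(s: string): string | null {
  const m = s.match(/\(.*\)/);
  return m ? m[0] : null;
}

// Lazy: ".*?" stops at the FIRST ")", so a group that contains
// another group is cut short.
function firstGroupLazy(s: string): string | null {
  const m = s.match(/\(.*?\)/);
  return m ? m[0] : null;
}

console.log(firstGroupGreedy("(A,B),C,(D,E)")); // "(A,B),C,(D,E)"
console.log(firstGroupLazy("((A,B),C),D"));     // "((A,B)"
```

Neither pattern can tell which close parenthesis belongs to which open one, which is why a depth counter ends up being the simpler tool.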

Actionscript:
  public function split(args:String):Array{
    var arr:Array = [];
    var i:int = 0; // start index of the piece currently being collected
    var d:int = 0; // current parenthesis depth
    var posOP:int = args.indexOf("(");
    var posCP:int = args.indexOf(")");
    var posCM:int = args.indexOf(",");

    while(posOP >= 0 || posCP >= 0 || posCM >= 0){

      // check to see if there's an open parenthesis without a close parenthesis
      if(posOP >= 0 && posCP < 0) throw new Error("syntax error: missing )");

      // if comma exists and is closest, check the depth
      if(posCM >= 0 && (posOP < 0 || posCM < posOP) && (posCP < 0 || posCM < posCP)){

        // if depth is 0, then split string
        if(d == 0){
          arr.push(args.substring(i, posCM));
          i = posCM+1;
        }
        posCM = args.indexOf(",", posCM+1);

      // if open parenthesis exists and is closest, increment depth
      }else if(posOP >= 0 && posOP < posCP){
        d++;
        posOP = args.indexOf("(", posOP+1);

      // else decrement depth
      }else{
        d--;
        if(d < 0) throw new Error("syntax error: found ) before (");
        posCP = args.indexOf(")", posCP+1);
      }
    }

    // a group that is still open here never got its close parenthesis, e.g. "((A)"
    if(d > 0) throw new Error("syntax error: missing )");

    arr.push(args.substring(i));
    return arr;
  }

The only characters I'm interested in are the delimiter and the open/close grouping operators. (If you want to make this function a little more adaptable, you can easily swap the hard-coded ",", ")", and "(" values for parameters passed into the function.) All I'm doing is locating the next position of each of the three characters and figuring out which one comes first. Based on a simple depth counter, I can determine whether a comma lives outside of all the groups in the expression, and either use it as a delimiter or ignore it. As an added bonus, since I'm counting parentheses anyway, I can also make sure the syntax is written correctly, and throw errors when something unexpected comes up.

Anyway, since I think this is pretty useful and very self-contained, I decided to throw it up here to share. Let me know if it helps you out!
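For the "more adaptable" variant mentioned in the parenthetical above, a parameterized version might look something like this. It's a sketch in TypeScript rather than ActionScript (so it can be run as-is), the names splitTopLevel, delim, open and close are hypothetical, and it assumes each of the three markers is a single character. It also checks the depth once more after the loop, so an input like "((A)" is reported as missing a close parenthesis.

```typescript
// Hypothetical parameterized port; assumes delim, open and close are
// single characters (hence the "+ 1" when searching for the next one).
function splitTopLevel(args: string, delim = ",", open = "(", close = ")"): string[] {
  const parts: string[] = [];
  let start = 0; // start index of the piece currently being collected
  let depth = 0; // number of groups currently open
  let posOpen = args.indexOf(open);
  let posClose = args.indexOf(close);
  let posDelim = args.indexOf(delim);

  while (posOpen >= 0 || posClose >= 0 || posDelim >= 0) {
    // an open marker with no close marker left can never be balanced
    if (posOpen >= 0 && posClose < 0) throw new Error("syntax error: missing " + close);

    const delimIsNext =
      posDelim >= 0 &&
      (posOpen < 0 || posDelim < posOpen) &&
      (posClose < 0 || posDelim < posClose);

    if (delimIsNext) {
      // only a delimiter outside of every group splits the string
      if (depth === 0) {
        parts.push(args.substring(start, posDelim));
        start = posDelim + 1;
      }
      posDelim = args.indexOf(delim, posDelim + 1);
    } else if (posOpen >= 0 && posOpen < posClose) {
      depth++;
      posOpen = args.indexOf(open, posOpen + 1);
    } else {
      depth--;
      if (depth < 0) throw new Error("syntax error: found " + close + " before " + open);
      posClose = args.indexOf(close, posClose + 1);
    }
  }

  // groups still open after the scan were never closed
  if (depth > 0) throw new Error("syntax error: missing " + close);

  parts.push(args.substring(start));
  return parts;
}

console.log(splitTopLevel("A,B,(C,D)"));                // ["A", "B", "(C,D)"]
console.log(splitTopLevel("x;[y;z];w", ";", "[", "]")); // ["x", "[y;z]", "w"]
```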

One more thing: keep in mind this function is only designed to handle depth-0 delimiters, so if you need to build a nested list of values, you'd have to run it recursively on the contents of each individual group.
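A recursive pass along those lines could be sketched roughly as follows. Again this is TypeScript rather than ActionScript, and every name in it (Nested, splitDepthZero, parseNested) is hypothetical. To keep the block self-contained it uses a compact character-by-character splitter as a stand-in for the function above, and it assumes a group always occupies a whole piece, so something like SUM(A,B) is left as a plain string and a piece like (A)(B) is rejected as a syntax error.

```typescript
// A nested result: each item is either a plain string or a list of items.
type Nested = string | Nested[];

// Stand-in depth-0 splitter: one pass over the characters with a depth counter.
function splitDepthZero(s: string): string[] {
  const parts: string[] = [];
  let depth = 0;
  let start = 0;
  for (let k = 0; k < s.length; k++) {
    const ch = s.charAt(k);
    if (ch === "(") {
      depth++;
    } else if (ch === ")") {
      depth--;
      if (depth < 0) throw new Error("syntax error: found ) before (");
    } else if (ch === "," && depth === 0) {
      parts.push(s.substring(start, k));
      start = k + 1;
    }
  }
  if (depth > 0) throw new Error("syntax error: missing )");
  parts.push(s.substring(start));
  return parts;
}

// Split at depth 0, then recurse into every piece that is wrapped in
// one pair of parentheses, using the text between them as the new input.
function parseNested(s: string): Nested[] {
  return splitDepthZero(s).map((piece): Nested => {
    const isGroup =
      piece.length >= 2 &&
      piece.charAt(0) === "(" &&
      piece.charAt(piece.length - 1) === ")";
    return isGroup ? parseNested(piece.substring(1, piece.length - 1)) : piece;
  });
}

console.log(JSON.stringify(parseNested("A,B,(C,(D,E)),F")));
// ["A","B",["C",["D","E"]],"F"]
```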

2 Comments

  1.

    If you need to get more complex, ANTLR has recently added Actionscript support:
    http://antlr.org/

    Here’s a great introduction:
    http://supportweb.cs.bham.ac.uk/docs/tutorials/docsystem/build/tutorials/antlr/antlr.html

    Comment made by Dusty Jewett on March 5, 2008 @ 3:18 pm

  2.

    Looks pretty comprehensive, thanks Dusty!

    Comment made by Steve on March 5, 2008 @ 11:25 pm

