Regex is inevitable! Maybe it’s not the tool for everyday use, but I am finding myself reaching for it a lot more than usual lately. There were two main ways I was getting my regex:

  1. For complex regex like emails and password validation, stackoverflow.com has many posts.
  2. For a more specific regex, an LLM usually handles the job.

And that was it! Never really thought about it much more than that because I had moved on to the next task. However, it occurred to me recently that starting to learn it could prove to be an interesting and helpful side quest on my software writing journey!

Behold: A side quest! ⚔️

During my normal workday with Playwright .NET RPA, I encounter a lot of situations where I have an ID or number that I need to match to another in the form of strings (since they either come from HTML or JSON). In this particular instance, the number to match came from an API as “001”, but in the other system, grabbing records via RPA, it was “EXT001”. I smell a REGEX quest! This was the perfect opportunity, I felt, to learn and effectively start writing my own Regex.

The Initial Problem


The task is straightforward: extract the number '001' from 'EXT001' and '100' from 'TRN100'. But to play it safe, I spent some time working through RegexOne’s interactive lessons to understand the syntax and patterning. Then I started to experiment on https://regex101.com/ (which I highly recommend, it’s great):

  • For 'EXT001', I settled on @"^EXT(\d+)$".
  • For 'TRN100', I similarly settled on @"^TRN(\d+)$".

These patterns are great, and here’s how they work:

  • ^EXT or ^TRN: Matches the literal string 'EXT' or 'TRN' at the start.
  • (\d+): Captures one or more digits into a group.
  • $: Anchors the match to the end of the string, so combined with ^ the pattern has to cover the entire string.
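Here is roughly how that breakdown looks in C#. This is a minimal sketch, assuming a console app with top-level statements; ExtractDigits is a hypothetical helper name I made up for illustration, not something from the real project.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for this sketch: returns group 1, or "" when nothing matches.
static string ExtractDigits(string input, string pattern)
{
    Match match = Regex.Match(input, pattern);
    // Groups[0] is the whole match; Groups[1] is the first capturing group.
    return match.Success ? match.Groups[1].Value : "";
}

Console.WriteLine(ExtractDigits("EXT001", @"^EXT(\d+)$")); // 001
Console.WriteLine(ExtractDigits("TRN100", @"^TRN(\d+)$")); // 100
Console.WriteLine(ExtractDigits("EXT00A", @"^EXT(\d+)$")); // empty line: the trailing 'A' breaks \d+$
```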

These patterns reliably extract '001', '111', '123', etc. Great, time to turn in the quest, right? …right? Why are you shaking your head? 🥲

Curveball: The Side Quest Continues

Just when I thought I had it figured out and got it deployed in production:

“EXTINT-001” joined the chat...

Super cool! Let’s get this bread!! Now, I needed to extract 'INT-001', but also STILL '001'. And @"^EXT(\d+)$" was not gonna cut it for the new 'INT-' part.

First Attempt: Missing the Mark

csharp
@"^EXT(?:[A-Z]{3}-)?(\d+)$"

This was my first attempt to handle 'EXTINT-001'. It has some new pieces I had been experimenting with on regex101.com:

  • {}: Lets you define an exact count of characters or a range, e.g. {3,6}.
  • (?:)?: A non-capturing group followed by ?, so it matches optionally and does not create a capture group.

Neat! But it doesn’t work. Why?

This pattern only captures '001' from 'EXTINT-001', and excludes the 'INT-' part, because the capturing group only included the digits. I needed the entire 'INT-001' in one group. Whack.
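A quick way to see the miss is to print what actually lands in each group. This is a small sketch, assuming a console app with top-level statements:

```csharp
using System;
using System.Text.RegularExpressions;

var match = Regex.Match("EXTINT-001", @"^EXT(?:[A-Z]{3}-)?(\d+)$");

Console.WriteLine(match.Success);          // True
Console.WriteLine(match.Groups.Count);     // 2: the whole match plus one capturing group
Console.WriteLine(match.Groups[0].Value);  // EXTINT-001
Console.WriteLine(match.Groups[1].Value);  // 001 ('INT-' was matched but never captured)
```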

Second Attempt: No dice

csharp
@"^EXT(?:([A-Z]{3}-\d{3})|\d{3})$"

My next attempt. This pattern uses the alternation operator to match either 'INT-001' or '001':

  • ([A-Z]{3}-\d{3}): Captures three letters, a hyphen, and three digits.
  • |\d{3}: Alternatively matches three digits.

This worked for 'EXTINT-001', capturing 'INT-001', but I forgot about the original '001' for 'EXT001'. The plain \d{3} branch sits outside the capturing group, so 'EXT001' matches but group 1 comes back empty. And since 'EXT001' could have a variable number of digits, requiring exactly three made this pattern too restrictive anyway.
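To make the failure concrete, here is a small sketch that runs the second attempt against three inputs. 'EXT0001' is a hypothetical four-digit value I added to show the fixed-width problem; it is not from the real data.

```csharp
using System;
using System.Text.RegularExpressions;

string pattern = @"^EXT(?:([A-Z]{3}-\d{3})|\d{3})$";

foreach (var input in new[] { "EXTINT-001", "EXT001", "EXT0001" })
{
    Match m = Regex.Match(input, pattern);
    // An unmatched or missing group yields an empty Value rather than throwing.
    Console.WriteLine($"{input}: matched={m.Success}, group1='{m.Groups[1].Value}'");
}
// EXTINT-001: matched=True, group1='INT-001'
// EXT001: matched=True, group1=''
// EXT0001: matched=False, group1=''
```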

Third Attempt: Getting Closer

csharp
@"^EXT((?:[A-Z]{3}-)?(\d+))$"

Third time's a charm, sometimes. This pattern was closer to the solution:

  • (?:[A-Z]{3}-)?: Optionally matches three letters and a hyphen.
  • (\d+): Captures the digits.
  • The outer (): Captures the entire group, including the optional 'INT-'.

This pattern worked but created two capturing groups, which added unnecessary complexityyyy: one for 'INT-001' and another for '001'. I wanted a single group that captured the entire desired output, whether it was '001' or 'INT-001', so I could grab the value like you see in the example below.

Ex.

csharp
// Groups[1] is the outer capturing group: the full code after 'EXT'.
var match = Regex.Match(rpa_obj.CMM_CODE, @"^EXT((?:[A-Z]{3}-)?(\d+))$");
var num = match.Groups[1].Value;
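For a look at the extra group the third attempt leaves lying around, here is a small sketch with a literal input standing in for rpa_obj.CMM_CODE (the stand-in value is my assumption):

```csharp
using System;
using System.Text.RegularExpressions;

var match = Regex.Match("EXTINT-001", @"^EXT((?:[A-Z]{3}-)?(\d+))$");

Console.WriteLine(match.Groups.Count);     // 3: whole match, outer group, inner digits group
Console.WriteLine(match.Groups[1].Value);  // INT-001 (the value I actually want)
Console.WriteLine(match.Groups[2].Value);  // 001 (the leftover inner group)
```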

Final Solution: Let’s goooo

csharp
@"^EXT((?:[A-Z]{3}(?:[-])?)?\d+)$"

After several iterations and extensive testing on regex101.com, we made it y’all! The final pattern! It successfully handles both 'EXT001' and 'EXTINT-001' and even ‘EXTINT001’:

  • ^EXT: Matches the literal 'EXT' at the start.
  • (: Starts a capturing group for the desired output.
  • (?:: Starts a non-capturing group.
    • [A-Z]{3}: Matches exactly three uppercase letters (e.g., 'INT').
    • (?:[-])?: Optionally matches a hyphen.
  • )?: Makes the entire non-capturing group optional.
  • \d+: Matches one or more digits.
  • ): Ends the capturing group.
  • $: Anchors the match to the end of the string, so combined with ^ the pattern has to cover the entire string.
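Putting the pieces above together, this is one way the final pattern could be wrapped for reuse. It is a sketch under assumptions: TryExtractCode is a hypothetical helper name, and returning false on a non-match is my choice, not necessarily how the production code handles it.

```csharp
using System;
using System.Text.RegularExpressions;

// Build the Regex once and reuse it for every record.
var codePattern = new Regex(@"^EXT((?:[A-Z]{3}(?:[-])?)?\d+)$");

// Hypothetical helper: true plus the captured code on a match, false otherwise.
bool TryExtractCode(string input, out string code)
{
    Match match = codePattern.Match(input);
    code = match.Success ? match.Groups[1].Value : "";
    return match.Success;
}

foreach (var input in new[] { "EXT001", "EXTINT-001", "EXTINT001", "EXTINTZZZ-4-1.OK" })
{
    Console.WriteLine(TryExtractCode(input, out var code)
        ? $"{input} -> {code}"
        : $"{input} -> no match");
}
```

Checking Success before reading Groups[1] keeps an unexpected format from quietly turning into an empty string. Side note: (?:[-])? behaves the same as a plain -? here, so the pattern could be trimmed a little.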

Testing the Pattern


Again, to verify the pattern, I used regex101.com (with TRN swapped in for EXT on the TRN100 row, since ^EXT on its own won’t match that one), and here’s how it performed:

Input      | Captured Group | Expected Output | Match?
EXT001     | 001            | 001             | Yes
TRN100     | 100            | 100             | Yes
EXTINT-001 | INT-001        | INT-001         | Yes
EXTINT001  | INT001         | INT001          | Yes
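The same table can be replayed in code. This is a sketch only: BuildPattern is a hypothetical helper that swaps the literal prefix into the final pattern, which is my assumption about how the TRN row would be handled.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: the final pattern with the literal prefix swapped in.
static string BuildPattern(string prefix) =>
    "^" + Regex.Escape(prefix) + @"((?:[A-Z]{3}(?:[-])?)?\d+)$";

// Rows from the table: (prefix, input, expected capture).
var cases = new (string Prefix, string Input, string Expected)[]
{
    ("EXT", "EXT001", "001"),
    ("TRN", "TRN100", "100"),
    ("EXT", "EXTINT-001", "INT-001"),
    ("EXT", "EXTINT001", "INT001"),
};

foreach (var (prefix, input, expected) in cases)
{
    string captured = Regex.Match(input, BuildPattern(prefix)).Groups[1].Value;
    Console.WriteLine($"{input}: captured '{captured}', expected '{expected}', ok={captured == expected}");
}
```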

The trick was realizing that the digits were ALWAYS present, just like the ‘EXT’. The other aspects are optional, and as such, non-capturing groups are perfect. Now watch something like this pop up in prod next lol

“EXTINTZZZ-4-1.OK”

Lessons Learned

This was a great side quest, and I’m really happy with the new bits I learned regarding regex in .NET:

  1. Optional Groups: The ? quantifier makes a group optional, as seen in (?:[A-Z]{3}(?:[-])?)?.
  2. Non-Capturing Groups: Using (?:...) groups part of the pattern without creating an extra capture, so the one group you actually want stays clean.
  3. Testing is Crucial: Tools like regex101.com are essential for debugging outside your code and understanding how patterns behave.
  4. Avoiding The Give-Up-and-Ask-ChatGPT Approach: This took me multiple attempts to get right. Each failed pattern taught me something new, but I kept experimenting, and you can too!

Resources

Remember to commit early and often, see ya next time! 🤙