POSIX Regular Expressions in EnScript and .NET

James Habben

I am sure you have spent a little intimate time with EnCase doing keyword searches, so you know that EnCase has basic GREP capabilities. This is a powerful feature that allows for searches to be performed with patterns that can eliminate false positive hits. Recently, we hosted a webinar with guest Suzanne Widup, describing some techniques and benefits of using GREP in EnCase.

GREP is a term that comes from the Unix world long ago. It stands for Globally search for a Regular Expression and Print. This command-line utility was used to search through data and print out results that matched the given pattern. Because of the popularity of the tool, the name has become synonymous with Regular Expressions (Regex). Though there is a defined standard, POSIX, the syntax of patterns used in Regex actually varies quite wildly depending on the platform, engine, and programming language being used. EnCase is no exception. In homage to our habit of prefixing our product names with “En”, I jokingly refer to our syntax of regex as “EnGrep.”

EnGrep has some limitations and differences in function. Before I show you why there is something so noteworthy as to call for a whole blog post about POSIX Regex, I would like to walk through a few of these differences. This is by no means an exhaustive list:

  • Subgroups: In Regex, a set of parentheses exposes the ability to retrieve matches that are restricted to the pattern inside those parentheses. This allows you to define a complex pattern that uses more criteria to locate and validate data, but only retrieve the parts that are relevant for results. EnGrep supports grouping characters with parentheses, but there is no mechanism for retrieving matches from within those groups.
  • Look-Arounds: This is a powerful feature that lets a pattern validate the data that precedes or follows the relevant hit without that data becoming part of the actual result. Look-arounds can often achieve the same end as subgroups, but they are usually more efficient. EnGrep does not support these.
  • Pipe Grouping: This is not a feature, but a behavior difference. Using a pattern in EnGrep such as “Habben|Key|Lukach|Mizota” would find a result of “HabbeKeyukacMizota”, since each pipe applies only to the single characters immediately on either side of it. If you were trying for complete names, you would have to modify the pattern like this “(Habben)|(Key)|(Lukach)|(Mizota)” to get the intended result. With a POSIX-compatible engine, the first pattern takes on the behavior of the second automatically without having to place the groups around each of the names.
EnGrep has done a great job for examiners over the years, but it can be a bit frustrating to programmers looking for more exact results. This is especially true if you have come from using a language that has the full capabilities of Regex available.
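
To make the three differences above concrete, here is a small sketch using Python's re module as a stand-in for a full-featured engine. It is neither EnGrep nor .NET, the sample URL (including its ia=web parameter) is made up for illustration, and the character-class pattern at the end is only an assumed reading of how EnGrep treats the pipe.

```python
import re

# Hypothetical sample URL, not taken from the article.
url = "https://duckduckgo.com/?q=encase&ia=web"

# Subgroup: the whole pattern validates the URL, group 1 holds only the query.
m = re.search(r"duckduckgo\.com/\?q=([^&\s]+)", url)
whole_match = m.group(0)
query_from_group = m.group(1)

# Look-behind: "q=" must precede the hit but is not part of the result.
query_from_lookbehind = re.search(r"(?<=[?&]q=)[^&\s]+", url).group(0)

# Alternation: each pipe separates complete alternatives, no parentheses needed.
names = re.findall(r"Habben|Key|Lukach|Mizota", "Habben, Key, Lukach, Mizota")

# Assumption: EnGrep's pipe acts only on the neighboring characters, which is
# roughly equivalent to this character-class pattern in a full engine.
engrep_like = re.compile(r"Habbe[nK]e[yL]ukac[hM]izota")
odd_hit = engrep_like.fullmatch("HabbeKeyukacMizota") is not None

print(whole_match)
print(query_from_group, query_from_lookbehind)
print(names)
print(odd_hit)
```

The subgroup and the look-behind return the same query text here; the difference is that the look-behind keeps "q=" out of the match itself rather than out of one group.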

.NET to the rescue


If you haven’t read through the previous blog post about .NET integration, why not take a few minutes now to understand how this works?

The power we are taking advantage of here comes from the .NET namespace System.Text.RegularExpressions (http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx). We just have to put together a small C# project that essentially translates the .NET API functions over to EnScript land. Here are the API functions I have gone after:

  • Matches:
    • Parameters: Input text, Regex pattern, RegexOptions
    • Returns: MatchCollection object
  • Replace:
    • Parameters: Input text, Regex pattern, Replacement text, RegexOptions
    • Returns: string
While creating this project, I discovered a bit of a limitation in the integration of .NET and EnScript. Enumerated types do not like to transfer from .NET into EnScript. I tried working with the types directly in EnScript, and I tried defining my own custom types in my project namespace, but I could not pass either one across. The full list of RegexOptions values is in the .NET documentation.

My C# code then had to do a bit more work than simply exposing the Regex API. I created two functions that accept a couple of bool parameters instead of the RegexOptions type that I couldn’t pass directly from EnScript. I chose two options that I felt would be useful. Here they are:
using System.Text.RegularExpressions;

namespace RegexLib {
    public class RegexClass {
        // Returns every match of pattern in input. The two bools stand in
        // for the RegexOptions flags that cannot be passed from EnScript.
        public static MatchCollection Matches(string input, string pattern,
                bool ignoreCase = true, bool multiLine = false) {
            RegexOptions regexOptions = RegexOptions.None;
            if (ignoreCase) regexOptions |= RegexOptions.IgnoreCase;
            if (multiLine) regexOptions |= RegexOptions.Multiline;
            return Regex.Matches(input, pattern, regexOptions);
        }

        // Returns input with every match of pattern replaced by replacement.
        public static string Replace(string input, string pattern, string replacement,
                bool ignoreCase = true, bool multiLine = false) {
            RegexOptions regexOptions = RegexOptions.None;
            if (ignoreCase) regexOptions |= RegexOptions.IgnoreCase;
            if (multiLine) regexOptions |= RegexOptions.Multiline;
            return Regex.Replace(input, pattern, replacement, regexOptions);
        }
    }
}
If you like my implementation, you don’t need to create a C# project; I’m providing the DLL to reference in your EnScripts down below.
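
For readers who want to try the bool-to-flag idea without building the DLL, the same translation can be sketched in Python. This is only an analogy under stated assumptions: the function names mirror the wrapper described above, re.IGNORECASE and re.MULTILINE play the role of the two RegexOptions values, and the sample log text is made up.

```python
import re

def matches(text, pattern, ignore_case=True, multi_line=False):
    """Return every match, translating two bools into regex flags."""
    flags = 0
    if ignore_case:
        flags |= re.IGNORECASE
    if multi_line:
        flags |= re.MULTILINE  # ^ and $ also match at line breaks
    return list(re.finditer(pattern, text, flags))

def replace(text, pattern, replacement, ignore_case=True, multi_line=False):
    """Return text with every match of pattern replaced."""
    flags = 0
    if ignore_case:
        flags |= re.IGNORECASE
    if multi_line:
        flags |= re.MULTILINE
    return re.sub(pattern, replacement, text, flags=flags)

# Hypothetical sample input, not from the article.
log = "GET /index\nget /about\nPOST /login"

first_only = [m.group(0) for m in matches(log, r"^get \S+")]
per_line = [m.group(0) for m in matches(log, r"^get \S+", multi_line=True)]
replaced = replace(log, r"^get", "FETCH", multi_line=True)

print(first_only)
print(per_line)
print(replaced)
```

With multi_line left at its default, ^ anchors only at the start of the whole text, so only the first line can match. Turning it on lets each line be tested separately.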

Here comes the fun part


You need to place the DLL file somewhere (I place it beside the EnScript for ease) and then put a line at the top of your EnScript.
assembly "RegexLib.dll"
Now compile the EnScript and you’ll find the new Regex functions in the class browser.



Define an object to collect the matches:
System::Text::RegularExpressions::MatchCollection matches;
Because I defined the functions as static, you call them like this:
matches = RegexLib::RegexClass::Matches(...);
Then you can list the matches in the collection like this:
foreach (System::Text::RegularExpressions::Match ma in matches) {
  Console.WriteLine("match[{0}]: {1}", ma.Index(), ma.Value());
}
Let’s put this all together and create a scenario so we have something to search for. You have parsed internet history from a suspect drive. He is accused of being a weirdo that actually enjoys Windows 8. He is also a privacy nut, so he uses duckduckgo.com as his primary search engine. Our task is to give a quick report of the searches performed on this computer.

We could build an EnGrep pattern to find the appropriate URL matches, but we’re going to highlight the power of sub matches to display data that is easier to read. Here’s what the target URL format looks like:
https://duckduckgo.com/?q=encase
Like most search engines, DuckDuckGo uses the q= parameter for the queries typed by the visitor. Query string values are delimited with & signs, and URLs are terminated with whitespace. Now that we are using POSIX Regex, we have some additional symbols like \S for non-whitespace characters and \s for whitespace characters. The single pair of parentheses forms the group that captures the query text. This should do the trick:
https?://duckduckgo.com\S+q=([^&\s]+)\S*
In the attached demonstration EnScript, I’ve defined a list of URL values using NameListClass objects so they would be similar to iterating through URL history records that have been parsed out by the Evidence Processor. Remember that when you want to put a backslash into EnScript source code, you have to escape it.
System::Text::RegularExpressions::MatchCollection matches;
matches = RegexLib::RegexClass::Matches(url.Name(), 
"https?://duckduckgo.com\\S+q=([^&\\s]+)\\S*", true, false);
Since I have used sub groups, we have to get a bit fancier in the code to access the value of the query and pull it out from the rest of the URL. Each Match in the MatchCollection exposes its groups through Groups(). The Item() method allows us to address a single group from that collection. Group #0 is always going to be the whole match. I have only one group defined inside the pattern, so I know that group #1 will be the query text that we are looking for.
foreach (System::Text::RegularExpressions::Match ma in matches) {
  Console.WriteLine("Search Phrase: {0}", ma.Groups().Item(1).Value());
  Console.WriteLine("     Full URL: {0}", ma.Value());
}
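
The backslash escaping and the group numbering can both be checked quickly outside EnCase. The sketch below uses Python's re module rather than EnScript, and the sample text with its ia=web parameter is hypothetical. An ordinary Python string literal needs the same doubled backslashes as EnScript source, while a raw string does not.

```python
import re

# The article's pattern, written two ways. Both spell the same regex.
escaped = "https?://duckduckgo.com\\S+q=([^&\\s]+)\\S*"
raw = r"https?://duckduckgo.com\S+q=([^&\s]+)\S*"
same_pattern = escaped == raw

# Hypothetical sample text, not from the article.
m = re.search(raw, "visited https://duckduckgo.com/?q=encase&ia=web today")
full_url = m.group(0)     # group 0: the whole match
search_term = m.group(1)  # group 1: the text inside the parentheses

print(same_pattern)
print(full_url)
print(search_term)
```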
The last bit before we finish up is to remove those + signs that make the queries a bit more difficult to read. This will take care of that:
String CleanTerms (String input) {
  input.Replace("+", " ");
  return input;
}
So the final result in the console looks like this:
Search Phrase: man eating cockroaches
     Full URL: https://duckduckgo.com/?q=man+eating+cockroaches
Search Phrase: what seasoning to cook monkey brains
     Full URL: https://duckduckgo.com/?q=what+seasoning+to+cook+monkey+brains
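
The whole scenario can also be sketched end to end in Python, which may help when prototyping a pattern before moving it into EnScript. This is an illustration only, not the attached EnScript: the helper names are made up, the history list is a stand-in for parsed records, and the example.com entry is a hypothetical non-matching URL.

```python
import re

# Same pattern as in the article; the parentheses capture the q= value.
PATTERN = re.compile(r"https?://duckduckgo.com\S+q=([^&\s]+)\S*", re.IGNORECASE)

def clean_terms(text):
    """Turn the + separators used in query strings back into spaces."""
    return text.replace("+", " ")

def search_phrases(history):
    """Return (phrase, full_url) pairs for every matching URL."""
    results = []
    for url in history:
        for m in PATTERN.finditer(url):
            results.append((clean_terms(m.group(1)), m.group(0)))
    return results

# Stand-in for parsed internet history records.
history = [
    "https://duckduckgo.com/?q=man+eating+cockroaches",
    "https://duckduckgo.com/?q=what+seasoning+to+cook+monkey+brains",
    "https://www.example.com/index.html",
]

for phrase, url in search_phrases(history):
    print("Search Phrase:", phrase)
    print("     Full URL:", url)
```

Only the + separators are handled here, matching the cleanup described above. Percent-encoded characters in a query would need separate decoding.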
Download the example EnScript and .NET DLL file here.

There is a downside to the Regex available through the .NET libraries. The functions only accept string input. That means we can use Regex when looking for text-based patterns, but EnGrep still reigns as king when searching for patterns like file headers.

Do you have an idea for using POSIX Regex in EnScript? Let me know in the comments or on Twitter.

James Habben
@JamesHabben
