Digital Forensics Today Blog: Python

James Habben with Chet Hosmer

I discovered something very cool this year at CEIC: people actually read my blog posts! The realization came when I found out there were two sessions focusing on Python, and both of them talked about my #en2py techniques that I presented in this blog last year.

One of the sessions, Heuristic Reasoning with Python and EnCase, was presented by the Python forensics guy, Chet Hosmer of python-forensics.org. I got a chance to chat with him after his session, and the discussion led to what you are about to read. Chet has a number of Python scripts that can make a difference in forensic cases, and we decided a joint blog post would be a fun way to touch on the integration between EnCase and Python with another technique. This will be a two-part post with the first part focusing on getting the files out. The second will get some fancy on it by putting a GUI on the front to accept options in the processing. I will now let Chet explain the benefits of his work.

Function and Benefits of Heuristic Reasoning

Applying heuristics during deep-dive investigation allows us to apply rules of thumb during the process. In order to bring this to light, we chose to integrate a Python script that performs “what I call” heuristic indexing of binary files. Binary files like memory snapshots, executable files and photo graphic images have ASCII text embedded with the binary data. Extracting these “text sequences or remnants” and then making sense of them can be a challenge.

The issue with traditional approaches like dictionary comparisons or keyword lists, is the occurrence of misspelled words, slang, technical jargon, malware strings, filenames, and function names. These can all be missed because they are not in the dictionary or keyword list, an example is shown in the Casey Anthony investigation. Another traditional approach would be to report on all “text sequences or remnants” this can results in a voluminous number of nonsensical meaningless text strings that can overwhelm investigators.

My approach (originally outlined in my text, Python Forensics) uses a set of 400,000 common English words, (loosely a mini corpus of words) to generate a weighted heuristic model. I have since created additional models for medical and pharmaceutical domains and I’m working on common words used within text messages.

Using Python, I load the specific weighted heuristics into a Set. Then during the process of extracting “text sequences or remnants” from the binary file(s), the same algorithm is applied to each extracted sequence as was used to build the weighted heuristics. The calculated heuristic is then used as a lookup value. If the value is found in the loaded weighted set, then the word is considered probable and reported. One other final step I should mention…. most languages have what are referred to as “stop words” such as, (whenever, always, another, elsewhere etc). English is no exception. These stop words are filtered from the final list as they typically have little probative value. Each identified word that passes these filters is stored in a dictionary, one of the great built-in data structures within Python. Dictionaries are key, value pairs, in this case the key is the probable word string and the value is the number of times the word is discovered. This allows me to then produce a resulting list of probable words either sorted alphabetically or by frequency of occurrence.

Therefore, the bottom line benefits of heuristic indexing include:

Accurate identification of a broad set of probable words from binary data
Slang, technical jargon, filenames, misspelled words are also identified
Strings that represent nonsense strings are filtered out
Common stop words are ignored
The frequency of words found or alphabetical results are possible
New weighted heuristic models can be created

In order to apply this method more broadly to a case instead of a single file, we needed a method to allow EnCase (via an EnScript), to export multiple selected files to be processed by the Python script. I turned to James, the EnScript Guru for help.

Method of Choosing Files

In my previous posts, I used a simple technique in EnScript to send the highlighted file out from EnCase to the local disk to allow for a Python script to access the data. This works great for Python scripts that are designed to process one file at a time, but it is not very efficient for the examiner when that one file has not been pinpointed yet. There are many Python scripts out there that are designed to process a whole set of files in a designated folder.

In another post, I looped through files in the case, but I was targeting certain filenames known to contain evidence from Windows 8 Phone apps. The structure there is similar to what I have here, but the interaction with Python is the difference.

Chet and I talked at CEIC about how to do exactly this in EnScript, and came to the conclusion that the rest of the world should know about this as well! OK, maybe not the world, but I’m sure you appreciate that we didn’t keep this buried in some dark closet somewhere.

I have talked about ItemIteratorClass before, but it was in a simple post about the changes in EnScript from EnCase v6 to v7. This is the class that gives us access to all of the files in the case. There are a lot of options explained in that post, so I won’t drag it out here. The mode we will focus on is CURRENTVIEW_SELECTED, which will give us a collection of the files that the examiner has blue-checked in the EnCase interface before running the EnScript.

Because we are processing multiple files, the execution of the Python script needs to happen once the loop is complete. The loop will be doing the work of identifying selected files and exporting them to the disk.

EnScript Walkthrough

The usage of ItemIteratorClass starts off with setting some values in variables. I defined these as global variables for reasons you will see in part 2. The mode I chose here allows an examiner to blue-check any number of files in EnCase, and send this collection to the EnScript for export.

The NOPROXY is used because I am not looking to get any hashes calculated and it speeds up the loop. The NORECURSE option is also used to speed up the loop. With the mode using the current view, the recursing into compound files isn’t possible, anyway.

Then we enter into the loop to find all of the files. There's a fairly bulky chunk of code here, but it has a purpose behind it. When you are dealing with files from evidence, you are potentially pulling files from folders all over the drive. Chances are good you will find a couple files with the same name. On line 22, I am using a GUID that is generated by EnCase and is unique inside the evidence file. Lines 20-23 all together are modifying the filename to include this GUID, but also retain the same extension for identity.

There is a little irritation that pops up when you use any of the modes focusing on the current view. It locks that view in EnCase for the examiner running the EnScript while the iterator is active. Line 31 happens immediately after the looping export code, and this clears the iterator to release the view for the examiner while Python does its thing. Little things matter!

Depending on the Python script you are using and the amount of data you are processing, you may have to adjust the timeout value on line 41. If this value is not large enough, the output from Python will be either missing or cut short.

You're getting a two-for-one deal in this joint blog post, because now Chet is going to explain some Python code now. (I don’t want to read any complaints about the length of this post!)

Python Walkthrough

The overview of the Python script is shown in the figure below:

The Script employs a Heuristic Model created from one or more word dictionaries. Dictionaries and vernaculars can be expanded through the training of the model. The Heuristic Indexer receives selected file(s) from EnCase and then extracts possible word strings from each of the files. Heuristics are calculated for each extracted string and then compared against the Heuristic Model. The result is a report that is delivered back to EnCase.

For Part I of the blog I want to focus on the primary integration between James’ EnScript and the Python Heuristic Indexer.

The main entry point for the Python Script prints out some information messages and then obtains the path and individual filenames exported by the EnScript by parsing the command line arguments. Then for each file found, the IndexAllWords() function is called to perform the string extraction and subsequent Heuristic analysis. I have highlighted the key lines of the Python script.

Python Main Entry Point

# Main program for pyIndex

if __name__ == "__main__":

# Print Script Basics
print "\nHeuristic Indexer v 1.1 CEIC 2015"
print "Python Forensics, Inc. \n"

print "Script Started", str(datetime.now())
print

# Obtain the arguments passed in by the Enscript
# In Phase I the only argument passed is
# path where the EnScript copied the selected files

targetPath = ParseCommandLine()

print "Processing EnCase Target Path: ", targetPath
print

# using the targetPath, obtain a list of filenames
# using the Python os module

targetList = os.listdir(targetPath)

# Creating an object to process
# probable words
# the matrix.txt file contains heuristic model

wordCheck = classWordHeuristics("matrix.txt")

# Now we can iterate through the list of files
# Calling the IndexAllWords() function for each
# file. The IndexAllWords() performs the word
# extraction, heuristic processing and reports
# results back to EnCase via Standard Out

for eachFile in targetList:

fullPath = os.path.join(targetPath, eachFile)
print "####################################"
print "## Processing File: ", eachFile
print "####################################\n"

IndexAllWords(fullPath, wordCheck)

print "Script Ended", str(datetime.now())
print

# Script End

Results: So What Do I Get From All of This?

Here is a screen shot and an abbreviated excerpt from an actual EnCase / Python marriage.

Closing Thoughts

James: This was a new (and exciting) opportunity for me to have a guest author in a joint post. I am so happy to hear that my #en2py techniques have helped others. EnCase is a powerful platform on its own, but enhancing it with the libraries available in other languages and tools just makes everything that much better for examiners. I hope you find this useful and thanks for taking the time to read through this!

Chet: The catalyst behind Python Forensics, Inc. is to create a collaborative environment for the rapid development of new investigative scripts that can directly benefit investigators. I hope this blog will get you interested in developing and/or using EnScripts and Python in your next endeavor. I would like to thank James for his enthusiasm for the project and I look forward to Part II.

The Final Details

Download the EnScript here.
Download Chet's Python script here.
Look for an email invitation and announcements on Twitter about an upcoming webinar we're planning with Chet called, "EnCase and Python: Extending Your Investigative Capabilities."

Chet Hosmer
@PythonForensics
Founder of python-forensics.org

James Habben
@JamesHabben
Master Instructor at Guidance Software

James Habben

As a co-author and instructor for Guidance Software’s EnScript Programming course, I spend a lot of time teaching investigators in person around the globe. Investigators are faced with a dizzying variety of challenges. We work together in class, coming up with solutions that send EnCase off to do our bidding. EnCase and EnScript allow us to “bottle” the result of our efforts to share with other investigators (e.g. categorizing internet history, detecting files hidden by rootkits).

Python is used similarly. The interweb hosts great tools written in Python to accomplish all measures of tasks facing DFIR examiners. The community benefits from the hours of work that go into each and every .py that gets baked. It seemed to me that there should be a way for EnCase and Python to work together, so I put together a brief tutorial.