Tuesday, 27 August 2013

A Powershell tool for deleting lines in a csv file based on another csv file

A Powershell tool for deleting lines in a csv file based on another csv file


The other day one of the daily processes that we have to do as a part of our IT Operations failed spectacularly. You see we have to upload a csv file to a transaction processing system so that it could maintain customer information just in case our primary system goes offline.  The problem that we had was that when we tried to upload the file to this system it failed because there were duplicate records already loaded into the database. By the way the systems are dissimilar systems and so some form of replication or log shipping would not have worked in this case. 



This is something that we had seen before and to fix this problem we would run a query against the database to find the duplicate records and then delete the one or two duplicate lines from the file. However this time when we ran the query instead of getting one or two records it returned a few hundred records. This certainly was not good news for us, because there was no way that we could delete a few hundred duplicate lines from the file in time to complete the file upload. Or was there?



We decided that our best option at that point was to find away to automate the deletion of the duplicate lines from the source file and to do that we turned to Powershell. A quick google and we came across this posting on the StackOverflow website that demonstrated how to delete duplicate lines present in one csv file from another csv file. We created our script based on the answer provided by Graimer.


Long story short we were able to delete the duplicate lines from the source file quite easily and then we were able to successfully upload the cleaned file into our transaction processing system. PHEW! Powershell to the rescue once again. However the story does not end there because we knew that we could face this problem again and we decided to turn the script into a reusable tool that could be used by any member of our team, and this post describes that process.



Essentially what we want to do is to convert these few lines of Powershell script in Figure 1 into a full fledged GUI tool in Figure 2.


 
Figure 1: The script used to remove rows present in one cvs file from another csv file.


Figure 2: eVeto UI


To move from the basic script to the GUI based tool we must first define the specifications of the tool:


  • User interface (UI) – The IT Operations staff require an easy-to-use UI to quickly create the cleaned file for uploading.
  • Automated formatting -  The tool must ensure that the exception file is in the correct format before it is used as the basis for deleting duplicate lines in the source file.
  • Progress and error reporting – The tool must provide feed back to the user on the outcome of the execution of a task.


The UI tool consist of three Powershell scripts and these are:


  • Start-eVeto.ps1 – This script is used to start the eVeto UI tool. It uses the Start-Process cmdlet to create a new instance of Powershell that launches the eVeto UI.
  • Initialize-eVetoUI.ps1 – This script actually creates the eVeto UI from and XAML file and ties the named controls events to the functions in the Invoke-eVetoUIEvents.psm1 module.
  • Invoke-eVetoUIEvents.psm1 – This module provides the functions used by the eVeto UI tool.



The eVeto tool can be downloaded from this link.



Rendering the UI



The UI will be based on the Windows Presentation Foundation (WPF) technology. To design the interface we will use the freely downloadable Microsoft Visual Studio Express 2012 for the Windows Desktop. Visual Studio Express 2012 will generate an XAML file named MainWindow.xaml for the UI that we create. This XAML file is actually based on XML and we will use the Get-Content cmdlet to read the file into Powershell as an XML data object. The next thing that we will do is to create a System.XML.XMLNodeReader object to access the XML object that was previously created. We will then use the Load method of the Windows.Markup.XAMLReader to get the root of the XAML file. We save the reference to the root of the XAML file in a shared hashtable. The next thing that we do is to loop through the XML object and save a reference to the named controls in the hashtable as described in this post. This is shown in Figure 3.


Figure 3: Loading the XAML file into Powershell


Now before you go off using Visual Studio Express 2012 to create your Powershell UI you have to be aware that we have to remove the attribute x:Class="ProjectName.MainWindow" (where ProjectName is the name of actual Visual Studio Express 2012 project) from the MainWindows.xaml file. The x:Class="ProjectName.MainWindow" is only required by Visual Studio.  In our example the attribute would be x:Class="ProcessExceptions.MainWindow" as shown in Figure 3. If the attribute is not removed then Powershell will not load the XAML and will return the error shown in Figure 5.

Figure 4: MainWindow.xaml file In Visual Studio Express 2012

Figure 5: Error caused by x:Class="ProjectName.MainWindow" attribute

It is important to note that for a WPF based application to work in Powershell the Powershell instance has to be started as a Single Threaded Apartment (STA). This is the default setting in Powershell 3.0. However Powershell 2.0 starts as Multi-Threaded Apartment (MTA) by default. This means that a WPF based application would not work in Powershell 2.0. To get around this problem we would have to start up Powershell.exe with the -sta option in Powershell 2.0.


To make the UI tool wok in both both Powershell 2.0 and Powershell 3.0 we will execute the UI tool in a separate Powershell Runspace that has its ApartmentState set to STA. More on that later on.


Now that we have a reference to the named controls in the hashtable. We will bind the Click Event of the buttons to functions that we import from the Invoke-evEtoUIEbvents.psm1 module. The format for binding to a WPF control event is quite easy and it takes the form ControlName.Add_EvenetName ({ script block }). So to tie Click Event of the button named ImportSourceFile to the function Invoke-ImportSourceFile we would use the following Powershell code:


The complete code listing for binding the WPF UI Events to the functions in the Invoke-eVetoUIEvents.psm1 module is shown in Figure 6.


Figure 6: Binding the WPF button click events to Powershell functions

After we have bound the button Click Events to the Powershell functions we will display the WPF UI by calling the ShowDialog method of the Window.




Earlier in the blog post I had mentioned that we created a separate Powershell Runspace to execute the WPF UI code and that we set the ApartmentState of this runspace to STA.  The idea behind this is to ensure that we free up our Powershell process from the activities related to the WPF UI. So if the WPF UI were to crash it would not bring down the Powershell process. The other thing that we do is to create a synchronized collection, in this case a hashtable, that we can pass between the main Powershell process and any Powershell runspace that we might create. The hashtable will be used to hold the reference to the WPF UI elements, and we will be able to access these WPF UI elements in the Powershell process or other Powershell runspaces that we might create. However we should note that the Powershell Runspace that is executing the WPF code is running in a different thread and if we tried to access the WPF UI elements directly from the Powershell process or another Powershell runspace we will receive the following error:


“Invalid Operation Exception: The calling thread cannot access this object because a different thread owns”


This happens because the thread that creates the WPF UI elements owns the elements and other elements cannot interact with the WPF UI elements directly. This is know as thread affinity. To get around this problem we will use the Dispatcher and the Invoke method to update the WPF UI. This problem has been blogged about and an example of this can be found here.


In our UI tool we have created a hashtable called $eVetoUI that we use to hold the references to the WPF Window and the named WPF controls. We pass the $eVetoUI hashtable between the Powershell runspaces by using the SessionStateProxy.SetVariable method of the Powershell runspace.


We use the [Powershell]::Create().AddScript method to create a Powershell object and to assign a ScriptBlock to it. We then assign the Powershell object to the Powershell runspace that we created earlier and then we call the BeginInvoke() method of the Powershell object to execute the Scriptblock that displays the WPF UI.




Formatting the Exception File


One of the problems that we encountered when we first created the script, to delete the duplicate lines in the source file, was that the exception file that we created was not in the same format as the source file. So when we executed the script and it tried to find a line in the source file that was the same as the exception file it could not do so because the lines were not the same. Figure 7 shows a sample of the format of the exception file.


Figure 7: Sample of the Exception File

We got around this problem by using Notepad to edit the exception file so that the lines were formatted similarly to those in the source file as shown in Figure 8. However we wanted to be able to automate the formatting of the exception file so that we would not have a similar problem in the future.


Figure 8: Sample of the Source File

In looking at the format of the source file we realized that we could use a simple string replace to remove the NULL and the 00:00:00.000 from the exception file to match the source file. However the formatting of the date would prove to be a bit trickier, but it was a problem that a regular expression could solve. The regular expression that we came up with was “(\d{4})(-)(\d{2})(-)(\d{2})”. What this means is to match any part of a string that has 4 digits, then a hyphen, then 2 digits, then a hyphen, then 2 digits. If we are doing a regular expression replace in Powershell then the captured patterns are stored in special variables $1, $2. $3, etc. We can use these special variables in the regular expression replace to get the date formatted in the way that we want and in this case the date format 20130704 would be represented by the variables $1$3$5. A string replace of the hyphens in the date could have worked as well, but the regular expression is much cooler.



We wanted to have a generic solution that we could use to format other files for other scenarios, and so we decided to use an XML based file to hold the formatting instructions as shown in Figure 9.


 
Figure 9: The XML File with formatting instructions

When we parse this XML file we use a string replace with any child node under the StringFormatting, and we will use a [regex]::Replace with any child node under the RegexFormatting node. The code for this is shown in Figure 10.

Figure 10: Formatting the Exception File


The other thing that we did was to always check that the exception file was in the correct format before we attempted to delete the duplicates from the source file.

Figure 11: Checking that the exception file is correctly formatted



From Basic Script to Reusable Tool



In Figure 1 we showed you the basic Powershell script that we wanted to convert into a tool that any one could use. We decided to wrap the tool in a WPF UI to make it easier for other persons to use. If you look at the basic Powershell script you will see that it needs:



  • The path to the source file
  • The path to the exception file
  • The path for creating the cleaned source file
  • The name to give the cleaned source file



The basic script also does not catch any errors and provides no feedback to the user. The basic script evolved into the Invoke-RemoveExceptions function. This function is tied to the Click Event of the Remove Exceptions button. The Invoke-RemoveExceptions function does the following:



  • Retrieves the path to the source file from the eVeto UI
  • Retrieves the path to the exception file from the eVeto UI
  • If any of these paths are not provided it returns an error and exists
  • It checks that we have not previously created a cleaned source file. If we have the function ends.
  • It checks that the exception file is correctly formatted. If the exception file is not correctly formatted then the function ends.
  • It uses the [System.IO.File]::ReadAllLines method to read in the source file and the exception file into hashtables. Each line of the file is the key of the hashtable and the value of the hashtable is set to true. We use the [System.IO.File]::ReadAllLines method to open the file because it does so quickly.
  • It gets all of the keys in the exception hashtable and then tires to find a similar key in the source file hashtable. If the key is found it is deleted from the source hashtable and the evEto UI is updated when the duplicate line is deleted.
  • If we have deleted duplicate lines from the source hashtable it is saved to a file with the same name as the source file but with a .new extension appended to it.


The code listing for the Invoke-RemoveExceptions is shown in Figure 12.

Function Invoke-RemoveExceptions { <# .Synopsis This function is used to remove the exceptions from the source file. .Description The source file is compared to the exception file and similar lines are removed. The processed source file is saved with the same name as the original source file with extension new being added to the file. #> try { if (([String]::IsNullOrEmpty($eVetoUI.Control_SourceFile.Text)) -or ([String]::IsNullOrEmpty($eVetoUI.Control_ExceptionFile.Text))) { [System.Windows.MessageBox]::Show("Please import a source file and exception file for processing") return } $result = [System.Windows.MessageBox]::Show("Please ensure that you have formatted the exception file before continuing"
 "Warning", [System.Windows.MessageBoxButton]::YesNo) if ($result -eq 'Yes') { # Check that we have not already created a new source file $NewSourceFile = ("{0}.new") -f , $eVetoUI.Control_SourceFile.Text # If the backup file already exists then bail out if (Test-Path $NewSourceFile) { $FileExistMessage = "The file $NewSourceFile already exists.`n" $FileExistMessage += "Please remove or rename the file inorder to continue." [System.Windows.MessageBox]::Show($FileExistMessage) return } # Check that the exception file has been properly formatted # If the formatting file does not exist bail out $FormatExceptionFile = '.\formatexceptionfile.config' if (-not(Test-Path $FormatExceptionFile)) { $FileExistMessage = "The formatting file $FormatExceptionFile does not exist.`n" $FileExistMessage += "Please create the formatting file inorder to continue." [System.Windows.MessageBox]::Show($FileExistMessage) return } # Import the exception file $ExceptionFileTest = Get-Content $eVetoUI.Control_ExceptionFile.Text -ErrorAction Stop | Out-String $ExceptionFileIsOk = $true # Import the exceptions formatting rules from a config file $xmlformatting = [xml](Get-Content $FormatExceptionFile) # Check that the exception file has been correctly formatted if ($xmlformatting.formatting.StringFormatting.HasChildNodes){ foreach ($string in $xmlformatting.formatting.StringFormatting.String) { $stringformatting = $string.Split(";") if ($ExceptionFileTest.Contains($stringformatting[0].ToString())) { $ExceptionFileIsOk = $false } } } if ($xmlformatting.formatting.RegexFormatting.HasChildNodes){ foreach ($regex in $xmlformatting.formatting.RegexFormatting.Regex) { $regexformatting = $regex.Split(";") if ($ExceptionFileTest.Contains($regexformatting[0].ToString())) { $ExceptionFileIsOk = $false } } } if (-not ($ExceptionFileIsOk)) { [System.Windows.MessageBox]::Show('The exception file is not in the correct format. 
Please format the exception file using the Format Exception File button.') return } $eVetoUI.Control_Output.AppendText("`r`n`r`n") $eVetoUI.Control_Output.ScrollToEnd() $eVetoUI.Window.Cursor = [System.Windows.Input.Cursors]::Wait $eVetoUI.Control_Output.Cursor = [System.Windows.Input.Cursors]::Wait # The hash table used to process the files $Source = @{} $Exception = @{} # Import the files into the hash tables Write-PSOutput -ConsoleText "Importing Exception file ... `r`n" [IO.File]::ReadAllLines($eVetoUI.Control_ExceptionFile.Text) | % { $Exception[$_] = $true } Write-PSOutput -ConsoleText "Importing Source file ... `r`n" [IO.File]::ReadAllLines($eVetoUI.Control_SourceFile.Text) | % { $Source[$_] = $true } Write-PSOutput -ConsoleText "Removing exceptions from the source file ... `r`n" $ProcessedExceptions = $false foreach ($key in $Exception.keys) { #$eVetoUI.Control_Output.AppendText(".") #Remove row if it exists in source file if ($Source.ContainsKey($key)) { $ProcessedExceptions = $true $dupline = ("Removing duplicate line {0} `r`n") -f , $key Write-PSOutput -ConsoleText $dupline $Source.Remove($key) } } # Save the source file if ($ProcessedExceptions) { Write-PSOutput -ConsoleText "Saving the new source file as $NewSourceFile ... `r`n" $Source.Keys | Add-Content $NewSourceFile -Force -ErrorAction Stop [System.Windows.MessageBox]::Show(" Removal of exceptions completed.`n The file was saved as $NewSourceFile. ") } else { [System.Windows.MessageBox]::Show("No exceptions were removed from the source file.") } $eVetoUI.Control_Output.ScrollToEnd() $eVetoUI.Control_Output.Cursor = [System.Windows.Input.Cursors]::Arrow $eVetoUI.Window.Cursor = [System.Windows.Input.Cursors]::Arrow } else { return } } catch { if ($_.Exception.InnerException) { $ExceptionMessage = $_.Exception.InnerException } else { $ExceptionMessage = $_.Exception.Message } $eVetoUI.Control_Output.AppendText($ExceptionMessage) $eVetoUI.Control_Output.ScrollToEnd() $eVetoUI.Control_Output.Cursor = [System.Windows.Input.Cursors]::Arrow $eVetoUI.Window.Cursor = [System.Windows.Input.Cursors]::Arrow } finally { Remove-Variable -Name $Exception Remove-Variable -Name $Source } }
Figure 12: Code listing for the Invoke-RemoveExceptions function

The Invoke-RemoveExceptions function is wrapped in a try, catch block so that we can catch any exceptions that occur and update the eVeto UI. In addition to this we use the -ErrorAction Stop parameter with the Get-Content and the Add-Content cmdlets. This is because the try, catch block will not trap the non-terminating errors which are generated by these cmdlets, and the only way to catch these errors is by treating them as a terminating error.


In the UI tool we use a textbox named Output to display the results of an action or error messages. This textbox is updated by a function named Write-PSOutput. The update to the textbox happens by using the Dispatcher, and this allows us to access the textbox from the Powershell process.  




Summary


Turning a basic script into a UI tool is a pretty involved process, but the benefits of doing so far outweighs the time that it takes to do so. In our case we now have a tool that any of our IT Operational staff can use to quickly create a cleaned source file that we can upload into our transaction processing system. In addition to that we have made the tool generic enough to be used in other scenarios that would require deleting lines from a csv file.