
Thursday, March 24, 2011

New Computer

Well, not really a new computer but a new hard disk with a new operating system: I just moved from Windows 7 32-bit to Windows 7 64-bit. My laptop has 4GB of memory, so the transition was just a matter of time (the 32-bit version could only use ~3GB of it). I chose to do it now because my computer was giving me a hard time with weird exceptions that just wouldn’t go away…


So what did I install?

  1. IIS – if you install it before installing VS some options will be missing
  2. VS2008 – still have some legacy code and the guys in charge don’t want to upgrade to 2010, go figure…
  3. Visual Studio 2008 Team Foundation Client
  4. VS2008 SP1
  5. VS2010 – if you are going to install the full SQL Server 2008, don’t install the bundled SQL Server Express 2008 (it makes it very difficult to install the Management tools or a real server down the line)
  6. VS2010 SP1
  7. Silverlight 4 Tools for VS 2010
  8. Silverlight 4 For Developers
  9. Microsoft Expression Blend Software Development Kit (SDK) for Silverlight 4
  10. Silverlight Toolkit
  11. fiddler2 – a must-have for Silverlight developers
  12. SQL Server 2008 – now here I had some trouble: it seems there is a bug in the installer that causes errors if the installation window is not on top (I took it a step further and closed all other windows)
  13. ArcEngine 10 for developers
  14. ArcObjects SDK for the Microsoft .NET framework
  15. ArcGIS License Manager
  16. ArcGIS API for Silverlight/WPF version 2.1
  17. MyGeneration – I use it for my Code Generation
  18. Visual Studio 2010 Feature Pack 2
  19. DB Comparer – a free tool to compare two databases
  20. ArcSDE for Microsoft SQL Server
  21. ArcGIS Server for the Microsoft .NET Framework – GIS Services
  22. ArcGIS Server for the Microsoft .NET Framework – Web Applications
  23. Windows Live Writer – really need to write those blog posts
  24. Notepad++
  25. DropBox
  26. NUnit

VS Extensions:

  1. Resharper – 6 EAP version
  2. Spell Checker
  3. All Margins
  4. PowerCommands for Visual Studio 2010


This time I think I will back up my system; it is just too long a list to go through again…

Wednesday, March 23, 2011

Debugging Tip

Unit tests are good and nice, but there comes a time when you must run your application and see if it works. Some programs are easier to debug: you just mark them as the startup project, add a breakpoint and you are done.

In other places, like a web project, you need to attach the debugger to the relevant process: w3wp.exe

But when you are dealing with a Windows Service, and that service fails with an exception on startup or finishes running really fast, you need something else.

That something is the Debugger class. All you have to do is add a using for System.Diagnostics and add the code:

    Debugger.Launch();

The next time you run your application/service you will get the nice debugger window:


Just choose your debugger and you are done.

This works with all the project types so if you feel like using it with a console application (like I just did) you can.

(My bug was, of course, forgetting a little “-” in the parameters for the console application…)


Keywords: C#, debug, windows service


FluentMigrator: Introduction

Well first of all how much time do you spend upgrading your DB from version X to X+1?

Is it because you are doing a lot of actions?

Is it because the server is slow or too far?

How do you manage the process (the order of the steps and the scripts)? Excel? TFS?

Who does the Upgrading? A DBA? A programmer?

Who writes the scripts to upgrade the code? A DBA? A programmer? (I once had a system where a programmer wrote the scripts but an applicative DBA (me and two others) checked them)

Do you want something different?

No – Ignore this: if you are satisfied with the current status quo please don’t change it. This change is only useful if you need it.

Yes – Read this


FluentMigrator is a programming interface that allows quick and easy creation of DB changes by programmers with next to no experience in SQL (you will need some general experience with DB systems). All you do is save blocks of DB changes in code, each marked with a version. This lets you upgrade and downgrade your DB between versions quickly and easily.

All the coder needs to do is write a class, like this:

    using FluentMigrator;

    // The attribute carries the migration version number as a long.
    [Migration(201008240930)]
    public class User : Migration
    {
        // Changes applied when upgrading to this version.
        public override void Up()
        {
            Create.Table("USERS").InSchema("GIS")
                .WithColumn("Id").AsInt32().PrimaryKey().Identity()
                .WithColumn("Name").AsString();
        }

        // Changes applied when downgrading from this version.
        public override void Down()
        {
            Delete.Table("USERS").InSchema("GIS");
        }
    }

The attribute takes a long that serves as the migration version number.

The class inherits from the Migration abstract class.

The Up method contains the changes to the DB for upgrading to this version.

The Down method contains the changes to the DB for downgrading from this version.


The syntax is very simple. After typing Create. you get, by the wonder of IntelliSense, these options:

(Column, ForeignKey, Index, Schema, Table)

After choosing Table you get:

(InSchema, WithColumn)


With every action you get a different set of actions you can take from that point on. You already set the schema? Well, FluentMigrator won’t let you enter it again. You skipped the schema? FluentMigrator will use the user’s default schema.



In the next post I will write about how FluentMigrator manages that in its code.


And that’s it. The console application is given to you from the framework.

Upgrading is as simple as running:

migrate -a Company.DbDeployment.Migrations.dll -db SqlServer2008 -conn "Password=PASS123;Persist Security Info=True;User ID=USER123;Initial Catalog=DB_NAME;Data Source=localhost" -profile "Debug" -t=migrate --version=23452351

Downgrading is as simple as running:

migrate -a Company.DbDeployment.Migrations.dll -db SqlServer2008 -conn "Password=PASS123;Persist Security Info=True;User ID=USER123;Initial Catalog=DB_NAME;Data Source=localhost" -profile "Debug" -t=rollback:toversion --version=0

(I actually enhanced the console a bit by removing the “-t” and relying only on the version number)


Well, now let’s answer the questions from the top of the post.

For the teams I have been in:

  • The process of upgrading (we never did a downgrade) took anywhere from a few hours to a day, and when the server was far away or slow that time usually multiplied.
  • The list of actions to do was usually managed in Excel (scripts were saved on the TFS) and usually had lots of manual tasks. My current team used a notebook/human memory approach…
  • The upgrading was always done by our applicative DBA.
  • The scripts were written by programmers and DBAs.

With FluentMigrator:

  • The scripts are written by the programmers in code (right now it is only me but I am working on that…)
  • It can be run by anyone (even the IT guy, who doesn’t know what an index is, can do it).
  • The list of actions is managed in code in the TFS.
  • The whole process is automatic, meaning we don’t waste a DBA’s time.
  • The whole process can be run on the DB server, meaning distance won’t be an issue (but if the server is slow it can’t be helped).


In the next several posts I am going to write about how the inside of the framework is written, how to extend the framework, and even give you my extension for migrating SDE layers (unfortunately I can’t share the ArcEngine DLLs on GitHub because of licensing, so sharing the code that way would only mean it never compiles, a waste of time…).



Getting Started With FluentMigrator : Your first migration

Github: FluentMigrator Project


Keywords: FluentMigrator, Database upgrade, DB upgrade, upgrade, downgrade, Fluent, example, unit test, console, slow server, distance, DBA, TFS

Starting with MVVM (in Silverlight)

For the past two weeks I have been working on both a new feature and converting the existing project to work with MVVM. Until now the XAML had code behind to handle all of the functionality and even for some stuff like combo box items we used code behind (where data binding would have been a lot easier).

I decided to spend some time and start converting the project. One of my teammates advised me to check out the MVVM Light framework. He even pointed me towards a good presentation in MIX 2010 on this subject.

So where should you start?

I found the best place to start is a control/window that has simple actions. There you can do the data binding without worrying about much else, and it is a great stepping stone to other types of controls.

For example, let’s implement the classic: CalcControl. The control just adds two numbers when the Add button is clicked.

The MVVM class:

    using GalaSoft.MvvmLight;          // ViewModelBase
    using GalaSoft.MvvmLight.Command;  // RelayCommand

    public class CalcViewModel : ViewModelBase
    {
        public RelayCommand AddCommand { get; private set; }

        public CalcViewModel()
        {
            FirstNumber = 0;
            SecondNumber = 0;

            // The Add button is bound to this command, which calls Add().
            AddCommand = new RelayCommand(Add);
        }

        private void Add()
        {
            Result = FirstNumber + SecondNumber;
        }

        private const string FirstNumberPropertyName = "FirstNumber";
        private int _firstNumber;

        public int FirstNumber
        {
            get { return _firstNumber; }
            set
            {
                _firstNumber = value;
                RaisePropertyChanged(FirstNumberPropertyName);
            }
        }

        private const string SecondNumberPropertyName = "SecondNumber";
        private int _secondNumber;

        public int SecondNumber
        {
            get { return _secondNumber; }
            set
            {
                _secondNumber = value;
                RaisePropertyChanged(SecondNumberPropertyName);
            }
        }

        private const string ResultPropertyName = "Result";
        private int _result;

        public int Result
        {
            get { return _result; }
            set
            {
                _result = value;
                RaisePropertyChanged(ResultPropertyName);
            }
        }
    }

The Control:

    <UserControl x:Class="MyControl.CalcControl"
        xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
        xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
        xmlns:d="http://schemas.microsoft.com/expression/blend/2008"
        xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
        xmlns:toolkit="http://schemas.microsoft.com/winfx/2006/xaml/presentation/toolkit"
        mc:Ignorable="d"
        d:DesignHeight="300" d:DesignWidth="400">

        <Grid x:Name="LayoutRoot" Background="White">
            <StackPanel HorizontalAlignment="Center" VerticalAlignment="Center">
                <TextBlock Text="First Number"/>
                <toolkit:NumericUpDown Height="22" Value="{Binding FirstNumber, Mode=TwoWay}" HorizontalAlignment="Left" VerticalAlignment="Top" Width="88" />
                <TextBlock Text="Second Number"/>
                <toolkit:NumericUpDown Height="22" Value="{Binding SecondNumber, Mode=TwoWay}" HorizontalAlignment="Left" VerticalAlignment="Top" Width="88" />
                <Button Content="Add" Command="{Binding AddCommand, Mode=OneWay}" Width="88" />
                <TextBlock Text="{Binding Result, Mode=OneWay}"/>
            </StackPanel>
        </Grid>
    </UserControl>

(The input values are bound TwoWay, which is important; the result is bound OneWay; and the Button is bound to the command.)

Using it is as simple as:

    <MyControl:CalcControl x:Name="calc" DataContext="{Binding CalcData}"/>

(in your page, where CalcData is a property of the page’s view model class)

Or if you are still using code behind:

    calc.DataContext = new CalcViewModel();

The result (be warned it’s not pretty but it works):



After finishing those types of controls, the next step is preparing the ground for the big guns: tweaking the MainPage. I am sure many of you are shouting “How is that simple? Do you know how much code I have there?”. Well, 1. this is a really small change and 2. I have ~700 lines of code in the MainPage at this moment (so just wait).

As I said, we are going to make a minor change: adding a BusyIndicator control that wraps around the MainPage. The BusyIndicator control (for those who don’t know) makes it easy for the user to know when the application is busy, and it makes all the controls inside it read-only while it is.

We will use the control with its two properties, IsBusy and BusyContent (you can use it with custom busy content as well).

We are going to implement something like this:


The end result is this:


The Locator class:

    using Microsoft.Practices.Unity;

    public class ViewModelLocator
    {
        public static IUnityContainer Container { get; private set; }

        static ViewModelLocator()
        {
            Container = new UnityContainer();

            // One shared MainViewModel instance for the whole application.
            Container.RegisterType<MainViewModel>(new ContainerControlledLifetimeManager());
        }

        public MainViewModel Main
        {
            get { return Container.Resolve<MainViewModel>(); }
        }

        public static void Cleanup()
        {
            Container.Resolve<MainViewModel>().Cleanup();
        }
    }

(This class is used to bind the MainViewModel in the MainPage, since that is the page that is loaded automatically. You can also register a dummy implementation for design time in the constructor by using:

    //if (ViewModelBase.IsInDesignModeStatic)
    //{
    //    Container.RegisterType<IDataService, Design.DesignDataService>();
    //}
    //else
    //{
    //    Container.RegisterType<IDataService, DataService>();
    //}

which was taken from the MIX10 sample source code. I haven’t implemented it yet…)


    using System;

    public class BusyMessage
    {
        public BusyMessage(bool isBusy, BusyReason reason)
        {
            IsBusy = isBusy;
            Reason = reason;
        }

        public bool IsBusy { get; set; }
        public BusyReason Reason { get; set; }
    }

    // Each reason is a separate bit so that several can be combined.
    [Flags]
    public enum BusyReason
    {
        NotBusy = 0,
        JustFeelLikeIt = 1,
        JustBecause = 2,
        DoIRealyNeedAnotherReason = 4
    }

(The BusyMessage contains two properties: IsBusy and the Reason for being busy. Reason is a flags enum, which lets you combine several reasons.)


    using GalaSoft.MvvmLight;
    using GalaSoft.MvvmLight.Messaging;  // Messenger

    public class MainViewModel : ViewModelBase
    {
        private const string DefaultBusyMessage = "Busy...";

        public MainViewModel()
        {
            IsBusy = false;
            BusyMessage = DefaultBusyMessage;
            BusyReason = BusyReason.NotBusy;

            // Listen for every BusyMessage sent through the Messenger.
            Messenger.Default.Register<BusyMessage>(
                this,
                m => HandleBusyMessage(m.IsBusy, m.Reason));
        }

(Pretty straightforward: setting the default values and registering for the BusyMessage.)

    private string GetBusyMessage()
    {
        if (BusyReason == BusyReason.NotBusy)
            return DefaultBusyMessage;
        return StringUtils.EnumToSentence(BusyReason);
    }

(This method just converts the enum to a readable sentence. I am using another generic method that first converts the enum to a string and then converts Pascal case to regular text.)

    private void HandleBusyMessage(bool isBusy, BusyReason reason)
    {
        if (isBusy)
        {
            BusyReason |= reason;   // add the flag
            if (!IsBusy)
            {
                BusyMessage = GetBusyMessage();
                IsBusy = true;
            }
        }
        else
        {
            // Clear the flag. &= ~ only ever removes it, whereas ^= would
            // switch the flag back on if it was not set in the first place.
            BusyReason &= ~reason;
            if (BusyReason == BusyReason.NotBusy)
                IsBusy = false;
            BusyMessage = GetBusyMessage();
        }
    }

(This method handles the message by using the BusyReason Flags attribute: |= adds a flag, &= ~ removes a flag.)

    protected BusyReason BusyReason { get; set; }

    #region MVVM Properties

    private const string IsBusyPropertyName = "IsBusy";
    private bool _isBusy;

    public bool IsBusy
    {
        get { return _isBusy; }
        set
        {
            _isBusy = value;
            RaisePropertyChanged(IsBusyPropertyName);
        }
    }

    private const string BusyMessagePropertyName = "BusyMessage";
    private string _busyMessage;

    public string BusyMessage
    {
        get { return _busyMessage; }
        set
        {
            _busyMessage = value;
            RaisePropertyChanged(BusyMessagePropertyName);
        }
    }

    #endregion

(The properties being used, BusyReason is not used outside this class)



    <UserControl.DataContext>
        <Binding Mode="OneWay" Path="Main" Source="{StaticResource Locator}"/>
    </UserControl.DataContext>

(this just binds the MainViewModel to our page since it is the first page being loaded)

    <controls:BusyIndicator BusyContent="{Binding Path=BusyMessage}" IsBusy="{Binding Path=IsBusy}">

(and this is the bound BusyIndicator control)


Now using it is as easy as:

    Messenger.Default.Send(new BusyMessage(true, BusyReason.JustFeelLikeIt));

(to activate the BusyIndicator with a message of Just feel like it)


    Messenger.Default.Send(new BusyMessage(false, BusyReason.JustFeelLikeIt));

(to deactivate the BusyIndicator for the Just feel like it reason)

And again the end result is this:


So, what do you think?


One last piece of advice: watch the classes where you use Messenger.Default.Register, because you will have to clean those classes up. For example, I used it in a control to draw on a map when entering edit mode but forgot to clean it up, and the next thing I knew the drawing was done several times. Another drawback of not cleaning up is a possible memory leak, since the class is still referenced by the Messenger even after you stop using it. To clean up a whole class use:

    Cleanup();

To remove a certain message registration in the current class:

    Messenger.Default.Unregister<BusyMessage>(this);

(if you need to remove it from another class just replace the “this”)


After finishing this example almost everything else is quite easy. The one difficult thing I did encounter was switching the AutoCompleteBox to MVVM, but I found a blog post or two that describe the process (we used the Populating event), and I hope they are going to help me along… I will let you know how it goes later on.


Keywords: Silverlight, MVVM


Restarting SQL Server by code

For the past several months we have been having trouble with our DB. IT can’t find the cause, and the only application that doesn’t start up right is my team’s application, because it uses ArcObjects COM objects to connect to the SDE.

To investigate this I started by writing a simple coded unit test that tests the connection. The test wasn’t automatic in any shape or form: I had to pause it at a breakpoint, stop the DB service, start it again and then see if the application worked.

I decided I had to do this right and write an automatic unit test that restarts the local SQL Server service.

The code is fairly simple:

  1. private void RestartSqlServer()
  2. {
  3.     var controller = new ServiceController {MachineName = ".", ServiceName = "MSSQLSERVER"};
  5.     if (controller.Status == ServiceControllerStatus.Running)
  6.         controller.Stop();
  8.     controller.WaitForStatus(ServiceControllerStatus.Stopped, new TimeSpan(0, 0, 1, 0));
  9.     if (controller.Status != ServiceControllerStatus.Stopped)
  10.     {
  11.         throw new Exception("Couldn't stop SQL Server.");
  12.     }
  14.     controller.Start();
  15. }

line 3: Starting the ServiceController with the current machine and service name which was taken from:

Computer Management->Services and Applications->Services->SQL Server (MSSQLSERVER)->Properties:


lines 5-6: we only stop the service if it’s running

lines 8-12: we wait up to a minute for the service to stop, and if it doesn’t we throw an exception

line 14: starting the service again


That’s it!

Now to start those unit tests up…


Keywords: SQL Server, Unit Test, code, service

Thursday, March 17, 2011

Visual Studio 2008 hangs on Add Service Reference


The message in the Event Viewer:

Log Name:      Application
Source:        Application Error
Date:          17/03/2011 17:16:42
Event ID:      1000
Task Category: (100)
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      local
Faulting application name: devenv.exe, version: 9.0.30729.1, time stamp: 0x488f2b50
Faulting module name: KERNELBASE.dll, version: 6.1.7601.17514, time stamp: 0x4ce7b8f0
Exception code: 0xe053534f
Fault offset: 0x0000b760
Faulting process id: 0x1b2c
Faulting application start time: 0x01cbe47ca767c4ca
Faulting application path: C:\Program Files\Microsoft Visual Studio 9.0\Common7\IDE\devenv.exe
Faulting module path: C:\Windows\system32\KERNELBASE.dll
Report Id: 900eb868-50a9-11e0-9c2d-001f161db0b8

Every time I tried adding a service reference to a .NET 4.0 WCF service this error popped up and VS restarted. It only happened in one project, which was a .NET 3.5 class library (in another such project the reference was added without a problem), and Add Web Reference worked fine…

It was on a proof of concept deal so I didn’t want to get into the hassle of creating the proxy by hand (especially since the service’s code was in one TFS server and the client code was on another TFS server – couldn’t just add a project reference).


The solution in the end was pretty simple:

Start the “Visual Studio 2008 Command Prompt” and type:

svcutil http://localhost/ProjectName/SomeService.svc


Take the generated code and add it to your project.


Keywords: Error, VS2008


A Visual Studio Tip

(Not a tip of a day, I am not that or that crazy…)

Works on Visual Studio 2008 and 2010.

I don’t know about you guys and gals but I like to listen to music while I program using:


(Shoutcast has lots of stations so I never grow tired of the songs and it’s free)

and I also like to browse the web while I am building my solution.

Sometimes I get absorbed in reading posts in Google Reader while waiting and waste time reading when I am supposed to work. My first solution was changing the browser size so that it took ~90% of the screen (leaving the bottom showing the Visual Studio notification area) – but sometimes I just wasn’t looking and it didn’t help…

My second solution, and the one I recommend for people who use headphones, is changing the Windows sounds for Visual Studio. It is very easy and it makes the build process more enjoyable*.

How to:

Control Panel –> Change System Sounds:

Find the program Microsoft Visual Studio and add sounds for Build Failed and Build Succeeded:


(I also had another program, without a name, with the Build Failed/Succeeded categories. I added the sounds there too; maybe it’s for 2008?)

Now just put the sound files in your projects folder; then you can change them whenever you want…

Restart Visual Studio: an important step, or it doesn’t work.


* The build process is more enjoyable to me because I added funny sounds:

Can you really tell me you’ll be upset after hearing Bugs tell you “you’ve got troubles…”?

Please comment and tell me what sound effects you choose for your build.


Keywords: Sound, Visual studio, build


Monday, March 14, 2011

Semantic Similarities

For the last year I have been working on my final project for my Master’s degree in Computer Science. My college, the Academic College of Tel-Aviv-Yaffo, doesn’t require a thesis but uses a combination of a final test (covering core subjects from both the Bachelor and the M.Sc. degrees) and a final project done with one of the Doctors/Professors in my college. My project, with Professor Gideon Dror, is on semantic similarities, and I am nearly done: all that is left is to present my work in front of my professor and a faculty member.

I have decided to first present my work here and then actually do the presentation.

The project was done mostly in Python (which I had no knowledge of beforehand) and its first part was done as a possible contribution to the NLTK library.

The first part of the project was about implementing methods to find semantically similar words using input triplets of Context Relation. A Context Relation triplet is two words and their relation to each other, extracted from a sentence. It was shocking to find in the end that NLTK hasn’t implemented a way to extract Context Relations from a text (they have a few demos done by human hand), and it seems that implementing this requires a knowledge of linguistics that I just don’t possess.

The second part of the project was to extract the Semantic Similarities of words from the web site Yahoo Answers. The idea is that with enough data extracted from different categories an algorithm can be used to determine the distance of the words.


On to the presentation:



For this discussion we will ignore the part “without being identical”. In this project identical is included in similar.


Are Horse and Butterfly similar? The first response should of course be NO, but it depends: comparing horse to butterfly to house reveals that horse and butterfly are the similar pair. It all depends on the context…


Likewise comparing a horse to a zebra the response would be YES. But looking at a sentence such as:

The nomad has ridden the ____

and looking at horse, zebra and camel which is more similar in this context?


This time the only similarity between these words is the way they are written and pronounced. Their context relations should be very dissimilar no matter the text. But imagine using a naive algorithm that only counts the number of occurrences of each word in a text: is it really inconceivable to get a close number of occurrences for these words?


Humans use similarities to learn new things, a zebra is similar to a horse with stripes. But it is also used as a tool for our memory, in learning new names it helps to associate the name with the person by using something similar. For example to remember the name Olivia it could be useful to imagine that person with olive like eyes.

In software, search engines use similar words to get better results. For example, a few days ago I searched for a driver chair and one of the top results was a video of a driver seat.

Possible future uses for similar words could be in AI software. There is a yearly contest named the Loebner Prize for “computer whose responses were indistinguishable from a human's”. If we could teach a computer a baseline of sentences and then advance it by using similar words (like the learning of humans) it could theoretically be “indistinguishable from a human's”.

Imagine having the AI memorize chats, simply by extracting chats in Facebook or Twitter. Then have the AI extend those sentences with similar words. For example, in a real chat:

- Have you fed the dog?

Could be extended to:

- Have you fed the snake?

(some people do have pet snakes… and I can imagine a judge trying to trip an AI with this kind of question…)


A simple definition: if we have a sentence containing word, and we replace word with word’ and the sentence is still correct, then the two words are Semantically Similar. From now on Similarity means Semantic Similarity.


From the examples we can see that Similarity is all about Context, Algorithm and Text.


As we could see in the examples, the Context of the words makes a large difference to whether or not two words are similar. Unlike Algorithm and Text, it has nothing to do with the implementation of finding the similarity.


Some Algorithms use Context Relation to give value to the context the words are in. Extracting Context Relation from text is a very complicated task that has yet to be implemented in NLTK; the library does have a couple of examples that were created by hand.


A simpler option is to look at all the words within a distance of 4 words from the word Horse. One of the Algorithms we will examine uses this as a simpler Context aspect.
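
Here is a minimal Python sketch of that window idea. It assumes a plain whitespace tokenizer, and words_in_radius is a name made up for this example rather than a function from the project:

```python
from collections import Counter


def words_in_radius(tokens, target, radius=4):
    """Count the words found within `radius` positions of each `target` occurrence."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        # Clamp the window to the start and end of the token list.
        start = max(0, i - radius)
        end = min(len(tokens), i + radius + 1)
        for j in range(start, end):
            if j != i:  # skip the target word itself
                counts[tokens[j]] += 1
    return counts


sentence = "the nomad has ridden the horse across the wide desert for days"
neighbours = words_in_radius(sentence.lower().split(), "horse")
```

Words further than 4 positions away (here "for" and "days") never reach the counter, so they contribute nothing to the Context of horse.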


Another form of Context extraction is separating the text based on category. Then each category adds a different Similarity value and those can be added together.


Algorithms that ignore the Context of the word are therefore less accurate than those that use it, but the ones that use it are also more complex. The complexity can come simply from using Context Relation (with its complex extraction) or from using a words radius, which means individual work for each word.

All the Algorithms use some form of counting mechanism to determine the Similarity/Distance between the words.


Depending on the Algorithm, a different scoring is done for each word. Then the Algorithm determines how to convert that score into the Distance between the words, which just means calculating the Similarity.


Text is a bit misplaced here because it is a part of the Context and is used inside the Algorithms. Choosing the right text therefore is as essential a part as choosing the right Algorithm.

But imagine a text that contains only the words:

This is my life. This is my life…

All the practical Algorithms shown here will tell you that “this” and “life” are Similar words – based on this text alone.
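
A quick Python check of that claim, as a sketch that assumes the simplest possible scoring (plain occurrence counts after lower-casing and dropping full stops; the helper name occurrence_counts is made up for the example):

```python
from collections import Counter


def occurrence_counts(text):
    """Lower-case the text, drop full stops and count every word."""
    words = text.lower().replace(".", " ").split()
    return Counter(words)


counts = occurrence_counts("This is my life. This is my life.")
# "this" and "life" end up with exactly the same count, so a score built
# only on these numbers cannot tell the two words apart.
same_score = counts["this"] == counts["life"]
```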


In my second implementation of Similarity Algorithms I used extracted text from several categories of Yahoo Answers. Yahoo Answers is a question+answer repository that contains thousands of questions and answers. For my Algorithms I had to extract 2GB of data from the site (just so I had enough starting data).


The Algorithms can be separated into two groups: those that use Context Relation (and are therefore purely theoretical until an extractor for Context Relation is implemented), and those that use a Category Vector as a form of Context for the words.


All the Context Relation Algorithms use these two inner classes: Weight and Measure. Weight is the inner class that gives a score to a Context Relation; the Weight is important since a Context Relation that appears only once in a text should not have the same score as one that appears ten times. The Measure inner class calculates the distance between two words using the Weight inner class. Using only these classes, the user can be given a Similarity value for two words.
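
To make that division of labour concrete, here is a small Python sketch. The class names FrequencyWeight and JaccardMeasure, the triplet layout (word, relation, other word), the sample triplets and the choice of a raw-count weight with a weighted Jaccard measure are all assumptions for illustration, not the classes from the project:

```python
from collections import Counter, defaultdict


class FrequencyWeight:
    """Weight of a (relation, other word) context for a word: its raw count."""

    def __init__(self, triplets):
        self.contexts = defaultdict(Counter)
        for word, relation, other in triplets:
            self.contexts[word][(relation, other)] += 1

    def vector(self, word):
        return self.contexts[word]


class JaccardMeasure:
    """Weighted Jaccard: sum of minimum weights over sum of maximum weights."""

    def __init__(self, weight):
        self.weight = weight

    def similarity(self, first, second):
        a, b = self.weight.vector(first), self.weight.vector(second)
        keys = set(a) | set(b)
        total_max = sum(max(a[k], b[k]) for k in keys)
        if total_max == 0:
            return 0.0  # neither word has any context
        return sum(min(a[k], b[k]) for k in keys) / total_max


# Hypothetical, hand-written Context Relation triplets.
triplets = [
    ("horse", "obj-of", "ride"),
    ("horse", "obj-of", "ride"),
    ("horse", "obj-of", "feed"),
    ("camel", "obj-of", "ride"),
    ("camel", "obj-of", "feed"),
    ("butterfly", "obj-of", "catch"),
]
measure = JaccardMeasure(FrequencyWeight(triplets))
horse_camel = measure.similarity("horse", "camel")
horse_butterfly = measure.similarity("horse", "butterfly")
```

Swapping in another Weight or Measure only means replacing one of the two classes; the other stays untouched.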

The Algorithms in this section implement near-neighbor searches. We use them to find the K most similar words in the text, not just how similar two words are.


Taken from James R. Curran (2004)-From Distributional to Semantic Similarity

In my theoretical work I implemented some of the Weight and Measure inner classes from James R. Curran’s paper From Distributional to Semantic Similarity.


I am not going to go into lengthy discussion on how they work because the paper discusses all of this.

I am going to say that the Similarities turn out different for each combination of Weight X Measure and that it is fairly easy to set a combination up or to implement a new Weight/Measure class.


The classes I chose to implement are taken from Scaling Distributional Similarity to Large Corpora, James Gorman and James R. Curran (2006). These classes are used to find the K most similar words to a given word.


The simplest algorithm is a brute force one. First we calculate the Distance Matrix between our word and all the rest of the words in the text and then we search for the K most Similar words.

The disadvantage of this algorithm is that the calculation for finding the K-nearest words for “hello” can’t be reused for the word “goodbye” (actually only one calculation can be reused here, the one between “hello” and “goodbye”).
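
The brute-force search can be sketched in a few lines of Python. This is an illustration only: k_most_similar, the toy score table and the similarity callback are assumptions made up for the example, and a real Measure would compute the scores from the text.

```python
import heapq


def k_most_similar(word, vocabulary, similarity, k):
    """Score `word` against every other word and keep the k highest scores."""
    scored = (
        (similarity(word, other), other)
        for other in vocabulary
        if other != word
    )
    return [other for _score, other in heapq.nlargest(k, scored)]


# Hypothetical, hand-written similarity scores for four words.
toy_scores = {
    frozenset(("hello", "goodbye")): 0.6,
    frozenset(("hello", "hi")): 0.9,
    frozenset(("hello", "horse")): 0.1,
    frozenset(("goodbye", "hi")): 0.5,
    frozenset(("goodbye", "horse")): 0.2,
    frozenset(("hi", "horse")): 0.1,
}


def toy_similarity(first, second):
    return toy_scores[frozenset((first, second))]


vocabulary = ["hello", "goodbye", "hi", "horse"]
nearest_to_hello = k_most_similar("hello", vocabulary, toy_similarity, k=2)
nearest_to_goodbye = k_most_similar("goodbye", vocabulary, toy_similarity, k=2)
```

Each call walks the whole vocabulary again; the only score the two calls have in common is the one for the hello/goodbye pair.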

I am not going to go into the other implementations here since they are more complex. I might write another post in the future about those algorithms.

If you are interested, the Python implementation can be found here (or you can just read Scaling Distributional Similarity to Large Corpora).


There are two practical Algorithms that I have implemented.


This simple algorithm is very fast and can be preprocessed for even faster performance. By simply saving the count of each word per category, the Algorithm can be made as fast as reading the preprocessed file. In small examples of just 50MB of data the Algorithm took only a few seconds to produce a result. Using the full 2GB of data it takes ~10 minutes to produce a result for ~350 word pairs. Because of the large amount of data, though, it must be opened in chunks (a chunk per category) or an Out of Memory exception is thrown.
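
A rough Python sketch of the preprocessing idea follows. The function names, the tiny in-memory corpus and the use of cosine similarity to compare the two category vectors are all assumptions for illustration; all that is stated above is that the count of each word per category is saved.

```python
import math
from collections import Counter


def count_words_per_category(texts_by_category):
    """Preprocessing step: one Counter of word counts for every category."""
    return {
        category: Counter(text.lower().split())
        for category, text in texts_by_category.items()
    }


def category_vector(word, counts_by_category):
    """The word's count in each category, in a fixed (sorted) category order."""
    return [counts_by_category[c][word] for c in sorted(counts_by_category)]


def cosine_similarity(first, second):
    """Cosine of the angle between two count vectors (0.0 for an empty vector)."""
    dot = sum(x * y for x, y in zip(first, second))
    norm = math.sqrt(sum(x * x for x in first)) * math.sqrt(sum(y * y for y in second))
    return dot / norm if norm else 0.0


# Hypothetical miniature corpus: category name -> extracted text.
texts = {
    "food": "bread butter bread butter jam",
    "pets": "dog cat dog",
    "sports": "football soccer football goal",
}
counts = count_words_per_category(texts)
bread_butter = cosine_similarity(
    category_vector("bread", counts), category_vector("butter", counts)
)
bread_football = cosine_similarity(
    category_vector("bread", counts), category_vector("football", counts)
)
```

With real data the Counter for each category would be written to disk once, so a later comparison only has to read the saved counts, one category at a time.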


The end of the Algorithm is identical to the first Algorithm, but the Words Radius Algorithm clearly has more vectors. Not only is preprocessing this data time consuming (it takes ~5 days), it is also space consuming (from 2GB to 15GB). Just preprocessing the data caused at least 10 Out of Memory exceptions (to keep memory under control I had to call gc.collect() manually after every category).

The calculation time for ~350 word pairs was ~25 hours, which of course can’t be used in real-time AI conversations. With the preprocessing, though, it hardly matters whether there are 350 or 35k words to compare: it will take approximately the same time. For example, three categories of ~120MB with ~350 pairs take ~56 minutes, but 3 pairs take ~30 minutes.


It’s important to note that both Algorithms give close results. For example, bread,butter has a Similarity of 0.56, which is pretty high.


As can be seen, Basic almost always gives a greater result than Words Radius. Not only that, Basic has some weird results, such as Maradona being more Similar to football than soccer is, even though in many places (not the USA) the two are used as synonyms, whereas Words Radius seems to think soccer is more similar to football.

Since Words Radius actually uses a form of Context Relation (though not a very lexical one) it is considerably more accurate.


Remember I claimed it was all about the text? Well, these results were produced with just a few categories, and suddenly Arafat is similar to Jackson. How weird is that?

Another difference is the calculation time: the Simple Algorithm takes ~17 seconds where the Words Radius Algorithm takes ~56 minutes.

BTW remember night and knight? Well, the Simple Algorithm returned a Similarity of 0.79 for those 3 categories… And Words Radius returned a Similarity of 0.48.


If you want to read more about all of this subject here are some available books:

Natural Language Processing with Python

Foundations of statistical natural language processing

Python Text Processing with NLTK 2.0 Cookbook

Online Resources:

(Semantic) Similarity-Blog – a blog with past research on Semantic Similarity (unfortunately it seems not to be updated)

Google Ngram Viewer – Google have a free tool that returns a Graph with the number of occurrences for each word by year. When the graph has the same shape you can assume the words are Similar. For example for the words dad, mom the graph is:


And this is for peace, war:


Give it a try – the results are returned in less than 3 seconds…


So do you have any questions? Suggestions? Too long? Too short?

Tell me what you think…


Keywords: similarity, NLTK, search, AI
