Random Surfing with a twist 
[Mar. 28th, 200809:51 pm]

The famous random surfer model of pagerank (the not so secret sauce of google) is that you start surfing the net from a random point. Then every time you either follow one of the links on that page or you jump to another random page on the net. You keep on doing this indefinitely and probability of you landing on a particular page is its pagerank. Now of course this intuition needs to be quantified and turned into mathematical equations to compute the pagerank. This is how it is quantified. You have a coin which lands head with probability p and tails with 1  p. Everytime before going to next page you toss this coin. If it lands head you follow one of the links on the current page and if it lands tail you jump to a random page. Now a page might have many outlinks say k. So how do you decide which outlink to follow? You just select one of those links with equal probability 1/k. Also when you jump to a random page you just pick one of the pages with equal probability 1/n (where n is the total number of pages). We will define a teleportation vector which is the probability of jumping to each page, in this case all elements have same value 1/n.
So let us assume Google uses this mathematical model, does heavy number crunching and computes the pagerank of each page. Now what your random surfer does is that he doesnt jump to a random page with equal probability. Instead whenever the coin lands tail, he goes to Google, enters a random query and goes to the first result shown by Google. Now what is the probability of jumping to a page. It clearly is not 1/n because a page with high pagerank would have migh higher chance of being jumped to. It also depends on the set of queries for which a page is returned. But to simplify our model we assume that jump probability to a page is proportional to its pagerank. So now our teleportation vector will have the ith entry as (pagerank of i) / (sum of all pageranks). Now Google uses this vector and recomputes to get a new set of pageranks. But this recomputation once again changes the teleportation vector thus changing the pagerank thus changing the teleportation vector thus changing the pagerank and so on.
So is there a fixed point for this iteration? Is this problem interesting? Is there some work on this? 


Partitions and Compositions 
[Mar. 24th, 200809:26 pm]

A puzzle recently asked by a friend led me to consider various forms of partitioning a number. There are three ways essentially and for two of those I could find nice closed form solutions for number of partitions and for the third I could only find a recurrence.
1. Composition: A kcomposition of a positive integer n is k positive integers which sum up to n. So for example a 2composition of 4 would be 1+3 or 2+2 or 3+1. Notice that 3+1 and 1+3 are different compositions, so order of the k numbers matter. To find out the number of kcompositions, consider a scale on which we have marked numbers from 1 to n. Now we have to break the scale into k integral parts to form one composition. So a scale of length 4 can be broken at 1,2 or 3 to get different 2compositions. Breaking the scale means selecting k1 points at which we break out of n1 possible break points. This can be done in (n1)C(k1) ways which is our desired answer.
2. Weak Composition: This is same as composition except that now some of the parts can be of 0 length. 3Weak Compositions of 4 would be 0+1+3, 1+0+3, 2+0+2 and many more. Now how do we find number of kweak compositions (referred as kWC from here on). One natural way to think about this (the direction I first went in) is that removing all 0s from a kWC would give k'C where k'<=k. In fact for each k'C we can put (kk') 0s to form a kWC. In how many ways can we insert these 0s. We have to pick (kk') positions out of k positions where we will put zeros and this can be done in kCk' ways. There total number of kWC formed from k'C would be kCk' * (n1)C(k1). Adding this over all k' from 1 to k gives us Sigma(i=1 to k) kCi*(n1)C(i1).
But this is not a satisfactory solution and I do not how to manipulate this summation to simplify it further. So I started looking for a different way to arrive at solution. After lots of head scratching, suddenly out of nowhere this fact hit me. If I add one to each number in kWC of n I would get a kC of n+k. This is because in total I am adding k and also none of the numbers would now be 0. For eg adding 1 to each number in 0+1+3 I get 1+2+4 which is 3C of 7. After realising this fact it is not too difficult to prove that there is one to one correspondence between kWCs of n and kCs of n+k. And we already know the formula fo kCs of n+k which is (n+k1)C(k1). This is also the formula for number of kWCs of n.
3. Partition: This is same as composition except that now the order of numbers doesnt matter. So 1+3 and 3+1 are same partitions. I could not find a closed form solution for number of kpartitions but I could find a nice recurrence relation which can then be converted into dynamic programming algorithm. If 1 is the smallest number in a partition then remaining k1 numbers add to n1. Therefore number of such partition is same as (k1)Ps of n1. But if the smallest number is greater than 1 then what do we do. Well we can subtract 1 from each number to get a valid kP of nk. For eg subtracting 1 from each number in 2+2 gives me 1+1 which is a 2P of 42=2. Therefore number of such partitions would be same as number of (k)Ps of nk. So total number of kPs of n is (k1)Ps of n1 + kPs of nk. This could be easily converted to a DP algorithm with O(kn) space and time complexity.
Exercises for readers: a) Why wouldnt this same recurrence work for Compositions? c) Can you find a better algorithm for Partitions? 


Dependency Injection 
[Dec. 28th, 200702:45 pm]

I have been reading a bit about Guice, a dependency injection framework by Google. Here is a small introduction of dependency injection in general which I thought of just to clear my own understanding. Most of the terms and concepts here have been borrowed from this Guice document just reworded.
1. Prelude We have a class called Employees that return me the details of an employee with a particular Id. Now the employee details are stored in a database so Employees will need to open connection to database and fire a sql query. We might write it like this
public class Employees { JdbcDriver driver; public Employees() { driver = new MysqlJdbcDriver(); }
Employee getEmployee(int id) { driver.query("query to ger employee with id"); // populate Employee return Employee; } }
On one hand this is a good design because it hides the fact that things are store in a database from its users. But this comes at the cost of flexibility. Tomorrow if my records are moved to a a different type of database then I will have to change the souce of Employees. Bigger pain this causes is in unit testing. Only way to test this class as it is now is to create a test database, populate it with some sample fields and then test it. But then my tests can fail due to database errors even though the logic in this class is fine and they will be slow because of database queries. Ideal way to unit test such classes is to mock all the dependencies on external factors such as database in this case.
2. Dependency Injection That is possible only if the class allows us to set the driver either in its constructor or through some setter. Then we can pass it a dummy implementation of JdbcDriver. So we rewrite our constructor as
public Employees(JdbcDriver driver) { this.driver = driver; }
Now while creating Employees we can decide what kind of driver we want to pass to it. This is known as dependency injection where a class instead of creating its dependencies has them injected by the caller. Now this leads to two issues:
1. Lots of clutter in your caller such as to create an object of type A we might have go write something like this:
C c = new CImpl(); D d = new DImpl(); E e = new EImpl(); B b = new BImpl(c, d, e); A a = new AImpl(b);
2. Even though all we cared about is A we ended up knowing and importing B, C, D, E. This information should have been hidden from us.
3. Dependency Injection using Factory One way to solve this problem is by using factory. Factory as the name suggests is a Class that knows how to manufacture objects of a certain type. So AFactory would know how to create objects of type A.
public static class AFactory { public static A getInstance() { B b = BFactory.getInstance(); A a = new AImpl(b); return A; } }
This way all the clutter is moved to a different class away from your main logic. But well even though it is hidden the clutter is still there in your code leading to an indirect compile time dependency from Caller to B,C,D and E. And you might end up creating lots of these boiler plate factory classes. And if you look at the code of these factories they are all very similar and you would think it should be easy to automatically generate these.
4. Frameworks This is where dependency injection frameworks come into picture. Essentially what a framework does is that it allows you to specify all the objects you require and there corresponding dependencies either outside of code in a format like XML (as in Spring, a popular framework) or inside code but still in more of a declarative manner using annotations etc (Guice). Then when you ask the framework to give you a particular object it will first create all the dependencies for that object, then create that object and return it to you. Ofcourse to create dependencies, it might have to go one level further down and create dependencies of dependencies and so on recursively. Using a framework I can layout my whole dependency tree in a separate place and ask the framework to give me any node of that tree and it will oblige. So you can say in a separate configuration file (similar to Spring's xml config file):
id = a class = AImpl depends on b
id= b class = BImpl depends on c,d,e
id = c class = Cimpl id = d class = Dimpl id = e class = EImpl
and then in your code say: A a = MyDependencyframework.giveMe("a");
This will work like a charm. All you clutter has gone into a separate configuration file thus preventing any boilerplate from bloating your main logic. Also note that now your main class has to depend only on A at compile time. Another way of thinking about the utility of dependency frameworks is thus. An elegant but extreme way to design any object oriented system would be to first have an initilisation or bootstrapping process where all the primary objects that will be required in the lifetime of of the system are created and kept in the memory. Then the main logic takes over which involves interaction between these myriad objects. Ofcourse there will be many temporary objects that will be created and destroyed during the main loop such as Strings, Callbacks, Exceptions, Filewriters/readers etc but all the bulky, juicy objects containing the meat of your logic are created in the bootstrapping stage. Dependency frameworks basically allow us to specify this bootstrapping process in a simple way, separate from the main logic and then make it happen.
Next time I will perhaps try to write brief introductions to Spring and Guice and there comparison. 


Elegance of haskell 
[Nov. 15th, 200708:14 pm]

I did a functional programming course at IITB where we were taught Haskell ad pretty much fell in love with the language. It allowed me to write small, elegant yet very expressive programs. By expressive I mean by looking at the code you would be able to understand easily what its doing, especially a computer engineer who is well versed with the concept of recursion. Sometime back I was trying to think what is it that made programs in haskell so elegant. I came up with following reasons.
1. No side effects. What this means is that the only effect a function can have on the state of the system is through its return value. Why does this result in more elegant programs. Becase this allows us to compose functions easily. An example: Lets say you want to sort the array and then find the median element in it. In C a typical programmer would righ like this sort(a) median(a)
where a is a pointer to the array. In C sort function would usually modify the passed array rather than returning a new sorted arrat.
In haskell you would right it as median(sort(a)). And you can continue it for as many levels as you want.
square(median(sort(a))) etc etc
2. No loops. You cannot do loops in Haskell. I have always found loops to be very inelegant expecially that horrible for loop line. It is just boiler plate code. So in Haskell you are forced to use recursion everywhere. Example: In C for (int i = 0; i < lenght; i++) sum = sum + a[i]
In haskell sum x:xs = x + sum xs
An aside: This is why I love the foreach construct in Java so much. It is so succint and elegant yet so clear. No longer do I have to worry about those horrible .next(), .hasNext() etc etc.
3. Lists are the basic data structure in haskell which in C and other languages arrays are not that native. Things like list comprehension make your program so much shorter and elegant. Example: To create s subarray containing only positive elements In Java (even after using the foreach construct) List<Integer> positives = new ArrayList<Integer>() for (Integer number: inputList) { if (number.intValue() > 0) { positives.add(number); } }
In Haskell positives inputlist = [y y< inputlist, y > 0]
4. Ofcourse Functions being first class objects which can be passed around, comosed, have some of there arguments specified to form new functions etc etc. 


Idle thought about Websites as services 
[Nov. 15th, 200707:22 pm]

Surfing net over a datacard is a convenient but also a frustrating experience. The crawling bytes remind of the dial up days which (after last 1 year on broadband) seems like last century. But what has surprised and exasperated me the most is that even the static content on most of sites that I frequent such as reader, yahoo, gmail etc also takes up much of the bandwidth each time. I have no clue why is it not cached by the browser? My vision of these web sites/services is that only thing that should usually go up and down the wire should be data. With HTML pages it was not easily doable as the formatting information and the actual data were too tightly coupled usually with the same file containing both. With css and javascript it became easier to separate some of formatting and event handling from the data. Now with the proliferation of AJAX, I think we have finaly reached the stage where data can be completely freed up from the whimsy of software. That is the usual model in traditional softwares and now even in websites we have reached the stage where we can achieve such seperation. I would like to see every single site as completely ajaxified with hardly any static html and all the data being got though asynchronous calls. First time a site is loaded on your browser a bundle of javascript + css (maybe huge) is got over the wire and cached on your machine. This is the software. From now onwards this javascript will alwyas be loaded from you machine and it will make AJAX calls to get just the data. These thoughts then led me to a different but related direction. Typically any application has three layers. One is the dumb data layer which just takes care of storage of data and its retrieval such as file system, database etc. Above it is the main business logic layer which understands the data, knows how to process it, interpret it, marshalls the data around, puts it in right place. And third layer is the user interaction layer which presents this data to the user and interprets there action to modify the data. Currently if you look at any of the big sites they have there homegrown solutions for all three of these tiers. But why should this be so? Why cant we have organisations existing at just one of the level of the ecosystem. To take an example: 1. We have companies that just provide scalable distributed storage systems of various kinds such as Amazon S3. But I would like to see many varied stroage solutions like Bigtable, database etc. 2. We have companies that provide an rss feed subscription service which manage all my subscribed feeds. It know what items I have subscribed to, what have I read on what feed and so on. It uses one of the companies at level 1 to store actual data. But it doesnt do any UI of its own, it just exposes APIs for people to be able to access it. 3. We have large number of feed readers which you can point to any of the service at level 2 and they wil pick up your subscriptions from there. And I can move form one reader to another yet not losing any trace of my feed consumption so far.
I think there are various advantages of this. Most important I think is the fact that people with innovative UI ideas will not have to worry about te issues of actually building the underlying datastore that powers that UI. We will see much more choice, innovation at Level 3. We are moving in that direction with most big and small players releasing web service APIs to there data but I think many of these are crippled and will not actually allow you to build say a competitor to Amazon.com using there own data. But I hope we will reach there at some stage. I havent thought about the business model for the companies at each of the three levels. But I think a Level 1 you will have something similar to S3 where you have to pay for the storage. At Level 2 and 3 I think you can share ads revenue. 


A really nice puzzle 
[Jul. 16th, 200712:10 am]

In a room, a row of 100 randomly ordered cards are placed with their face down, each has a unique number between 1 to 100 (= random permutation). 100 players (also numbered 1 to 100) are allowed to the room one at a time. Each player can peek at 50 cards  looking for his own number. The players can't communicate or transfer any kind of information after the game started, but they agree in advance on a strategy that will guaranty a surprisingly high probability that *all* of them will see their numbers. What is the strategy?
Some clarifications: 1 Player who goes inside the room can only peek at 50 cards, but he must leave the room in the state as it was when he entered it. Which means that he cannot rearrange the cards or leave some card faced up. 2. This is not a trick question which means that there really cannot be any kind of information transferred between two players once the game has started. Which rules out all such solution as the time he spends inside the room gives some clue to others. You got the point. 3. I am not asking for a solution that will maximise the number of peaople who see there card. I am looking for a strategy that will give a high probability for all of them seeing there number. B high we mean something around 2030%. Why is this high? Because just randomly opening cards will only give them a probability of (1/2^100) of all seeing there card.
It is quite amazing that just by cleverly peeking at the cards you can do so much better. A really Gem of a puzzle. Source “7 Puzzles You Think You Must Not Have Heard Correctly”. by Peter Winkler. 


Generating the entire season of EPL with just the final League Table 
[Jun. 11th, 200709:41 pm]

Football season has just ended and you are given the final league table which shows for each team how many matches it won and how many it lost (assume that there are no draws). What it doesn't show is the outcome of individual matches. So various question that arise are:
1. Generate a possible set of outcomes for all matches which result in this table. 2. Enumerate all such sets and/or number of such sets efficiently. 3. Given the table what is the probability that Arsenal beat Chelsea.
So for example if we had 3 teams each having won one match exactly then there would be two possible set of outcomes. 1. A beat B beat C beat A 2. A beat C beat B beat A On the other hand if one team had won 2 matches and one had won 1 match then there would be only one possible set of outcome.
I dont know if these questions have interesting/elegant answers. I haven't been able to come up with anything good so far for questions 2 and 3. Question 1 though is easy. I have also tried to formulate this in terms of some graph theroretical problem but no luck so far. I think these kind of constraints where you say exactly K events should happen (in this case Team A should win exactly K matches) are harder to model.
Any thought? 


GWT and Javascript 
[Jun. 1st, 200709:13 pm]

Past couple of months I have been developing a UI using GWT. For those who have not heard of this, it is a toolkit which allows you to write your ajaxified web application in a restricted Java and converts it to javascript automatically. It is pretty neat, provides a swing like api and RPC mechanism for communicating with servlets. But there are certain lessons I have learnt the hard way while using it. Although these are not specific to GWT but are more about Javascript and browser quirks, but its easy to ignore these when you are programming in GWT as you are not directly dealing with javascript and browser. But believe me sooner or later they will come to hit you hard. These are
1. Manipulating DOM is expensive. I had to create a table having O(1000) rows on the fly and it just killed the browser. User starts getting those "slow script" warnings which make for very bad user experience. One way I used to avoid this was to just render that portion of the table which is visible currently and when user scrolls catch the scroll event and render the newly visible rows. Another way would be to split your rendering script into multiple tasks each one executed on click of a 1ms timer. This will not decreasing the rendering time but you will not get those annoying warnings.
2. All browsers have a limit on number of HTTP connections that they open to a single domain. And by default this limit is set to 2. What this implies is that if you have some asynchronous calls which take a long time to return, then all your others call will be held up because of this. Browser will queue them and only when the pending calls return it will send them. To avoid do not ever have calls that server takes long time to serve. Trick I used for this was that my call would just schedule the computation on server (in a separate thread on server) and would return immediately saying that computation has been scheduled. Then i would periodically check for the status and when the computation is over server would return me the result.
3. This is specific to GWT. GWT does lots of error checks and other sanity checking stuff on many DOM manipulation calls. These can sometimes slow things down noticably. Keep an eye out for these and whenever necessary instead of calling GWT function write your own function that avoids all those checks and just does the crux , which you can copy paste from the corresponding GWT function.
4. Even though GWT takes care of most of the browser specific quirks there are still some places where you will have to handle different browsers differently. For example getting tops of a scroller and the widget being scrolled return different values in different browsers. And in safari getting top of a table row returns wrong value (this is a safari bug), a neat way to take care of this is to take top of any cell in that row, which returns the correct value.
But all in all it was a really nice experience using GWT. Give it a shot!! 


Algorithm for Rooting a tree 
[Jun. 1st, 200708:44 pm]

Given an acyclic graph, can you root it (that is find a node which we call the root) such that the height of the tree is minimum. Here by height of the tree we mean maximum height of a leaf node, where height of a node is distance from root node. O(n^2) is straightforward. Can you do it in linear time? I have a solution which I think is correct. I will post it later. What if we want to minimise the average height of the nodes? 


navigation 
[ 
viewing 
 
most recent entries 
] 
[ 
go 
 
earlier 
] 
 

