Thursday, April 30, 2009

Portfolio Assignment 9

For our team’s final project we decided to build off of the in class lecture on the Prisoner’s Dilemma. We were excited by the in class battle of strategies, where the teams that scored the most points against the other teams won. When deciding what direction to take our project we decided to use the Genetic Programming we leaned in class and apply it to creating strategies for Prisoner’s Dilemma. In class we learned about Genetic Algorithms leading to the development of successful strategies in checkers, and we decided that a similar approach could be used for Prisoner’s Dilemma.

Prisoner’s Dilemma is explained thoroughly at http://en.wikipedia.org/wiki/Prisoner's_dilemma, for our game the payoffs are: (1, 1) (5, 0) (0, 5) (3, 3).

For the purposes of our project we decided that a game will consist of 20 rounds where the prisoner can decide to cooperate or defect. On the first round the prisoner will not know what the other prisoner is doing, on the second round the prisoner will know what the other prisoner did in the first round, on the third he will know the results of the first and second round, and on the fourth he will know the results of the first, second, and third round. For every round after that through 20, the prisoner will only remember the results of the most recent three rounds.

The strategy employed by the prisoner will be hard coded before the start of the game, so a strategy describe the action (cooperate or defect) of the prisoner for the 1st, 2nd, 3rd, and 4th-20th rounds depending on the actions of the other prison in the previous rounds. This strategy will be stored in a binary string that has a 0 or 1 (cooperate or defect) for the 1st round, 2nd round depending on the 1st, 3rd depending on the 1st and 2nd, and the action to take for the rest of the rounds depending on the past three rounds.

Our program creates 100 random binary strings that represent strategies and then those strategies are placed on a 2-dimensional grid, represented by colored squares. Each square or strategy plays a 20 round Prisoner’s Dilemma with the squares bordering it. The total points accumulated in the games it plays are stored and then compared to all the other strategies. Top performing strategies are kept, and new strategies are created by the mutation and cross-over of the top strategies.

Lastly each square is given a “niceness” value depending on the total number of scenarios the strategy will cooperate in. This number then corresponds to a color which the strategy’s square is colored in on the grid; nicer strategies are lighter colors while meaner strategies (more likely to defect) are darker colors. This creates a visualization so that users of the program can see dominant strategies win and take over parts of the grid (with their color) as the program runs.

I think our group project was a great success and really showed off how much we had learned about genetic algorithms as well as the Prisoner’s Dilemma. I think that one weakness of our final project was that after a while most of the strategies become similar and therefore are playing predominately similar versions of themselves. An easy fix to this would be to randomly insert newly generated strategies into the grid every so often to test the winning strategies against vastly different strategies.

Saturday, April 25, 2009

Portfolio Assignment 8

Chapter 11 Evolving Intelligence
Genetic Programming


Genetic programming is a branch of genetic algorithms, the main difference between genetic programming and genetic algorithms is the representation of the solution. Genetic programming creates computer programs while Genetic algorithms create a string of numbers that represent the solution.

The first step is to generate an initial population of random compositions of the functions and terminals of the problem (computer programs). Then execute each program in the population and assign it a fitness value according to how well it solves the problem. To create a new population of computer programs:
– Copy the best existing programs
– Create new computer programs by mutation.
– Create new computer programs by crossover (breeding).

Most programming languages when compiled or interpreted are first turned into a pars tree. The terminal set consists of the variables and constants of the programs. The functions are several mathematical functions, such as addition, subtraction, division, multiplication and other more complex functions. Create a ROOT node with a random associated function and then create as many random child nodes as necessary (Children may have children).

Ranking functions create new population of computer programs; each function generates a score on how well is does and they are ranked from best to worst. The best programs are copied for the next population and new programs are created through mutation and crossover of the different best functions .

Something that is often overlooked is the importance of diversity: Choosing only the very best leads to populations extremely homogeneous, crossover will not affect the population because it leads to more of the same . Allowing completely new programs to be added to the mix will increase the chances of finding new and better combinations of functions, and regardless the worst programs will always be eliminated eventually

Portfolio Assignment 7

Programming Collective Intelligence
Chapter 4: Searching and Ranking


Chapter 4 covers full-text search engines, which allow users to search large sets of documents for a list of words and then ranks results according to relevancy.

The first step in creating a search engine is to build a big collection of documents. This is commonly done by crawling the web or spidering. You start with a small set of pages and follow the links on all the pages to find more and more documents. The next step is to index the documents. This is done by breaking all the unstructured documents into words and their location (path or URL). The index is a list of all the different words along with their locations in the documents. In this section the author uses SQLite to store the index, in an embedded database. I downloaded the code from the author's website and put the searchengine.py and nn.py in my Python/Lib folder. I downloaded the .py files for Beautiful Soup. Installing it was just a matter of copying the .py files to the Lib folder in Python. Next, I installed pysqlite and downloaded the searchindex.db file which was linked to in the book. The built-in page downloader library, urllib2 downloaded an HTML page and can print out characters throughout the page, at different locations and ranges. urllib2 used in combination with BeautifulSoup will parse HTML and XML documents. Adding the page and all the words to the index will create links between them with their locations in the document. The python code to do this allows the crawler to index the pages as it goes, however this causes it to take a long time to run. Returned is the list of all the URL IDs containing “word,” resulting in a successful full-text search. The code for Querying and content-based ranking was added to the searcher class resulting in new functionality, namely ability to query and rank based on factors of normalization, word frequency, document location, and word distance. Finally the chapter introduces the SimpleCount and PageRank algorithms. All in all chapter 4 was very successful in introducing and then expanding my knowledge of both searching and ranking using a variety of different approaches that have their own advantages and disadvantages.

Portfolio Assignment 6

Clustering IMDB

My group used the plot data from IMDB for our clustering of movies. The file had some header information followed by the data. The first step to clustering the data was to truncate the plots.txt file. The original file has over 2 million lines of text. To separate and count every word of that file would take ages. To truncate the file, we created a function, that goes through every line in file and collects all the movie names into a list. The second step was to take all the movie names and select 1,000 random ones. Since we have kept the order of all the movies in the file in the names list, we just created a list of random indexes with a length of 1,000 and sort the random indexes. Then when we ran through the whole list of movies again, we only stoped when we were at the nth item in the random numbers list and began copying that movie to the truncated movie file. This makes the process of selecting the random movies out of the whole file required only one pass through the file. When returned the list contained exactly the number of movies specified with no duplicates.


Creating the Word Vector

The next was to create the word vector. To do this, we dissected out all the lines of text (once again) and breaking each line of text into words. The words are then counted and added to the count of the specific movie name in the dictionary, wordcounts. This is the code we used:

def createmovievector(truncfilename="%s/truncplots.txt"%root,outfile="%s/movievector.txt"%root): apcount={} wordcounts={} movies=[] fi=codecs.open(truncfilename,'r',codec) for line in fi: if line.startswith('MV:'): moviename=line[4:-1] movies.append(moviename) elif line.startswith('PL:'): wc=getwordcounts(line[3:]) wordcounts.setdefault(moviename,{}) for word in wc: wordcounts[moviename].setdefault(word,0) wordcounts[moviename][word]+=wc[word] for word,count in wc.items(): apcount.setdefault(word,0) if count>1: apcount[word]+=1 fi.close()

The next code segment collects the list of words that appear within the range of frequency defined.

wordlist=[] for w,bc in apcount.items(): frac=float(bc)/len(movies) if frac>0.002 and frac<0.2: out="codecs.open(outfile,'w',codec)">


Clustering the Data

After the movie vector is created, you can read the file in using the clusters.readfile() function. So here is the code we used to create the clustered movie text:

movies=truncmoviefile()createmovievector()movienames,words,data=clusters.readfile("%s/movievector.txt"%root)clust=clusters.hcluster(data)printclust(clust,codecs.open("%s/clusters.txt"%root,mode='w',encoding='mbcs'),movienames)


Conclusions

After looking at the resulting file, you may notice that most of the movies are not popular or known by anyone. This is because IMDB an enormous number of movies stored in its database. Our decision to choose 1,000 random movies made the results difficult to evaluate because of our unfamiliarity with the movies in our clusters. Another issue we noticed is that the keywords for which the movies are clustered on are mostly character names. This has the effect of clustering movies based on what character names the movies have in common. In effect, if you could recognize the movies that were clustered, you would find that the movies are clustered based on how the makers of the movie perceive the roles of that character name. For example, if Jack is a common name in some movies they will be clustered closer together.

In order to prevent some of these issues, we should choose only the 1,000 most rated movies, instead of random ones. This could be done by using the data from user ratings and selecting the movie names from that file, then using the movie names with the plots data to write the movie list file. Then we would cluster the most popular movies. There are some potential problems with this. However, there is still the issue of only getting Hollywood’s perception of character names, instead of similarities of movies. To handle this, we could use the quotes movie data from IMDB (which stores the character names as the first text on the line before the colon). With this file, we could just erase or not store any occurrences of words that appear before a colon at the beginning of a new line in the corresponding movie plot data. Therefore eliminating or at least limiting the effect of character names on our clustering results.

Sunday, March 15, 2009

Assignment 5

Visualization

An example: TouchGraph Facebook Browser, found at visualcomplexity.com

This visualization makes use of Facebook's friend relationships, seeing who is friends with who, as well as the tagged photos feature, seeing who is in photos with who.

I downloaded the application to my facebook page and was instantly able to see the concentrations of my friends. Obviously the largest concentration was Mary Washington, with Washington D. C. second, and Virginia Tech, George Mason, and Battlefield High School also evident. I then selected how many friends to show in my graph, and changing that number dynamically changed the layout of the graph, with connections and concentrations appearing as well as shifting. The lines in between people indicate that those users are friends on facebook, and the number on the line is the number of photos which those two users are tagged in together, blank if there are none.
What is most impressive is that with this information the graph atomically clusters groups of my friends and displays them in different colors. When I had the graph display 150 of my friends it correctly separated my friends on the Track & Field team(purple) from my friends that I made freshman year in my dorm(red). On the left side are grouped my friends that do not go to Mary Wash, with many from my high school graduating class (orange), and many from my high school Track team (blue).

When I had the graph display all 600+ of my facebook friends it took several minutes to process but eventually the results were impressive.



2008 Regular Season MLB Team Stats
I found these statistics on the CBS sports web page, and imported them into ManyEyes.com
Analysis: The data includes statistics per team over the course of the 2008 regular season, with hitting and pitching statistics. Including hits, runs, home runs, batting avg, opponent batting average, and pitching ERA, as well as several other stats.
In the first graph, there is an extremely strong positive correlation between hits and batting average, the size of the dots represents home runs which are evenly spread over the distribution. In fact the team with the third most hits has a very small number of home runs and the White Sox, who have the most home runs, are close to the medium for hits and batting average. The graph becomes more interesting if the y-axis is changed to "runs," this shows that teams with more home runs (regardless of number of hits) are more likely to score more runs, with one exception being the Minnesota Twins.
In the second graph, i have on the x axis the number of runs scored by the team, and on the y-axis the number of runs scored by opponents. With just this information there is no clear correlation, with teams spread out all over the graph. However when i make the dot size the number of strike-outs pitched by the team, there is a clear indication that teams with more strike-outs let up less runs, while teams with few strike-outs pitched tend to give up more runs.
Many relationships amongst data can be explored by using the three-variable scatter plot. I have found it very interesting as well as powerful for bringing distinction to trends and correlations that are not always clear by just looking at an excel spreadsheet.

Friday, March 6, 2009

Assignment 4

Clustering

Chapter 3 of PCI introduces data clustering, which is a method for discovering and visualizing groups of data. Using Python and feedparser I generated a list of URLs and then a list of words that will be used in the counts of each blog. Using theses two lists I created a text file containing the word counts for each blog. This dataset was assembled to explain commonalities between popular blogs by analyzing their usage of different words.

Hierarchical clustering builds up a hierarchy of groups by continuously merging the two most similar groups, with the results usually being viewed in a dendrogram which displays the nodes arranged into their hierarchy.

Using Python, I read in the word count file, used Pearson correlations, and then the hierarchical algorithm to create the clusters. Next to be able to better visualize the clustering results I used Python to draw a dendrogram of the results:

The chapter then went on to describe Column Clustering and K-Means Clustering. By clustering the results of the desired possessions from the zebo.txt file, the following dendogram was produced.

Finally the chapter closed by discussing ways to diagram data in two dimensions. Using a Python algorithm that scales the distance between items and then draws them onto a 2D plane I created the following representation of the list of blogs.

Part II:

I'm having trouble creating a data list that will work with the Python code.

Saturday, February 14, 2009

Assignment 3

My group used PHP to implement our music recommendation system with the Last.FM

Ross with his extensive knowledge and experience in PHP led development.

The web site is available at http://serinodesigns.com/lastfm/

Some features are:

  • You can search by artist or by artist and track. This will get similar elements to your search query.
  • For each artist that is found a request is made to get their top album. This album is displayed as well as the artist name, and similar tracks if applicable.
  • Resulting albums are displayed as well as the artist name, and similar tracks if applicable.
  • You can do automatic new searches by simply clicking on an artist's name or on their track if applicable.
  • You can search for a Russian blog mp3 for a given artist.
  • You can directly link to an artist's last.fm profile.
  • Album art is displayed.

A search for The Beatles results in the top albums for John Lennon, Paul McCartney, Paul McCartney and the Wings, Ringo Starr, The Who, Pink Floyd, The Rolling Stones, The Beach Boys, and The Kinks.

A track search for "Yesterday" by The Beatles results in the top tracks "Imagine" by John Lennon, "My Generation" by the Who, "Paint it Black" by the Rolling Stones, "Pinball Wizard" by the Who, "Sympathy for the Devil" by the Rolling Stones, "Lola" by the Kinks, "You've Really Got Me" by the Kinks, "House of the Rising Sun" by the Animals, and "Down on the Corner" by CCR.

The site in my opinion is very successful at allowing a user to type in their favorite artists and tracks to discover hopefully new artists and songs to listen to, expanding their musical expertise.