Note: this page is under construction. Details will follow.
Below is a list of suggested projects. You are highly encouraged to come up with alternative ideas. Ideally, we are looking for projects that are both innovative and contribute something, either to the Internet at large, or to some smaller group of users. The focus should be on modern Internet technologies. This includes, but is not limited to:
- Google technologies (iGoogle, GData, Google Maps, Google Earth, GWT, Google Analytics).
- Open source projects such as Firefox, Thunderbird
- VoIP, video on demand (such as the Skype API)
- Large collaborative projects such as Wikipedia.
Here are some links to projects that we consider cool, but don't know how to develop into bigger workshop projects.
Not yet categorized
Projects in this category should belong to another, properly defined category.
Remember Pandora, the Internet radio station that offered personalized broadcasting according to your taste, based on past experience?
Make something like that, not necessarily audio-based: academic papers from the arXiv, jokes, poems, recipes are all good candidates.
You might even earn some money if your algorithm beats the Netflix challenge!
- Yehuda Koren from Yahoo! Research gives a talk about the Netflix challenge in the CS colloquium. Sunday, Nov 9, 11:15-12:15, Schreiber 309.
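As a starting point for the personalized recommendation idea, here is a minimal sketch of user-based collaborative filtering: items a user has not rated yet are ranked by the ratings of users with similar taste. The data layout (a dict of user to item ratings), the function names and the sample ratings are assumptions made for illustration; a serious Netflix-challenge entry would need far more than this.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse rating dicts (item -> rating)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, k=3):
    """Rank items the user has not rated by the similarity-weighted
    average rating given by other users."""
    scores, weights = {}, {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        if sim <= 0:
            continue  # ignore users with no overlap in taste
        for item, rating in their.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(scores, key=lambda i: scores[i] / weights[i], reverse=True)
    return ranked[:k]


# Hypothetical ratings of arXiv papers on a 1-5 scale.
ratings = {
    "alice": {"paper_a": 5, "paper_b": 4},
    "bob": {"paper_a": 5, "paper_b": 5, "paper_c": 5, "paper_d": 1},
    "carol": {"paper_e": 4},
}
print(recommend(ratings, "alice"))  # ['paper_c', 'paper_d']
```

Carol shares no rated item with Alice, so her opinion is ignored; Bob's taste overlaps, so his favourite unseen paper comes first.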
Server-assisted web page prefetching
Can one speed up web page downloads by saving unnecessary round-trip times?
For instance, 90% of users connecting to cnn.com's web server just want to retrieve the main page. Can't the server start sending this information immediately after the TCP connection is established?
Similarly, when one asks for an HTML document, can't the web server already send all the associated files (CSS, JPG, etc.) automatically without the user having to request them individually?
The round-trip time is negligible for those living in the US, but for users in the rest of the world, it is a non-trivial issue, which becomes much worse if one works through a satellite connection.
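One building block for such a server is a way to work out which files a page will ask for next. The sketch below uses Python's standard html.parser to list the stylesheets, scripts and images referenced by an HTML document. The class and function names are hypothetical, and how the server would actually push these files to the client is left open.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collects URLs of files a browser would request after the HTML."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "stylesheet" in rel and attrs.get("href"):
            self.resources.append(attrs["href"])
        elif tag in ("img", "script") and attrs.get("src"):
            self.resources.append(attrs["src"])


def resources_to_push(html):
    """Return the referenced resources in document order, without repeats."""
    collector = ResourceCollector()
    collector.feed(html)
    collector.close()
    return list(dict.fromkeys(collector.resources))


page = (
    '<html><head><link rel="stylesheet" href="/style.css">'
    '<script src="/app.js"></script></head>'
    '<body><img src="/logo.jpg"><img src="/logo.jpg"></body></html>'
)
print(resources_to_push(page))  # ['/style.css', '/app.js', '/logo.jpg']
```

Each entry in the result is one request, and so potentially one round trip, that the client would otherwise have to issue on its own.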
A search engine for modern English usage
This is something that bothers me on a day-to-day basis.
I have some English expression or idiom in mind, but I'm not quite sure how it's supposed to be used: is it supposed to be used with "for", "to", "at"? Does it mean exactly what I think it means?
For instance, take "For whatever it's worth". How is it supposed to be used in a sentence?
Searching Google is quite disappointing, as it reveals articles whose header is simply "For whatever it's worth", whereas what I wanted to see is how this expression is used in a real sentence. The same goes for many other expressions like "so be it", "I can't help thinking that", etc.
Google searches for titles and keywords, and I want exactly the opposite: I want to search inside long bodies of text, such as the Gutenberg Project (although it's dominated by old-fashioned English), blogs, news stories, etc.
Taking this idea one step further, the search engine could try to look for close approximations to the idiom. This is helpful in case I don't precisely remember the idiom, and only remember one word of it.
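To make the approximate-matching idea concrete, here is a small sketch that scans a body of text sentence by sentence and keeps the sentences containing a run of words close to the requested idiom. It relies on difflib.SequenceMatcher from the standard library; the function names, the threshold and the toy corpus are assumptions, and a real engine would need an index rather than a linear scan.

```python
import re
from difflib import SequenceMatcher


def split_sentences(text):
    """Very rough sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(s):
    """Lower-case word tokens, keeping apostrophes (it's, can't)."""
    return re.findall(r"[a-z']+", s.lower())


def find_usage(corpus, phrase, threshold=0.8):
    """Return (score, sentence) pairs, best first, for sentences that
    contain a window of words resembling the phrase."""
    target = words(phrase)
    n = len(target)
    hits = []
    for sentence in split_sentences(corpus):
        tokens = words(sentence)
        best = 0.0
        # Slide a window of n words over the sentence.
        for start in range(max(1, len(tokens) - n + 1)):
            window = " ".join(tokens[start:start + n])
            score = SequenceMatcher(None, window, " ".join(target)).ratio()
            best = max(best, score)
        if best >= threshold:
            hits.append((round(best, 2), sentence))
    return sorted(hits, reverse=True)


corpus = (
    "For what it's worth, I think the plan is sound. "
    "She stared at the cat for hours. So be it, he said."
)
for score, sentence in find_usage(corpus, "for whatever it's worth", 0.7):
    print(score, sentence)
```

Here the misremembered "for whatever it's worth" still retrieves the sentence that uses "for what it's worth", which is exactly the kind of result a title-oriented search tends to hide.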
Net-based spell checker
The idea in one line: detect suspicious constructions in a document by running lots of Google queries on parts of the sentence.
For instance, one of the main problems for non-native speakers is the proper use of prepositions: is it "at", "on", or "to"?
Almost all non-native speakers make such mistakes all the time.
Here's how a Google spell checker could solve it:
Assume I write something like "I then stared on the cat for hours", not knowing that the correct preposition is "at" and not "on".
Microsoft Word does not detect the error. But Google can! Just run a Google search for "stared on the cat" (0 results) and "stared at the cat" (6,000 results).
A few more examples:
- Is it "work during the weekend", "work over the weekend", or maybe "work in the weekend"? Run Google search and see what you get!
- How about "If I may express my thoughts openly"? Is this a legal expression? A Google search revealed only two hits, so I tend to think that it is not. But I don't know what the correct way of saying it is (nor do I know how to search for the correct expression in this case).
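The preposition check from the cat example can be sketched as follows. Since querying Google requires a real search API, the hit counter here is a stand-in that counts phrase occurrences in a local text; count_phrase, rank_prepositions and the sample corpus are hypothetical names and data, and a real project would plug in whatever search back end it has access to.

```python
import re


def count_phrase(corpus, phrase):
    """Stand-in for a search-engine hit count: exact phrase occurrences."""
    pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
    return len(re.findall(pattern, corpus.lower()))


def rank_prepositions(template, candidates, count_fn):
    """Fill each candidate into the template and sort by hit count."""
    counts = {c: count_fn(template.format(prep=c)) for c in candidates}
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


corpus = (
    "He stared at the cat. They stared at the cat all day. "
    "Nobody stared on the dog."
)
ranking = rank_prepositions(
    "stared {prep} the cat",
    ["on", "at", "to"],
    lambda phrase: count_phrase(corpus, phrase),
)
print(ranking)  # [('at', 2), ('on', 0), ('to', 0)]
```

A checker built on this would flag the written preposition whenever some alternative gets a much higher count; how large the gap must be before raising a warning is a design question for the project.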
Cold boot attack on RSA
Project suggested by Adi Shamir.
Inter-channel wireless communication
We would like to establish a communication channel between two encrypted wireless channels.
Assume two computers are connected to two different encrypted wireless networks operating on the same channel.
Can the two computers communicate simply by using the interference caused by collisions among their transmissions? This can be really cool, and might even be publishable as a research paper (although I can't think of any application).
I think it also raises some interesting combinatorial questions about the best way to perform such communication.
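One very simple scheme to experiment with, before touching real hardware, is to signal one bit per group of time slots: the sender transmits heavily to cause collisions for a 1 and stays quiet for a 0, and the receiver watches whether its own frames get lost. The simulation below is only a sketch under these assumptions; the slot model, the loss and miss rates and all names are hypothetical, and real 802.11 behaviour (back-off, retransmissions) is ignored.

```python
import random


def send_bits(bits, repeat=5):
    """Sender: for each bit, interfere (1) or stay silent (0) for `repeat` slots."""
    return [b for b in bits for _ in range(repeat)]


def channel(slots, loss_rate=0.1, miss_rate=0.1, rng=None):
    """Receiver's view: True for each slot in which its own frame was lost.

    loss_rate: chance of a loss without any interference.
    miss_rate: chance that interference fails to cause a loss.
    """
    rng = rng or random.Random(0)
    observed = []
    for interfering in slots:
        if interfering:
            observed.append(rng.random() >= miss_rate)
        else:
            observed.append(rng.random() < loss_rate)
    return observed


def decode(observed, repeat=5):
    """Majority vote over each group of `repeat` slots."""
    bits = []
    for i in range(0, len(observed), repeat):
        group = observed[i:i + repeat]
        bits.append(1 if sum(group) * 2 > len(group) else 0)
    return bits


message = [1, 0, 1, 1, 0, 0, 1]
received = decode(channel(send_bits(message), loss_rate=0.0, miss_rate=0.0))
print(received == message)  # True on a noiseless channel
```

Repeating each bit and taking a majority vote is the crudest possible code; finding better ones for this kind of channel is where the combinatorial questions come in.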
Related work: Virtual WiFi
- Modern CPUs nowadays have two, four, or even more processing cores. To take advantage of this, the code should be multi-threaded.
- In addition, modern CPUs support specialized assembly instructions (e.g., SSE, AVX) that help deal with certain types of data operations.
- For some specialized tasks, the processing power of a graphics card might even be greater than what the main CPU can achieve.
Projects in this category focus on getting some code running faster. Much faster.
More text here.
Security and spam-control
The net is vast and infinite, and is abundant with malevolence. Viruses, worms, trojan horses, exploits, unsolicited commercial mail (AKA spam) and other black-hat activities corrupt the virtual soil and pose a threat to both individual privacy and online commerce.
Filtering and authentication counter-measures are constantly developed to overcome these but the battle is endless.
Projects in this category focus on fighting one of these nuisances.
IPsec is a suite of protocols for securing IP communications by authenticating and encrypting each IP packet of a data stream. It consists of protocols such as IKE for secure key exchange and ESP for authenticated and encrypted communication.
Many of the VPN implementations used nowadays are based on IPsec.
Projects in this sub-category add some functionality to IPsec.
Fast crash detection in IKE
Sometimes SAs (that is, the bundle of algorithms and parameters being used to encrypt and authenticate a particular flow in one direction) get out of sync, for instance when one side reboots.
Recovery then takes minutes, which is too long.
Here is an internet draft proposing a solution for this.
Your goal is to set up a network simulation (e.g., using VMware or VirtualBox) and implement the draft for crash detection based on existing open-source IPsec implementations.
ESP with NULL-encryption
Sometimes we only need IPsec for authenticating packets but we still want them unencrypted.
A possible scenario for this is when the communication takes place in a subnet (e.g., some company's Intranet) monitored by a strict firewall that only allows packets of a certain type. For instance, it does not allow some applications (e.g., Skype) or data (e.g., transferring confidential files).
Here is an internet draft proposing a modification to ESP that allows using this NULL-encryption mode.
Your goal is to implement the proposed extension to ESP with NULL-encryption and possibly also write a Wireshark add-on to detect it.
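To give a feel for the detection side, here is a rough sketch of a heuristic that guesses whether an ESP packet carries unencrypted (NULL-encrypted) data by looking at the ESP trailer. It is not the mechanism proposed in the draft. It assumes a 12-byte integrity check value (as with HMAC-SHA1-96), no IV, and the default padding bytes 1, 2, 3, ...; the function names and the list of plausible next-header values are hypothetical. An actual Wireshark add-on would be written in Lua or C; Python is used here only to illustrate the logic.

```python
import struct


def parse_esp_header(packet):
    """Split an ESP packet into (spi, sequence_number, rest_of_packet)."""
    spi, seq = struct.unpack("!II", packet[:8])
    return spi, seq, packet[8:]


def looks_like_esp_null(packet, icv_len=12, known_protocols=(6, 17)):
    """Heuristic: does the trailer look like plaintext ESP?

    Checks that the Next Header byte is a plausible protocol number
    (6 = TCP, 17 = UDP) and that the padding is the default 1, 2, 3, ...
    sequence. Encrypted data passes this only by chance, so several
    packets of a flow should be checked before deciding.
    """
    if len(packet) < 8 + 2 + icv_len:
        return False
    _, _, rest = parse_esp_header(packet)
    trailer_end = len(rest) - icv_len  # the ICV is assumed to follow the trailer
    pad_len = rest[trailer_end - 2]
    next_header = rest[trailer_end - 1]
    if next_header not in known_protocols:
        return False
    pad_start = trailer_end - 2 - pad_len
    if pad_start < 0:
        return False
    return rest[pad_start:trailer_end - 2] == bytes(range(1, pad_len + 1))


# Hand-built sample: 5 payload bytes, 1 padding byte, pad length 1,
# next header 6 (TCP), followed by a dummy 12-byte ICV.
plain = struct.pack("!II", 0x1000, 1) + b"hello" + b"\x01" + bytes([1, 6]) + bytes(12)
noise = struct.pack("!II", 0x1000, 2) + bytes([0x9F] * 20)
print(looks_like_esp_null(plain), looks_like_esp_null(noise))  # True False
```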
Human identification (aka CAPTCHAs)
In an attempt to limit spam by disallowing automated registration of email addresses, automated forum posting, etc., we often need to identify whether the user is actually a human or a computer program.
This kind of test is called a Turing test and its variant where questions are generated by a computer is called a CAPTCHA.
Projects in this sub-category strive for better, harder to break human identification schemes.
Human ID service
Instead of having each site admin copy the source of some CAPTCHA and use it to reduce spam on his forum/blog/whatever, implement an embeddable object that offers human identification to sites.
The site owner now only has to add 1-2 lines of code to query the widget whether human ID worked.
Once you get many people using your CAPTCHA service, you could add other types of tests (can't see images? click here for an audio test), add i18n (allow the site owner or the client to choose a language, for example), etc.
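The service side of such a widget could look roughly like the sketch below: the service hands out a challenge together with a signed token, and the site owner later makes a single call to check the user's answer. The class, its methods and the token format are all assumptions made for illustration; a deployed service would also need expiry times, rate limiting and persistent storage.

```python
import hashlib
import hmac
import secrets


class HumanIdService:
    """Toy human-identification service with single-use challenges."""

    def __init__(self, secret=None):
        self._secret = secret or secrets.token_bytes(32)
        self._used = set()  # challenge ids that were already checked

    def _sign(self, challenge_id, answer):
        msg = f"{challenge_id}:{answer.strip().lower()}".encode()
        return hmac.new(self._secret, msg, hashlib.sha256).hexdigest()

    def new_challenge(self, question, answer):
        """Create a challenge; the token binds the id to the right answer."""
        challenge_id = secrets.token_hex(8)
        return {
            "id": challenge_id,
            "question": question,
            "token": self._sign(challenge_id, answer),
        }

    def verify(self, challenge_id, token, user_answer):
        """Check an answer; every challenge can be tried only once."""
        if challenge_id in self._used:
            return False
        self._used.add(challenge_id)
        return hmac.compare_digest(token, self._sign(challenge_id, user_answer))


service = HumanIdService(secret=b"k" * 32)
challenge = service.new_challenge("What kind of animal is this?", "Elephant")

# What the site owner would add to the form handler:
is_human = service.verify(challenge["id"], challenge["token"], " elephant ")
print(is_human)  # True
```

The last two statements are the "1-2 lines" the site owner has to write; everything else lives inside the service.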
A recent story on the topic.
Google Images CAPTCHA
The widget would do the following:
- decide on a topic X, e.g., animals,
- decide on a specific word Y within this topic, e.g., elephant,
- Search Google Images for a random image of a Y,
- Display it to the user and ask "what kind of X is that?"
See also Asirra for a similar implementation.
PAPTCHA (Partially automated…)
Run real Turing tests (conducted by a human) instead of CAPTCHAs (conducted by a machine).
So far computers are embarrassingly bad at pretending to be human. Just try out the currently best program and you can immediately tell that it's a machine.
Where do we get the examiners from? From the same pool of people that we are supposed to check! This, of course, requires a critical mass of users, but such a mass can be achieved by a big service like Google or by a 'Human ID Service' as described above.
The simplest possible implementation is the following:
A user X connects to the site and is faced with a live chat window with another user Y (preferably the chat should really be live, even at the keystroke level, like the old 'talk' program). Each of them should decide whether the other is a human or not.
X passes the test if Y says that X is human.
There are several ways to improve this. First, one can include voice chat, but this might be difficult to implement.
More academically interesting approaches:
- Provide user X with several chat windows, asking him to tag each one as user or machine.
- Perform conference conversations between several users, asking them to reach a majority agreement on who's human and who's not.
- Introduce some machines on purpose to see if the user detects them (to make sure lazy people don't just click 'human' all the time).
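The last two ideas can be combined into a simple decision rule, sketched below: a judge's vote only counts if the judge correctly flagged the machines that were planted on purpose, and a candidate passes if most of the trusted judges call it human. The data layout, the function names and the 80% accuracy cut-off are assumptions for illustration only.

```python
def judge_is_reliable(judge_verdicts, decoys, min_accuracy=0.8):
    """A judge is trusted only if they spotted the planted machines.

    judge_verdicts: {session_id: "human" or "machine"}
    decoys: set of session ids that were machines introduced on purpose
    """
    checked = [s for s in judge_verdicts if s in decoys]
    if not checked:
        return False  # never tested against a decoy, so not trusted yet
    correct = sum(judge_verdicts[s] == "machine" for s in checked)
    return correct / len(checked) >= min_accuracy


def passes(candidate, verdicts_by_judge, decoys):
    """Majority vote among trusted judges who talked to the candidate."""
    votes = [
        verdicts[candidate]
        for verdicts in verdicts_by_judge.values()
        if candidate in verdicts and judge_is_reliable(verdicts, decoys)
    ]
    return bool(votes) and votes.count("human") * 2 > len(votes)


decoys = {"d1", "d2"}
verdicts = {
    "careful1": {"d1": "machine", "d2": "machine", "x": "human"},
    "careful2": {"d1": "machine", "x": "human"},
    "lazy": {"d1": "human", "d2": "human", "x": "human", "bot": "human"},
    "careful3": {"d2": "machine", "bot": "machine"},
}
print(passes("x", verdicts, decoys), passes("bot", verdicts, decoys))  # True False
```

The lazy judge who clicks 'human' on everything is filtered out by the decoys, so their vote does not let the bot through.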
Data. There's so much of it everywhere on the net.
When all you have is a long list of meaningless numbers, it's hard to deduce anything useful.
But when properly presented, usually in a graphic form, patterns in the data are easier to detect.
Projects in this category apply public or self-made visualization APIs to large data sets.
Take a large database of papers from the arXiv (for instance, download all papers under "quant-ph"), and analyze the bibliographic references in it.
You might even use some PageRank-type ideas to understand how meaningful each link is (e.g., a reference cited as part of a long group such as [1,3,8,9,11] is weaker than one cited on its own, and one can also use some keywords around the references).
Once analysis is complete, provide some online interface to visualize this data.
This by itself is a non-trivial task: both algorithmically (what's the best way to position papers) and implementation-wise.
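For the analysis step, a PageRank-style score over a weighted citation graph can be sketched as follows. The weighting rule (a paper cited inside a bracketed group of g references receives weight 1/g from that citation) is one possible reading of the idea above, not an established method, and the function names and toy data are hypothetical.

```python
def citation_weights(citations):
    """Build weighted edges from citation groups.

    citations: list of (citing_paper, [cited_paper, ...]), one entry per
    bracketed reference group found in the citing paper's text.
    """
    weights = {}
    for citing, group in citations:
        for cited in group:
            edges = weights.setdefault(citing, {})
            edges[cited] = edges.get(cited, 0.0) + 1.0 / len(group)
    return weights


def pagerank(weights, damping=0.85, iterations=50):
    """Power iteration for PageRank on a weighted directed graph."""
    nodes = set(weights)
    for targets in weights.values():
        nodes.update(targets)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            out = weights.get(n, {})
            total = sum(out.values())
            if total == 0:
                # Paper with no outgoing citations: spread its rank evenly.
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
            else:
                for m, w in out.items():
                    new[m] += damping * rank[n] * w / total
        rank = new
    return rank


# p1 cites "a" on its own and "b", "c", "d" together; p2 cites only "a".
citations = [("p1", ["a"]), ("p1", ["b", "c", "d"]), ("p2", ["a"])]
rank = pagerank(citation_weights(citations))
for paper in sorted(rank, key=rank.get, reverse=True):
    print(paper, round(rank[paper], 3))
```

Paper "a" ends up with the highest score, since it is cited twice and each time on its own. The resulting scores could then drive node size or position in the visualization.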
Effective Visualization of Traffic Information
Computers on the web continuously communicate with each other. There are some good tools available for monitoring this traffic (Wireshark), but most of them present the information in a way that is very difficult to interpret (log files of a web server, capture files from the network card, etc.). Moreover, this information is often not real-time.
One recent project shows a very impressive and effective visualization of web server traffic. The goal of our project is to develop this idea further, possibly combining it with tools like Pcap or other monitoring tools. A major part of this project is to come up with a good design that effectively conveys the vast amounts of information available.
Your own project!
Remember that these are all just suggestions; it would be best if you came up with better ideas of your own!