Prof. Dr. Weber-Wulff: Bachelor, Master, and IC topics

Prof. Dr. Debora Weber-Wulff	Thesis Topics

What is a bachelor's thesis?

A topic from your internship that you work on, for example the company had a horrible database or XML structure, and you sort it out on your own to show them how it could look.
A topic from your project that you continue, for example, I had someone take the E-Learning unit we produced and make it accessible for blind users.
A new topic that you agree on with an IMI professor.

The following topics, more or less diffuse, are on my current wishlist (and are possible for IC or as the basis for a Master's Thesis area). In general, I am interested in web topics, data mining, privacy, Android programming, plagiarism detection and documentation tools, and E-Learning.

Text Rewriting Classification
There are papers such as this one that say that they can find the amount of text rewriting in an academic paper. I would first like a student (probably a Master's student) to replicate the experiment in this paper using different sets of publications. Then I would like you to look for a better classifier.
Named Entity Recognition with Wikidata
I am fascinated by how difficult Named Entity Recognition is in the German language. One solution that comes to mind is to take a corpus of bi/tri-grams after parsing and to look them up in Wikidata. From there you should be able to determine and classify if they represent names of people, places, things, or are false positives. This would tend to be a Master's thesis.
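The lookup idea could be sketched roughly like this in Python. This is only a minimal illustration, not a proposed solution: the table `FAKE_WIKIDATA` is a hypothetical, hard-coded stand-in for a real Wikidata lookup, the function names are invented for this example, and whitespace splitting takes the place of proper parsing.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a Wikidata lookup: maps a label to a coarse class.
# A real implementation would query Wikidata instead of this hard-coded table.
FAKE_WIKIDATA: Dict[str, str] = {
    "Angela Merkel": "person",
    "Frankfurt am Main": "place",
}


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return all contiguous n-grams of the token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def classify_candidates(tokens: List[str]) -> List[Tuple[str, str]]:
    """Look up every bigram and trigram; n-grams without an entry are dropped."""
    hits = []
    for n in (2, 3):
        for gram in ngrams(tokens, n):
            label = FAKE_WIKIDATA.get(gram)
            if label is not None:
                hits.append((gram, label))
    return hits


sentence = "Angela Merkel besuchte Frankfurt am Main"
print(classify_candidates(sentence.split()))
```

N-grams such as "Merkel besuchte" find no entry and are discarded, which is the false-positive filtering described above. The hard part of the thesis is everything this sketch leaves out: ambiguous labels, inflected forms, and the cost of the lookups.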
Retractions in Wikidata
WikiCite is in the process of dumping lots of citations into Wikidata, but they are not getting the retractions and the expressions of concern marked properly. How can you deal with this and get the retracted articles marked as retracted and the retractions imported as well?
Finding author clusters on PubMed Central
There is much open data available on PubMed Central. Given the name of an author, can you plot the co-author graph? Now pick two authors who are co-authors, what does their co-author graph look like? Can you identify research groups by finding k-cliques?
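To make the graph questions concrete, here is a small Python sketch under stated assumptions: the paper list is made-up sample data (not PubMed Central output), the function names are invented for this example, and "research group" is simplified to "maximal clique with at least k members", found with the basic Bron-Kerbosch algorithm without pivoting. Other readings of "k-clique", such as clique percolation, would need a different approach.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Set


def coauthor_graph(papers: Iterable[List[str]]) -> Dict[str, Set[str]]:
    """Connect every pair of authors who appear together on a paper."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for authors in papers:
        for a, b in combinations(sorted(set(authors)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def maximal_cliques(graph: Dict[str, Set[str]]) -> List[Set[str]]:
    """Enumerate maximal cliques with Bron-Kerbosch (no pivoting)."""
    cliques: List[Set[str]] = []

    def expand(r: Set[str], p: Set[str], x: Set[str]) -> None:
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


def research_groups(papers: Iterable[List[str]], k: int = 3) -> List[Set[str]]:
    """Keep only the maximal cliques with at least k authors."""
    return [c for c in maximal_cliques(coauthor_graph(papers)) if len(c) >= k]


# Made-up author lists, one list per paper.
papers = [["Ada", "Ben", "Cem"], ["Ben", "Cem", "Dia"], ["Dia", "Eva"]]
print(research_groups(papers, k=3))
```

Authors of single-author papers never enter the graph in this sketch, and name disambiguation (is "B. Meyer" the same person on two papers?) is ignored entirely, although it is likely to dominate the real work.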
Knit your VroniPlag Wiki scarf
The VroniPlag Wiki site sports a barcode representing the varying levels of plagiarism on the pages of a dissertation. The barcode is created in JavaScript by accessing Semantic MediaWiki data. There also exist knitting machines that can knit patterns. Can you automatically create scarves knitted to match the patterns of the barcodes (see Knit the Sky)? Can users choose their own colors? Include a random barcode generator? More ideas? Knitting social media tie-in?
Wikidata
There is a list of self-contained projects dealing with Wikidata that I feel would make great Bachelor's or even Master's projects, or an IC. I would be glad to discuss them with you, and send you on to someone at Wikidata if you want to try your hand at them.
SPARQL queries for Wikidata for non-computer scientists
Wikidata has so much data now that querying it with SPARQL can produce amazing results. But there is a steep learning curve involved with SPARQL. Can you make a kind of "google for SPARQL", a simple query interface that produces (and then runs) SPARQL on Wikidata?
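As a very small illustration of the "produce SPARQL" half, a keyword could be mapped to a Wikidata item and dropped into a query template. The sketch below is an assumption-laden toy: `KEYWORD_TO_QID` and `build_query` are hypothetical names invented here, the table knows only two keywords, and the only query shape is "instances of a class" (property P31). It builds the query text but does not send it anywhere.

```python
from typing import Dict

# Hypothetical keyword table; a real tool would search Wikidata for the item.
# Q146 is "house cat" and Q5 is "human" on Wikidata.
KEYWORD_TO_QID: Dict[str, str] = {"cat": "Q146", "human": "Q5"}


def build_query(keyword: str, limit: int = 10) -> str:
    """Build a SPARQL query listing items that are instances of the keyword's class."""
    qid = KEYWORD_TO_QID[keyword.strip().lower()]  # KeyError for unknown keywords
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f"  ?item wdt:P31 wd:{qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


print(build_query("cat", limit=5))
```

The printed text can be pasted into the Wikidata Query Service to run it. The interesting thesis questions start where this template ends: how to go from free-form user input to the right items and properties, and how to explain the generated query back to the user.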
Discovering Patterns in Editorial Boards
Journals have editorial boards posted on their web sites; sometimes there are 100 names and affiliations listed for just one journal. Some of the names are fabricated and appear on multiple boards, sometimes the names are slightly different, and sometimes the people are real but do not know that they are listed on these boards. I would like to have a data mining tool that identifies potential board member lists online, extracts names and affiliations, and then attempts to discover patterns and connections using machine learning algorithms. Visualizations of the database with Graphviz would be a very cool plus.
Curriculum-based OER guide
In Germany each state publishes Rahmenlehrpläne, curricula for each subject, grade, and school type that list the learning goals that are to be achieved during that year. These documents tend to be wordy PDFs that also include long tables. How could these be (easily) transformed into a navigational structure for a Semantic MediaWiki that lets a teacher quickly find materials that are available online and determine their licensing model in order to see if they are usable?
Weka Miner
We are experimenting with the Weka Miner in Semantic Modelling. I would like someone to use the Weka Miner on the 400 billion results that I have from a collusion detection search. Can you come up with a classification that identifies the positives, i.e. the plagiarism pairs and clusters?
Citation Miner
A previous master's thesis looked into the extraction of citation information from scientific papers. I would like to apply this technique to a largish set of papers from different fields in order to train a machine learning system how to recognize citations. Then it would be fascinating to run this on a larger set of unknown papers. It would also be interesting to measure how the algorithm fares on dissertations (these are much larger than papers). I would also like to have frequency distributions of the citations prepared in order to identify the unused references ("garnish references") and the most frequent ones.
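The frequency distribution part is the easy end of this topic and could look roughly like the Python sketch below. It assumes, purely for illustration, that citations appear as numeric markers such as [1] or [1, 3] and that the reference list is numbered; the function names and sample text are invented, and ranges such as [2-4] or author-year styles are not handled.

```python
import re
from collections import Counter
from typing import Iterable, List

# Matches numeric markers such as [1] or [1, 3]; other styles are ignored.
CITATION = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")


def citation_counts(body: str, reference_keys: Iterable[int]) -> Counter:
    """Count how often each numbered reference is cited in the body text."""
    counts = Counter({key: 0 for key in reference_keys})
    for match in CITATION.finditer(body):
        for number in match.group(1).split(","):
            key = int(number)
            if key in counts:  # markers without a reference entry are skipped
                counts[key] += 1
    return counts


def unused_references(counts: Counter) -> List[int]:
    """References that are listed but never cited ("garnish references")."""
    return sorted(key for key, n in counts.items() if n == 0)


# Invented sample text with a four-entry reference list.
body = "Plagiarism is common [1]. Detection is hard [1, 3]. See also [3]."
counts = citation_counts(body, reference_keys=[1, 2, 3, 4])
print(counts.most_common())
print(unused_references(counts))
```

Here references 2 and 4 are never cited in the body and would be reported as garnish references. Recognizing the citations reliably across fields and citation styles is the machine learning problem that the regular expression in this sketch simply assumes away.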
SIM_TEXTer-PlusPlus
Recently a student has managed to get the text comparison tool SIM_TEXT, which does a wonderful job of comparing two text files and highlighting the text similarities, working as an online tool. There are some fiddly bits that need extending, for example, dealing with multiple copies of text portions, doing a self-comparison, and inputting PDF (can we use Apache Tika, for example?). And while we are at it, I'd like to have the new/old-directory comparison working under a nice GUI. The tool should enable a teacher to set up their own databases = directories of student papers (such as lab reports) on a per class basis and then compare new lab reports with all the old ones for the same class. The software needs to be browser-based open software, so that it can be used in schools (who don't have money for expensive software or other systems). All data must be stored only locally on a teacher's computer.
Java API Finder for beginners
The Java API has all the information you need in it, but it is a horrible mess, overloaded with information that confuses a beginner. And you navigate it by using Google with hopefully fitting search terms. Could there be a better way for beginners? Is it possible to set up a better search and navigation system that you don't have to throw away when the next version of the API appears?
Collusion Finder
The general case of finding plagiarisms on the open internet is a difficult one, but finding collusions (multiple students submitting the same or similar paper) is somewhat easier. There are a number of systems available, but none are easy to use. There is an open algorithm that is relatively good at finding common parts, but quite difficult to use and interpret. I would like to have a good interface for this system, and some additional bells and whistles. The software needs to be open software, so that it can be used in schools (who don't have money for systems like this). And it should be possible to integrate it into Moodle, so that all of the solutions to one exercise can be compared to all others.
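The all-against-all comparison for one exercise can be illustrated with a few lines of Python. Note that this sketch does not use the open algorithm mentioned above; it swaps in `difflib.SequenceMatcher` from the Python standard library as a simple stand-in, works on whitespace-separated words, and uses an arbitrary threshold of 0.8. The submission texts and function names are invented for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Tuple


def similarity(a: str, b: str) -> float:
    """Word-level similarity between two texts, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def suspicious_pairs(
    submissions: Dict[str, str], threshold: float = 0.8
) -> List[Tuple[str, str, float]]:
    """Compare every submission with every other one and keep the similar pairs."""
    pairs = []
    for (name_a, text_a), (name_b, text_b) in combinations(sorted(submissions.items()), 2):
        score = similarity(text_a, text_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    # Most similar pairs first, so the teacher sees the worst cases at the top.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Invented submissions for one exercise.
submissions = {
    "alice": "the quick brown fox jumps over the lazy dog",
    "bob": "the quick brown fox leaps over the lazy dog",
    "carol": "lab report about sorting algorithms and their runtime",
}
for name_a, name_b, score in suspicious_pairs(submissions):
    print(f"{name_a} / {name_b}: {score:.2f}")
```

With n submissions this makes n(n-1)/2 comparisons, which is fine for one class but is exactly the kind of thing that needs thought once old submissions from previous semesters are included.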
VroniPlag Wiki Report Generator
I currently produce PDF reports for VroniPlag Wiki semi-automatically with a lot of sleight of hand. I would like to have a system that takes material from a Semantic MediaWiki and produces both a colored PDF as well as LaTeX code that can produce the report. There are a number of experiments that have started, but a user-ready tool has still not emerged. Any takers?
Open Research Collaboration Environment
There are so many tools out there. Each one is great for one small thing, but for helping a research group collaborate you need to bolt on so many other things: a wiki, a private file server, an Etherpad for incubating information, a chat for real-time communication, a ticket system, a to-do list, a references database. Is it too much to dream of an integrated system that can do all this and NOT depend on servers in the USA or proprietary tools?
SlideSync
MediaEvent Services is currently developing SlideSync, a self-service platform for streaming live presentations on the web, which is already in use by clients such as Lufthansa. Their toolset includes Ruby on Rails, jQuery, Bootstrap, Amazon Web Services as well as Wowza Media Server, and exemplary thesis topics could relate to server-side video and slide processing, analytics, realtime interaction, mobile devices and scalable server infrastructure. They use an agile development process (Scrum, Gerrit and Cucumber). Christian Becker (phone +49 6441 87087-22, email c.becker@mes-info.de) would be happy to meet with interested students at HTW or their office in Moabit to discuss the topic further. I will be glad to be the HTW advisor for this topic.
Blog Publisher
My blog, Copy, Paste & Shake, has been given an ISSN number. I would like to have a theme that displays the Volume and Number for each issue = month, and then produces a proper printable version with page numbers, etc., for submission to the German National Library. The publisher should produce PDF or LaTeX code, and be adaptable for both WordPress and Google blogs.
Group Commented Bibliography
There are lots of tools out there that let an individual keep a bibliography with information about literature. But there is no good, free group bibliography, much less one that will accept comments from persons on the materials.
Plagiarism in Journalism
Journalists seem to be constantly taking text from each other. Can this be visualized? Can you identify the parts of an online article that come from some other online article that is older and can you map this to some sort of timeline? This might let us see how many copies are actually influenced by one important article.

Real Old Stuff

Visualizations
With all this data floating around, much of it open, we are now able to produce visualizations that suggest new information or connections to other people. How can we automate the production of such visualizations? Flash is dead, so this has to be in HTML5. There are numerous possible theses here.
The Wikipedia Admin Game
The admins in the German Wikipedia see themselves as warriors, fighting against the evil bands of trolls that try to deface the Wikipedia. In this game, you are an admin who is trying to save the Wikipedia from fake edits, edit wars, teenagers, and the like. School lets out, and you have to be on your guard, as the bored kids fire up their computers and see what they can deface today. Are you quick enough to keep the Wikipedia running?
Wikipedia-Version-Tool
I need a tool that will display for me what the Wikipedia entry (and all links I follow) was on a particular date at a particular time, a sort of Way-Back machine for the Wikipedia. This will help me so that when people say "The Wikipedia said thus and such on day X" I can check it out. You will use the history data which is available to determine what the page actually displayed on that day. When I follow links with your tool I will see what the page linked to.
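The core lookup, "which revision was live at that moment", could be sketched as below. This is only an illustration under assumptions: the revision list is invented sample data (a real tool would fetch it from the page history), the function name is hypothetical, and the list is assumed to be sorted by timestamp in ascending order.

```python
from bisect import bisect_right
from datetime import datetime, timezone
from typing import List, Optional, Tuple


def revision_at(
    revisions: List[Tuple[datetime, int]], when: datetime
) -> Optional[int]:
    """Return the id of the newest revision saved at or before `when`.

    `revisions` must be sorted by timestamp, oldest first. Returns None if
    the page did not exist yet at that time.
    """
    timestamps = [timestamp for timestamp, _ in revisions]
    index = bisect_right(timestamps, when)
    if index == 0:
        return None
    return revisions[index - 1][1]


UTC = timezone.utc
# Invented history: (time the revision was saved, revision id).
history = [
    (datetime(2010, 1, 1, 9, 0, tzinfo=UTC), 101),
    (datetime(2010, 6, 1, 14, 30, tzinfo=UTC), 102),
    (datetime(2011, 1, 1, 8, 15, tzinfo=UTC), 103),
]
print(revision_at(history, datetime(2010, 7, 15, 12, 0, tzinfo=UTC)))
```

The same lookup has to be repeated for every link the user follows, using the linked page's own history and the same point in time, so that the whole browsing session stays frozen at day X.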
Moodle Book
Moodle Book is quite a good authoring system, but there are problems. Some have been determined in a thesis done last semester, others are my personal pet peeves. So I want you to upgrade the Moodle Book module to make it more useful for real-life E-Learning! For example, all subchapters should remain collapsed until their chapter is opened, instead of all subchapters being open at the same time. This will involve using PHP and CSS.
SCORM Test
This is the standard for import and export of E-Learning materials. Except that it doesn't work. Everyone can import the SCORM that *they* export, but not necessarily the work of others. In this thesis I want the author to take a number of different kinds of E-Learning materials, export them from CLIX, Blackboard, Moodle, etc. and import them in other systems. I want a module for Moodle to be set up that can cope with all of the different SCORM problems.

Really old stuff

Wikipedia User Survey
The Wikimedia Research Network wants to conduct a user survey. I would like a thesis to conduct a pilot survey, constructing the survey software which will have to be easily localizable in many languages. The Wikimedia Research Network Privacy Policy must be respected during this work.
Diff for InDesign
In Wikis or in programming it is trivial to find the difference between two texts and report on the difference. It is also easy to record the history of edits and set up "reverts". InDesign also has a history, but I am not aware of a possibility to analyze the difference between two projects, save snapshots of the history, and revert changes back to a checkpoint. I would like for someone to investigate making a diff for InDesign (or Photoshop).
Wikipedia
There are any number of enhancements possible for Wikipedia, perhaps working through a tagging mechanism for articles. You must be willing to have your results subject to the GNU General Public License.
Wiki RSS-Reader
One can set a watchlist for single pages in the Wikipedia, and other Wikis offer a "last changes" RSS feed. But I have to visit all of these Wikis in order to see if anything I am interested in has changed. I would like to have a tool, similar to Awasu, that sucks in the RSS feeds, lets me know what is new, and if I click on a link, puts me in direct contact with that particular Wiki. I want this tool to work with many different kinds of Wikis, and to cope with all sorts of RSS descriptions.
Wikis for Teaching
Wikis seem to me to be an ideal basis for constructing a learning management system. Can a synchronous service such as Chat systems in the Flash Communication Server be integrated into a Wiki-Collaboration and Annotation system? What does a Wiki for teaching really need?
What else interests me? E-Learning, Web 2.0, social computing, gender questions, e-voting, privacy.

I only take about 6-8 thesis students, as I want to have the time to advise you properly. Please ask as soon as possible if you are interested in any of the topics. I will also sign for 3-4 external projects, but cannot offer to advise you. I will now stop taking reservations; it is first come, first served. I find it irritating when people "reserve" a position, and then give it up at the last minute. I pay for this by having to teach an extra class every few semesters to fill up the negative teaching hours I get assigned if I do not take enough students.

Students who do their thesis with me should expect to come to my office every week with results from the previous week. We will spend time in a group on questions, and we will discuss in the group what you bring with you for us to read. It is hard work, but tends to bring results, i.e. a finished thesis inside of the deadline.

I have some links about topics having to do with writing a thesis.

Last change: 2020-02-19 12:00