User talk:Dbroadwell/php
link(b)ot isn't User:LinkBot
Link(b)ot is a lightweight outgoing link suggester for articles on Wikipedia. With future interfaces it should be able to run on any of the named Wikimedia wikis and be localized to any other non-named Wikimedia wiki. This all started because I posted 'Linkbot Procedure Suggestion' with some changes to make Linkbot easier to use, and got told the customary {{good idea}}, [[Be Bold]] and {{sofixit}}. As it stands, including the {{User:dbroadwell/linkthis}} template merely gets you on the list, but as I continue to test and modify, those pages will be processed. Source for the project is available at [1], though it is admittedly behind current development. One could call this a Linkbot fork for the time being. Alpha testing will begin soon on the template interface.
How do I?
To get this implementation of link(b)ot to see your pages, include the template {{user:dbroadwell/linkthis}} on a page to add it to the to-do list.
Shortfalls
- No handling of grammar, plurals, etc.
- Only outbound links
- Unsorted suggestions
- No user notification (adding it would cost one more hit per article)
- Multiple server hits per suggestion
Server hit breakdown
- +2 per run to verify the template and parse links
- +4 server hits per article
- 2 to edit and save the page with the new template
- 2 to edit and save the suggestions
So a run over N articles costs roughly 4N + 2 server hits.
Linkbot Procedure Suggestion
I hope this isn't TOO radical, but I see a rather simplistic front end for linkbot. The suggestion of {{linkthis}} is JUST a suggestion. The category in the template is just an idea too ... A rough PHP sketch of the polling step follows the list.
- Editor saves page with {{User:dbroadwell/linkthis}} tag that indicates it needs to be parsed.
- On the next 'parse' run for linkbot, have it poll 'what links here' for the {{User:dbroadwell/linkthis}} template into a list of pages. (every hour? every 15 minutes? ...)
- For every page in 'targetlist':
- parse page to build suggestions for target
- save changes to talk:target
- clear {{User:dbroadwell/linkthis}} tag from target page
- add a note to the talk page of the user who requested {{User:dbroadwell/linkthis}}
- Log page was parsed by linkbot
- At end of run:
- move old entries from /todaylog to /archivelog
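For the polling step, here is a minimal PHP sketch; the Special:Whatlinkshere URL shape and the screen-scraping regex are assumptions for illustration, not working LinkBot code.
<?php
// Poll 'what links here' for the trigger template and build the target list.
// URL shape and regex are assumptions; a real run should verify both.
$template = 'User:dbroadwell/linkthis';
$url = 'http://en.wikipedia.org/w/index.php?title=Special:Whatlinkshere/'
     . urlencode($template) . '&limit=500';
$html = file_get_contents($url);

// Pull article titles out of the rendered list of links.
preg_match_all('|<li><a href="/wiki/([^"#:]+)"|', $html, $matches);
$targetlist = array_map('urldecode', $matches[1]);

foreach ($targetlist as $target) {
    // steps 3-6: parse the page, save suggestions to its talk page,
    // clear the template, note the requesting user, and log
}
?>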
Templates
- Trigger template [2]
- Post Template [3]
The trigger template's 'what links here' list is processed, and at the end the trigger template is swapped for the post template. As yet there is no log (other than page history) and no notification aside from the template swap, so watch the page.
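The swap itself could be as small as one str_replace; a minimal sketch, where {{User:dbroadwell/linkdone}} is a hypothetical name for the post template at [3]:
<?php
// Swap the trigger template for the post template in the page text.
// 'linkdone' is a hypothetical name; the real post template is at [3].
$pagetext = str_replace('{{User:dbroadwell/linkthis}}',
                        '{{User:dbroadwell/linkdone}}',
                        $pagetext);
?>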
LinkBot On Demand?
I was spouting some ideas on linking in #wikipedia and JRM asked me to write them up. So I did ... but in being bold, I put them on the LinkBot FAQ Page. -- Dbroadwell 02:12, 1 Mar 2005 (UTC)
- I'm reading up on PHP and already have a list on my hands of stuff to do. But we have the outline; if we flesh it out to pseudocode, can the mailing list help? I.e., some of the 'Bot on Demand' work I think I might be able to do, but mostly not (without using Python). I am interested, but am green at PHP with no one up the line to ask should I get stuck. Plus I have a bit of a handicap from BASIC being my first language, in that I'm a very procedural coder. I realise it's still going to have to have a dump to compare against and therefore won't see newly created articles to link against, but that can be fixed by updating, and an old one will do for now. I see the time getting linkbot running well as equal to the time I'd spend manually searching for every link over the articles I write in. And not meaning to complain, but my PHP development environment is missing things ... like being able to run the code.
How long does it take to generate a 'linked' example for one article? -- Dbroadwell 14:28, 11 Mar 2005 (UTC)
- I've been brainstorming and chatting; there are a few minor changes to the flow of my suggestion. As User:r3m0t pointed out to me, putting the template in the main namespace is unnecessary, and he suggested something like {{User:Linkbot/linkthis}} instead, which is far more explicit than just {{linkthis}} and still yields a parseable list of articles to do. So the first step for me is parsing that list. Yeah, I'm interested, and have found PHP to be grokable ... at least on a basic loopy level. No jumping through code hoops for me, more like beating on walls with a code hammer until hole achieved. But consider me working on:
'The other aspects of the suggestion would need to be implemented (clear the linkthis tag from target page, add note to user's talk page, Log page was parsed by linkbot, and move old entries from /todaylog to /archivelog).' Nickj (t) 00:03, 12 Mar 2005 (UTC)
The further things about memory and CSV I will address later. I'm going to skip the 'nice' logging to the requester's page, and am looking at just swapping templates from a todo to a done. But I do have some questions. Implied in what you have said so far, and correct me if I'm wrong, is that you are loading and parsing the 800mb database dump in that first 30 minutes. Can you give me an algorithm overview? What format does linkbot need the 'todo list' in? I guess I'm getting close to needing to see the code to patch it ... which I am trying to avoid needing as of yet. (I have a wiki to test on from a friend of mine, which has a very small DB as of yet ... so I can run test cases faster than on a Wikipedian one.) -- Dbroadwell 15:57, 14 Mar 2005 (UTC)
Re: LinkBot On Demand?
I'm alive, definitely alive, just slow! ;-) The suggestions are good, but these are the practical problems I see with them:
- Resources (this is the main one):
- I've been too busy with work to do much of anything with the link suggester since December last year. So any new features or changes would really need to be implemented by someone else. Would you be interested?
- Technical (i.e. how it is now):
- Currently, the script runs by hand whenever I get the opportunity, based on a database dump. Those dumps are supposed to be produced every week, but in reality this year they have sometimes had gaps of up to 6 weeks between successive dumps, and the dumps don't have any easily predictable release schedule. So the quickest the suggester can run (under the current scheme of getting all data from the database dumps) is once per week in a best-case scenario, or maybe once every 2 or 3 weeks in an average-case scenario.
- When it runs, all the suggestions, both to and from every page, get generated at once, in a process that takes some 50 or 60 hours. (i.e. This process is done for the entire Wikipedia, not just for one article).
- The script currently doesn't poll any data from the web, it only gets a database dump, so polling the what-links-here page for the linkthis template would have to be added, as well as getting the latest version of a page off the web for pages that use the linkthis template.
- The other aspects of the suggestion would need to be implemented (clear the linkthis tag from target page, add note to user's talk page, Log page was parsed by linkbot, and move old entries from /todaylog to /archivelog).
Of the above, though, resources are the primary issue. What do you reckon? Are you up for it? -- All the best, Nickj (t) 06:23, 11 Mar 2005 (UTC)
- I'm a very procedural coder.
- It's very procedural code. I don't think there's any object-oriented stuff in it, no threading, etc. Basically, it starts, follows a series of instructions from start to finish, and then finishes.
- How long does it take to generate a 'linked' example for one article?
- It takes roughly a second, once it is running. The trick, though, is that building the index in memory to get it running takes ages (e.g. more than 30 minutes). This is because people want the suggested links to point to the real articles (not to redirects), and because checking every word and phrase in an article using database queries would be slower than checking memory. It could be changed to use just database queries, rather than a memory index, although each article would then take longer than a second (though not as slow as 30 minutes!). Basically, if you're going to be doing lots of articles, then it's worth building a memory index, but if you're only going to be doing one, then it's probably not.
- Oh, Jamesday says the latest database dump is from about March 9th or so.
- Yep, that's the one I'm using now. If you're chatting to him on IRC, can you maybe ask him politely if there's some way we could do a test run for loading link suggestions from LinkBot via a CSV file? I asked him on his talk page about this previously, but with all the dramas that the Wikipedia has had in the past few months (new MediaWiki versions, power outages, new hardware, etc) it's completely understandable that he hasn't had time to get back to me about it.
-- All the best, Nickj (t) 00:03, 12 Mar 2005 (UTC)
Questions
- Exactly how does linkbot log now?
- Well, first the link suggester runs (taking ages), saving all the suggestions into a local database table, structured like so (with my explanation of each column added):
mysql> desc suggested_link;
+---------------------+------------------+------+-----+---------+----------------+
| Field               | Type             | Null | Key | Default | Extra          |
+---------------------+------------------+------+-----+---------+----------------+
| suggested_link_id   | int(10) unsigned |      | PRI | NULL    | auto_increment |  (Just a unique key)
| source_link         | varchar(250)     |      | MUL |         |                |  (The title of the article the suggestions are for)
| before_text         | varchar(255)     |      |     |         |                |  (A bit of context before the suggestion)
| dest_link           | varchar(250)     |      | MUL |         |                |  (What we suggest linking to)
| dest_label          | varchar(255)     |      |     |         |                |  (The bit of text we suggest linking)
| after_text          | varchar(255)     |      |     |         |                |  (A bit of context after the suggestion)
| found               | timestamp(14)    | YES  |     | NULL    |                |  (Time this suggestion was found by the suggester)
| save_to_dest_time   | timestamp(14)    | YES  |     | NULL    |                |  (Time suggestion saved to Wikipedia as an inbound link)
| save_to_source_time | timestamp(14)    | YES  |     | NULL    |                |  (Time suggestion saved to Wikipedia as an outbound link)
| section             | varchar(255)     |      |     |         |                |  (The section, e.g. "External Links", of suggested link)
+---------------------+------------------+------+-----+---------+----------------+
10 rows in set (0.00 sec)
- Then, after all the suggestions are saved, the LinkBot can be told to upload suggestions (e.g. upload suggestions for 100 articles). It just queries the above table, formats the output nicely, and saves that to the Wikipedia.
- Instead of removing the message, how about changing it?
- What format does linkbot need the 'todo list' in?
- At the moment it doesn't take a todo list; rather, it just runs for a specified number of articles (e.g. 100 articles), and pulls them out in alphabetical order (I think), so most or all of the small number of articles done so far start with 'A'.
- I know that the all-titles dump is unsuitable, but are you doing a select from cur where redirect=0 in the first half hour to get the list of real articles?
- The setup phase queries namespace = 0 to build an index in memory. It queries redirects to find what they point to, as well as real articles, and stores all of this in memory. It also stores partial phrases in a separate index. So, to use your example, if there were an article called "2004 Presidential Election", then it would store that there was an article called "2004 Presidential Election" in one index, and in the other it would store that there were (say) 10 articles starting with "2004", (say) 2 articles starting with "2004 Presidential", and one article starting with "2004 Presidential Election". That makes linking phrases much easier, because you can tell whether you should keep examining whether a phrase matches or not. (A sketch of this two-index scheme follows at the end of this section.)
- Any reason you can't save that to a smaller (less than 4.7mb) database of just article hashes?
- I can't see any reason why that wouldn't work. I'm just using the cur database dump because that way I get the article titles and the article text all in one table, and the full link suggester uses both of these bits of information. However, from the point of view of making a quicker-running outward-links-only version, the article titles are probably all that's needed.
- Can you point me to a historical example of linkbot's output? (I've never seen one)
- Probably have a look through Special:Contributions/LinkBot. -- All the best, Nickj (t) 01:37, 15 Mar 2005 (UTC)
- I'm going to skip the parsing out and logging to the user page; it's more investigation than I want to do at the moment for a nicety.
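To make the two-index scheme described above concrete, here is a minimal PHP sketch; the structure follows Nickj's answer, but the code is a guess, not LinkBot's actual implementation:
<?php
// Build the two in-memory indexes: full titles, and counts of titles
// starting with each word prefix. $titles would come from the namespace-0
// rows of the database dump, with redirects resolved first.
$articleIndex = array();  // e.g. '2004 presidential election' => true
$prefixIndex  = array();  // e.g. '2004' => 10, '2004 presidential' => 2

foreach ($titles as $title) {
    $articleIndex[strtolower($title)] = true;
    $prefix = '';
    foreach (explode(' ', strtolower($title)) as $word) {
        $prefix = ($prefix === '') ? $word : "$prefix $word";
        if (!isset($prefixIndex[$prefix])) {
            $prefixIndex[$prefix] = 0;
        }
        $prefixIndex[$prefix]++;
    }
}

// While scanning article text: keep extending the candidate phrase only
// while $prefixIndex contains it; suggest a link whenever the full phrase
// appears in $articleIndex.
?>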
link(b)ot interfaces
- template: designed, coded, awaiting testing
- todofile: designed, coded, awaiting testing
- webform: designing
- logging: designing (including user notification)
Notes: I don't actually remove the articles from articletodo, so that at the end I can log once per run instead of once per $article.
All subs except the full flow are implemented.
Need to install cURL and get cookie handling going to start testing.
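For the cookie part, a minimal cURL login sketch; the Special:Userlogin URL and the wpName/wpPassword field names are assumptions about the 2005-era login form and should be verified:
<?php
// Log in once, keeping the session cookie for later edit/save requests.
// URL and form field names are assumptions, not verified against the site.
$cookiejar = '/tmp/linkbot.cookies';
$ch = curl_init('http://en.wikipedia.org/w/index.php?title=Special:Userlogin&action=submitlogin');
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookiejar);   // write cookies here
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookiejar);  // send them back later
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, 'wpName=LinkBot&wpPassword=secret');
$response = curl_exec($ch);
curl_close($ch);
?>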
flow for template interface
- check for 'go' condition template (strlen(gettext($article))); see the PHP skeleton after this list
- process links to articletodo (regex strip)
- open all titles database (filetoarray)
- foreach articletodo
- create suggestions
- create summary
- edit page
- swap template
- post new page
- sleep(4) // be nice to servers
- edit suggestions page
- format suggestions
- post suggestions
- sleep(4) // be nice to servers
// do logging
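A PHP skeleton of the flow above; every helper name here is illustrative only (none of these functions exist yet), and only the structure mirrors the outline:
<?php
// Template interface: process every article that transcluded the trigger.
$articletodo = parse_template_links(gettext($templatepage)); // 'go' check + regex strip
$alltitles   = filetoarray('alltitles.txt');                 // all-titles database

foreach ($articletodo as $article) {
    $suggestions = create_suggestions($article, $alltitles);
    $summary     = create_summary($suggestions);

    post_new_page($article, swap_template(gettext($article)), $summary);
    sleep(4);  // be nice to servers

    post_suggestions($article, format_suggestions($suggestions));
    sleep(4);  // be nice to servers
}
// log once per run; articles stay in $articletodo (see Notes above)
?>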
Flow for todofile interface
- open articletodo (filetoarray; this helper is sketched after the list)
- open all titles database (filetoarray)
- foreach articletodo
- create suggestions
- create summary
- edit page
- swap template
- post new page
- sleep(4) // be nice to servers
- edit suggestions page
- format suggestions
- post suggestions
- sleep(4) // be nice to servers
// do logging
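Both flows lean on filetoarray; one plausible definition, assuming it is just a trimming wrapper over PHP's file():
<?php
// Read a file into an array of trimmed, non-empty lines. This is an
// assumption about what filetoarray does; the real helper may differ.
function filetoarray($filename) {
    $out = array();
    foreach (file($filename) as $line) {
        $line = trim($line);
        if ($line !== '') {
            $out[] = $line;
        }
    }
    return $out;
}
?>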
flow for webform interface
- if articlename, query server for text (a sketch of this form follows the list)
- if text
- open all titles database (filetoarray)
- create suggestions
- format suggestions
- display suggestions
// how to log is a big question here
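A bare-bones version of that webform, leaving the logging question open; query_server_for_text and the other helpers are the same hypothetical names used in the flows above:
<?php
// Webform interface: one article name in, formatted suggestions out.
if (!empty($_GET['articlename'])) {
    $text = query_server_for_text($_GET['articlename']); // hypothetical fetch
    if ($text) {
        $alltitles   = filetoarray('alltitles.txt');
        $suggestions = create_suggestions($text, $alltitles);
        echo format_suggestions($suggestions);
        // how to log is still the big open question here
    }
}
?>
<form method="get">
  <input name="articlename">
  <input type="submit" value="Suggest links">
</form>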
alpha test
- I'm also very close to my alpha test for this implementation.
Sounds good! Do you want/need the LinkBot source to do this? I'm happy to put it under the GPL and send it to you. However I do want to tidy it up a bit (even tidied up it'll still be a bit messy, since it has to backtrack sometimes, due to trying to make the longest links possible). Also I'll be interstate for work on Wed / Thurs / Friday - so probably the earliest I can get it to you is Monday or Tuesday of next week. -- All the best, Nickj (t) 07:24, 29 Mar 2005 (UTC)
- OK, I seriously underestimated how long that would take! However, the source code for LinkBot is now available. It's a bit of a mess, sorry! And sorry to take so long to get back to you; I've been insanely overloaded with work. All the best, Nickj (t) 02:27, 27 Jun 2005 (UTC)