Bug 383400 - MediaWikiApiImageFetchingStrategy.java does not fetch all images
Summary: MediaWikiApiImageFetchingStrategy.java does not fetch all images
Status: RESOLVED FIXED
Alias: None
Product: z_Archived
Classification: Eclipse Foundation
Component: Mylyn (show other bugs)
Version: 1.7   Edit
Hardware: All All
: P3 normal (vote)
Target Milestone: 2.0   Edit
Assignee: Carsten Hammer CLA
QA Contact:
URL:
Whiteboard:
Keywords: contributed
: 418964 (view as bug list)
Depends on: 418964
Blocks:
  Show dependency tree
 
Reported: 2012-06-25 03:50 EDT by Carsten Hammer CLA
Modified: 2013-11-18 10:25 EST (History)
3 users (show)

See Also:


Attachments
change to take into account review (2.62 KB, patch)
2012-07-09 18:22 EDT, Carsten Hammer CLA
no flags Details | Diff
mylyn/context/zip (9.62 KB, application/octet-stream)
2012-07-09 18:22 EDT, Carsten Hammer CLA
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Carsten Hammer CLA 2012-06-25 03:50:03 EDT
Build Identifier: 1.7.0-SNAPSHOT

The mediawiki api limits the amount of entries sent back. You have to query several times to get all files. This is currently not implemented. I provide an implementation:

/*******************************************************************************
 * Copyright (c) 2007, 2010 David Green and others.
 * All rights reserved. This program and the accompanying materials
 * are made available under the terms of the Eclipse Public License v1.0
 * which accompanies this distribution, and is available at
 * http://www.eclipse.org/legal/epl-v10.html
 *
 * Contributors:
 *     David Green - initial API and implementation
 *******************************************************************************/

package org.eclipse.mylyn.internal.wikitext.mediawiki.core.tasks;

import java.io.BufferedInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.MalformedURLException;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.apache.tools.ant.BuildException;
import org.apache.tools.ant.Project;
import org.apache.tools.ant.taskdefs.Get;
import org.eclipse.mylyn.wikitext.core.util.IgnoreDtdEntityResolver;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

class MediaWikiApiImageFetchingStrategy extends ImageFetchingStrategy {

	private final Pattern imageTitlePattern = Pattern.compile("(?:Image|File):(.+)"); //$NON-NLS-1$

	private URL url;

	private String pageName;

	@Override
	public Set<String> fetchImages() {
		if (pageName == null || pageName.length() == 0) {
			throw new BuildException("please specify @pageName"); //$NON-NLS-1$
		}
		if (!pageName.equals(pageName.trim())) {
			throw new BuildException("@pageName must not have leading or trailing whitespace"); //$NON-NLS-1$
		}

		String base;
		try {
			base = url.toURI().toString();
		} catch (URISyntaxException e) {
			throw new BuildException(e);
		}
		if (!base.endsWith("/")) { //$NON-NLS-1$
			base += "/"; //$NON-NLS-1$
		}

		ImageFetchingContentHandler contentHandler = new ImageFetchingContentHandler();
		String gimcontinue = null;
		Set<String> filenames = new HashSet<String>();
		final SAXParserFactory parserFactory = SAXParserFactory.newInstance();
		parserFactory.setNamespaceAware(true);
		parserFactory.setValidating(false);
		do {
			contentHandler.setGimcontinue(null);
			URL apiUrl;
			try {
				String queryString = String.format(
						"action=query&titles=%s&generator=images&prop=imageinfo&iiprop=url&format=xml%s", //$NON-NLS-1$
						URLEncoder.encode(pageName, "UTF-8"), (gimcontinue == null ? "" : "&gimcontinue=" + URLEncoder.encode(gimcontinue, "UTF-8"))); //$NON-NLS-1$
				apiUrl = new URL(base + "api.php?" + queryString); //$NON-NLS-1$
			} catch (Exception e) {
				throw new BuildException("Cannot compose API URL", e); //$NON-NLS-1$
			}

			Reader input;
			try {
				input = new InputStreamReader(new BufferedInputStream(apiUrl.openStream()), "UTF-8"); //$NON-NLS-1$
			} catch (IOException e) {
				throw new BuildException(String.format("Cannot contact %s: %s", apiUrl, e.getMessage()), e); //$NON-NLS-1$
			}

			try {
				SAXParser saxParser = parserFactory.newSAXParser();
				XMLReader xmlReader = saxParser.getXMLReader();
				xmlReader.setEntityResolver(IgnoreDtdEntityResolver.getInstance());

				xmlReader.setContentHandler(contentHandler);

				try {
					xmlReader.parse(new InputSource(input));
					gimcontinue = contentHandler.getGimcontinue();
				} catch (IOException e) {
					throw new BuildException(String.format("Unexpected exception retrieving data from %s", apiUrl), e); //$NON-NLS-1$
				} finally {
					try {
						input.close();
					} catch (IOException e) {
						// ignore
					}
				}
			} catch (SAXException e) {
				throw new BuildException("Unexpected error in XML content", e); //$NON-NLS-1$
			} catch (ParserConfigurationException e) {
				throw new BuildException("Cannot configure SAX parser", e); //$NON-NLS-1$
			}

		} while (gimcontinue != null);
		int fileCount = 0;
		for (Map.Entry<String, String> ent : contentHandler.imageTitleToUrl.entrySet()) {
			String title = ent.getKey();
			String imageUrl = ent.getValue();
			Matcher titleMatcher = imageTitlePattern.matcher(title);
			if (titleMatcher.matches()) {
				String name = titleMatcher.group(1);
				name = name.replace(' ', '_');
				String qualifiedUrl = base;
				if (imageUrl.matches("https?://.*")) { //$NON-NLS-1$
					qualifiedUrl = imageUrl;
				} else {
					if (imageUrl.startsWith("/")) { //$NON-NLS-1$
						qualifiedUrl += imageUrl.substring(0);
					} else {
						qualifiedUrl += imageUrl;
					}
				}

				log("Fetching " + qualifiedUrl, Project.MSG_INFO); //$NON-NLS-1$
				Get get = new Get();
				get.setProject(getProject());
				get.setLocation(getLocation());
				try {
					get.setSrc(new URL(qualifiedUrl));
				} catch (MalformedURLException e) {
					log("Skipping " + url + ": " + e.getMessage(), Project.MSG_WARN); //$NON-NLS-1$ //$NON-NLS-2$
					continue;
				}
				get.setDest(new File(dest, name));
				get.execute();

				filenames.add(name);
				++fileCount;
			} else {
				log(String.format("Unexpected title format: %s", title), Project.MSG_WARN); //$NON-NLS-1$
			}
		}
		log("Fetched " + fileCount + " image files for " + pageName, Project.MSG_INFO); //$NON-NLS-1$ //$NON-NLS-2$

		return filenames;
	}

	public URL getUrl() {
		return url;
	}

	public void setUrl(URL url) {
		this.url = url;
	}

	public String getPageName() {
		return pageName;
	}

	public void setPageName(String pageName) {
		this.pageName = pageName;
	}

	private class ImageFetchingContentHandler implements ContentHandler {

		public void setGimcontinue(String gimcontinue) {
			this.gimcontinue = gimcontinue;
		}

		public String getGimcontinue() {
			return gimcontinue;
		}

		private final Map<String, String> imageTitleToUrl = new HashMap<String, String>();

		private String currentPage = null;

		private boolean inImageInfo = false;

		private String gimcontinue = null;

		public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
			if ("page".equals(localName)) { //$NON-NLS-1$
				currentPage = atts.getValue("title"); //$NON-NLS-1$
			} else if ("images".equals(localName)) { //$NON-NLS-1$
				gimcontinue = atts.getValue("gimcontinue"); //$NON-NLS-1$
			} else if ("imageinfo".equals(localName)) { //$NON-NLS-1$
				inImageInfo = true;
			} else if (inImageInfo && "ii".equals(localName)) { //$NON-NLS-1$
				imageTitleToUrl.put(currentPage, atts.getValue("url")); //$NON-NLS-1$
			}
		}

		public void endElement(String uri, String localName, String qName) throws SAXException {
			if ("page".equals(localName)) { //$NON-NLS-1$
				currentPage = null;
			} else if ("imageinfo".equals(localName)) { //$NON-NLS-1$
				inImageInfo = false;
			}
		}

		public void characters(char[] ch, int start, int length) throws SAXException {
		}

		public void endDocument() throws SAXException {
		}

		public void endPrefixMapping(String prefix) throws SAXException {
		}

		public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
		}

		public void processingInstruction(String target, String data) throws SAXException {
		}

		public void setDocumentLocator(Locator locator) {
		}

		public void skippedEntity(String name) throws SAXException {
		}

		public void startDocument() throws SAXException {
		}

		public void startPrefixMapping(String prefix, String uri) throws SAXException {
		}

	}

}


Reproducible: Always

Steps to Reproduce:
1.create a mediawiki page with more than 10 images while the configured limit is 10 for access through the api
2.convert the mediawiki page to eclipse help
3.look at the amount of pulled images
Comment 1 Carsten Hammer CLA 2012-06-27 17:29:20 EDT
Trying to provide a patch file by dragging from git history does not work as expected. I generates a file too big to be attached for these few lines. Either the eclipse git support is buggy or I do something the wrong way. However commiting to gerrit does not work anyway. So here the diff as shown in eclipse git history:


commit e2c6cca966e6d756ec0fc7f75ef76a046fe80e2c
Author: chammera1a <carsten.hammer@oce.com> 2012-06-27 17:02:51
Committer: chammera1a <carsten.hammer@oce.com> 2012-06-27 17:37:34
Parent: 58b45e9c4b8bc1e12f986f7bb8887ae8fb2f5af8 (bug 348237: wikitext doesn't handle image file names properly  https://bugs.eclipse.org/bugs/show_bug.cgi?id=348237)
Branches: master



Change-Id: I56e5f418b8c746f1ddba362142d1d6e34588c4e6

org.eclipse.mylyn.wikitext.mediawiki.core/src/org/eclipse/mylyn/internal/wikitext/mediawiki/core/tasks/MediaWikiApiImageFetchingStrategy.java
diff --git a/org.eclipse.mylyn.wikitext.mediawiki.core/src/org/eclipse/mylyn/internal/wikitext/mediawiki/core/tasks/MediaWikiApiImageFetchingStrategy.java b/org.eclipse.mylyn.wikitext.mediawiki.core/src/org/eclipse/mylyn/internal/wikitext/mediawiki/core/tasks/MediaWikiApiImageFetchingStrategy.java
index 2231f57..3c32ca5 100644
--- a/org.eclipse.mylyn.wikitext.mediawiki.core/src/org/eclipse/mylyn/internal/wikitext/mediawiki/core/tasks/MediaWikiApiImageFetchingStrategy.java
+++ b/org.eclipse.mylyn.wikitext.mediawiki.core/src/org/eclipse/mylyn/internal/wikitext/mediawiki/core/tasks/MediaWikiApiImageFetchingStrategy.java
@@ -56,7 +56,7 @@
 			throw new BuildException("please specify @pageName"); //$NON-NLS-1$
 		}
 		if (!pageName.equals(pageName.trim())) {
-			throw new BuildException("@pageName must not have leading or training whitespace"); //$NON-NLS-1$
+			throw new BuildException("@pageName must not have leading or trailing whitespace"); //$NON-NLS-1$
 		}
 
 		String base;
@@ -69,90 +69,97 @@
 			base += "/"; //$NON-NLS-1$
 		}
 
-		URL apiUrl;
-		try {
-			String queryString = String.format(
-					"action=query&titles=%s&generator=images&prop=imageinfo&iiprop=url&format=xml", URLEncoder.encode(pageName, "UTF-8")); //$NON-NLS-1$ //$NON-NLS-2$
-			apiUrl = new URL(base + "api.php?" + queryString); //$NON-NLS-1$
-		} catch (Exception e) {
-			throw new BuildException("Cannot compose API URL", e); //$NON-NLS-1$
-		}
-
+		ImageFetchingContentHandler contentHandler = new ImageFetchingContentHandler();
+		String gimcontinue = null;
 		Set<String> filenames = new HashSet<String>();
-
 		final SAXParserFactory parserFactory = SAXParserFactory.newInstance();
 		parserFactory.setNamespaceAware(true);
 		parserFactory.setValidating(false);
+		do {
+			contentHandler.setGimcontinue(null);
+			URL apiUrl;
+			try {
+				String queryString = String.format(
+						"action=query&titles=%s&generator=images&prop=imageinfo&iiprop=url&format=xml%s", //$NON-NLS-1$
+						URLEncoder.encode(pageName, "UTF-8"), (gimcontinue == null ? "" : "&gimcontinue=" + URLEncoder.encode(gimcontinue, "UTF-8"))); //$NON-NLS-1$
+				apiUrl = new URL(base + "api.php?" + queryString); //$NON-NLS-1$
+			} catch (Exception e) {
+				throw new BuildException("Cannot compose API URL", e); //$NON-NLS-1$
+			}
 
-		Reader input;
-		try {
-			input = new InputStreamReader(new BufferedInputStream(apiUrl.openStream()), "UTF-8"); //$NON-NLS-1$
-		} catch (IOException e) {
-			throw new BuildException(String.format("Cannot contact %s: %s", apiUrl, e.getMessage()), e); //$NON-NLS-1$
-		}
-		try {
-			SAXParser saxParser = parserFactory.newSAXParser();
-			XMLReader xmlReader = saxParser.getXMLReader();
-			xmlReader.setEntityResolver(IgnoreDtdEntityResolver.getInstance());
-
-			ImageFetchingContentHandler contentHandler = new ImageFetchingContentHandler();
-			xmlReader.setContentHandler(contentHandler);
+			Reader input;
+			try {
+				input = new InputStreamReader(new BufferedInputStream(apiUrl.openStream()), "UTF-8"); //$NON-NLS-1$
+			} catch (IOException e) {
+				throw new BuildException(String.format("Cannot contact %s: %s", apiUrl, e.getMessage()), e); //$NON-NLS-1$
+			}
 
 			try {
-				xmlReader.parse(new InputSource(input));
-			} catch (IOException e) {
-				throw new BuildException(String.format("Unexpected exception retrieving data from %s", apiUrl), e); //$NON-NLS-1$
-			} finally {
+				SAXParser saxParser = parserFactory.newSAXParser();
+				XMLReader xmlReader = saxParser.getXMLReader();
+				xmlReader.setEntityResolver(IgnoreDtdEntityResolver.getInstance());
+
+				xmlReader.setContentHandler(contentHandler);
+
 				try {
-					input.close();
+					xmlReader.parse(new InputSource(input));
+					gimcontinue = contentHandler.getGimcontinue();
 				} catch (IOException e) {
-					// ignore
-				}
-			}
-			int fileCount = 0;
-			for (Map.Entry<String, String> ent : contentHandler.imageTitleToUrl.entrySet()) {
-				String title = ent.getKey();
-				String imageUrl = ent.getValue();
-				Matcher titleMatcher = imageTitlePattern.matcher(title);
-				if (titleMatcher.matches()) {
-					String name = titleMatcher.group(1);
-					name = name.replace(' ', '_');
-					String qualifiedUrl = base;
-					if (imageUrl.matches("https?://.*")) { //$NON-NLS-1$
-						qualifiedUrl = imageUrl;
-					} else {
-						if (imageUrl.startsWith("/")) { //$NON-NLS-1$
-							qualifiedUrl += imageUrl.substring(0);
-						} else {
-							qualifiedUrl += imageUrl;
-						}
-					}
-
-					log("Fetching " + qualifiedUrl, Project.MSG_INFO); //$NON-NLS-1$
-					Get get = new Get();
-					get.setProject(getProject());
-					get.setLocation(getLocation());
+					throw new BuildException(String.format("Unexpected exception retrieving data from %s", apiUrl), e); //$NON-NLS-1$
+				} finally {
 					try {
-						get.setSrc(new URL(qualifiedUrl));
-					} catch (MalformedURLException e) {
-						log("Skipping " + url + ": " + e.getMessage(), Project.MSG_WARN); //$NON-NLS-1$ //$NON-NLS-2$
-						continue;
+						input.close();
+					} catch (IOException e) {
+						// ignore
 					}
-					get.setDest(new File(dest, name));
-					get.execute();
-
-					filenames.add(name);
-					++fileCount;
-				} else {
-					log(String.format("Unexpected title format: %s", title), Project.MSG_WARN); //$NON-NLS-1$
 				}
+			} catch (SAXException e) {
+				throw new BuildException("Unexpected error in XML content", e); //$NON-NLS-1$
+			} catch (ParserConfigurationException e) {
+				throw new BuildException("Cannot configure SAX parser", e); //$NON-NLS-1$
 			}
-			log("Fetched " + fileCount + " image files for " + pageName, Project.MSG_INFO); //$NON-NLS-1$ //$NON-NLS-2$
-		} catch (SAXException e) {
-			throw new BuildException("Unexpected error in XML content", e); //$NON-NLS-1$
-		} catch (ParserConfigurationException e) {
-			throw new BuildException("Cannot configure SAX parser", e); //$NON-NLS-1$
+
+		} while (gimcontinue != null);
+		int fileCount = 0;
+		for (Map.Entry<String, String> ent : contentHandler.imageTitleToUrl.entrySet()) {
+			String title = ent.getKey();
+			String imageUrl = ent.getValue();
+			Matcher titleMatcher = imageTitlePattern.matcher(title);
+			if (titleMatcher.matches()) {
+				String name = titleMatcher.group(1);
+				name = name.replace(' ', '_');
+				String qualifiedUrl = base;
+				if (imageUrl.matches("https?://.*")) { //$NON-NLS-1$
+					qualifiedUrl = imageUrl;
+				} else {
+					if (imageUrl.startsWith("/")) { //$NON-NLS-1$
+						qualifiedUrl += imageUrl.substring(0);
+					} else {
+						qualifiedUrl += imageUrl;
+					}
+				}
+
+				log("Fetching " + qualifiedUrl, Project.MSG_INFO); //$NON-NLS-1$
+				Get get = new Get();
+				get.setProject(getProject());
+				get.setLocation(getLocation());
+				try {
+					get.setSrc(new URL(qualifiedUrl));
+				} catch (MalformedURLException e) {
+					log("Skipping " + url + ": " + e.getMessage(), Project.MSG_WARN); //$NON-NLS-1$ //$NON-NLS-2$
+					continue;
+				}
+				get.setDest(new File(dest, name));
+				get.execute();
+
+				filenames.add(name);
+				++fileCount;
+			} else {
+				log(String.format("Unexpected title format: %s", title), Project.MSG_WARN); //$NON-NLS-1$
+			}
 		}
+		log("Fetched " + fileCount + " image files for " + pageName, Project.MSG_INFO); //$NON-NLS-1$ //$NON-NLS-2$
+
 		return filenames;
 	}
 
@@ -174,15 +181,27 @@
 
 	private class ImageFetchingContentHandler implements ContentHandler {
 
-		final Map<String, String> imageTitleToUrl = new HashMap<String, String>();
+		public void setGimcontinue(String gimcontinue) {
+			this.gimcontinue = gimcontinue;
+		}
+
+		public String getGimcontinue() {
+			return gimcontinue;
+		}
+
+		private final Map<String, String> imageTitleToUrl = new HashMap<String, String>();
 
 		private String currentPage = null;
 
 		private boolean inImageInfo = false;
 
+		private String gimcontinue = null;
+
 		public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
 			if ("page".equals(localName)) { //$NON-NLS-1$
 				currentPage = atts.getValue("title"); //$NON-NLS-1$
+			} else if ("images".equals(localName)) { //$NON-NLS-1$
+				gimcontinue = atts.getValue("gimcontinue"); //$NON-NLS-1$
 			} else if ("imageinfo".equals(localName)) { //$NON-NLS-1$
 				inImageInfo = true;
 			} else if (inImageI
Comment 2 Carsten Hammer CLA 2012-07-07 05:27:17 EDT
I added a patch fixing this as gerrit commit at https://git.eclipse.org/r/#/c/6537/. Is there something wrong with this approach to contribute? Or is there another better way to transfer contributions?
Comment 3 David Green CLA 2012-07-09 11:47:57 EDT
Carsten, thanks for posting a review.  I apologize for the slow response on this one.  

Creating a bug referencing a Gerrit review is definitely the best way to provide contributions.  Alternatively you can attach your changes to the bug by creating a patch.  It's very difficult to grok changes in a comment on a task.

It would be great to have some unit tests for this one.  Consider mocking up some of the components to avoid having to hit a server.

I've commented on the review.
Comment 4 Carsten Hammer CLA 2012-07-09 18:22:41 EDT
Created attachment 218470 [details]
change to take into account review
Comment 5 Carsten Hammer CLA 2012-07-09 18:22:44 EDT
Created attachment 218471 [details]
mylyn/context/zip
Comment 6 Carsten Hammer CLA 2012-07-10 17:44:04 EDT
I finally created a gerrit review containing your suggestions at https://git.eclipse.org/r/#/c/6710/
It seems by abandoning a unchanged second patchset at the old gerrit review I abandened not only the patch but the whole review. The gerrit user interface is somewhat confusing in parts.
Comment 7 David Green CLA 2012-07-13 19:15:19 EDT
Carsten, can you please answer these questions[1] as a comment on this bug

1. I have authored 100% of the content I'm contributing
2. I have the rights to donate the content to Eclipse
3. I contribute the content under the EPL

fn1. http://wiki.eclipse.org/Development_Resources/Handling_Git_Contributions
Comment 8 Carsten Hammer CLA 2012-07-14 05:18:20 EDT
Hi David,

1. I have authored 100% of the content I'm contributing
2. I have the rights to donate the content to Eclipse
3. I contribute the content under the EPL

Best regards,
Carsten Hammer
Comment 9 Carsten Hammer CLA 2012-07-21 04:22:59 EDT
Hi David,
anything else that I have to do to make it possible to merge it into mainstream? I think I dont have the right to do it myself. 
Best regards,
Carsten
Comment 10 David Green CLA 2012-07-27 10:17:30 EDT
Carsten, I've checked this change manually and it seems to work as expected, however I would like to see some unit tests to verify that it works as expected for responses containing a gimcontinue and responses without.
Comment 11 David Green CLA 2013-08-02 12:43:05 EDT
Closed as part of backlog clean-up.  Please re-open if you'd like to see this revisited, perhaps with a contribution.
Comment 12 Steffen Pingel CLA 2013-10-22 16:57:19 EDT
Reopening since this bug has been confirmed and there appears to be a fix in Gerrit. David, I think it would be good to consider this for the next (service) release.
Comment 13 David Green CLA 2013-10-25 19:39:51 EDT
Thanks for the contribution Carsten, sorry this has taken so long to get in.  I've merged the review.
Comment 14 David Green CLA 2013-10-25 19:40:50 EDT
*** Bug 418964 has been marked as a duplicate of this bug. ***
Comment 15 John Arthorne CLA 2013-10-27 16:44:30 EDT
I grabbed the latest nightly snapshot from Mylyn downloads page and tried but still have the same problem. I'm guess it just hasn't been picked up by a build yet.
Comment 16 David Green CLA 2013-11-18 10:25:31 EST
John, for a short period there new builds were not being published correctly.  It should be up there now.  Let me know if you have any problems.