WebHarvy version 3.2 released!

We have made several improvements and feature additions to our popular web scraping software WebHarvy. Most of the new features added in this release were recommended by WebHarvy’s existing customers. We would like to thank everyone who helped us test and improve this release while in beta.

The changes are :-

  • Supports scraping data from web pages where more data is loaded when the page is scrolled to the end (more details)
  • Supports scraping data from web pages where more data is loaded when a ‘load more data’ or ‘show more content’ type button/link is clicked (more details)
  • Supports editing URLs associated with a configuration (more details)
  • Supports editing keywords associated with a configuration (more details)
  • Supports downloading images whose URL is obtained by applying a Regular Expression (RegEx) to the HTML source of the selected content
  • Ability to select category links one-by-one, during configuration (more details)
  • Refined ‘Capture following text’ option (more details)
  • Multiple groups in a single RegEx string captured (more details)
  • Handles different layouts used by Amazon for displaying product details like ASIN
  • Advanced Miner Options (more details)
  • Automatically checks for new updates
  • Authentication support for private proxies while scraping data from HTTPS websites (more details)
  • Minor bug fixes and several improvements

The latest version may be downloaded from https://www.webharvy.com/download.html

Posted in Release update, WebHarvy

Use ‘Capture Following Text’ option to scrape data from details pages

While extracting data from details pages (pages reached by following a link from the start page), it is recommended that the ‘Capture following text’ option be used whenever possible to correctly and consistently scrape data.

This is because the layout and the amount of data displayed in details pages may not be consistent. For example, if you are trying to scrape Amazon product listings, the data displayed in the product details page (the page reached by clicking the product link in the search results) may vary slightly from product to product. Here, if you are trying to extract the Shipping Weight under Product Details, instead of clicking on the data (example: ‘1.2 pounds’), click on the heading ‘Shipping Weight’ and apply the ‘Capture following text’ option under the ‘More Options’ button.

Watch the demo :-


So in summary, if the data to be extracted comes under a heading, always click the heading and apply the ‘Capture following text’ option. This ensures that the data is scraped from all similar pages without missing any, even if the page contents vary slightly.
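To see why anchoring on the heading is more robust than anchoring on the position of the value, here is a minimal Python sketch of the idea. It is only an illustration and not how WebHarvy works internally; the function name `capture_following_text` and the two sample pages are made up for this example.

```python
import re
from typing import Optional


def capture_following_text(page_text: str, heading: str) -> Optional[str]:
    """Return the text that follows `heading` (hypothetical helper), or None."""
    # Match the heading, an optional colon, then capture the rest of that line.
    pattern = re.compile(re.escape(heading) + r"\s*:?\s*(.+)")
    match = pattern.search(page_text)
    return match.group(1).strip() if match else None


# Two made-up details pages: the same field sits at a different position in each.
page_a = "Product Details\nShipping Weight: 1.2 pounds\nASIN: B000000001"
page_b = "Product Details\nASIN: B000000002\nColor: Black\nShipping Weight: 3 pounds"

for page in (page_a, page_b):
    print(capture_following_text(page, "Shipping Weight"))
```

Because the lookup starts from the heading text, the value is found on both pages even though it appears on a different line in each.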


Posted in WebHarvy, WebHarvy Feature

Scrape HTML

WebHarvy allows you to scrape the HTML of page contents in addition to plain text. In the Capture window, click the ‘More Options’ button and select the ‘Capture HTML’ option to scrape the HTML of the selected content.

To capture only a portion of the displayed HTML, you may select and highlight the required portion before clicking the Capture button.

Regular Expressions are usually applied to the HTML source of the content to extract data of interest, such as an image URL or a hidden field like a phone number.
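As a rough illustration of applying a Regular Expression to captured HTML, the Python sketch below pulls the image URL out of an `<img>` tag. The HTML snippet and the pattern are assumptions made for this example; the pattern you would enter in WebHarvy depends on the actual HTML of the page you are scraping.

```python
import re

# Made-up HTML, standing in for what 'Capture HTML' might return.
html = (
    '<div class="product">'
    '<img class="thumb" src="https://example.com/images/item-42.jpg" alt="Item 42">'
    "</div>"
)

# Capture whatever sits between the quotes of the src attribute of an <img> tag.
IMG_SRC = re.compile(r'<img[^>]*\bsrc="([^"]+)"', re.IGNORECASE)

match = IMG_SRC.search(html)
image_url = match.group(1) if match else None
print(image_url)
```

Only the part inside the parentheses (the capture group) is returned, which is why the surrounding tag text does not end up in the result.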

The following video shows how the ‘Capture HTML’ option is used along with Regular Expressions to correctly extract the product price.

Try out the free evaluation copy of WebHarvy from https://www.webharvy.com/download.html.

Posted in WebHarvy, WebHarvy Feature

Scraping hidden (click to display) fields using WebHarvy

Certain web pages require you to click on a link or button for the data to be displayed. There are many websites where email addresses or phone numbers are only partially displayed and are shown in full only when you click on them.

The ‘Click’ option under the ‘More Options’ button in the Capture window lets you scrape data in such scenarios (see https://www.webharvy.com/tour1.html#ScrapeHidden).

The following video shows how this option can be used to scrape hidden fields.

Here the phone numbers are partially displayed. Using the Click option, they can be made fully visible and then scraped.

To know more about the features of WebHarvy, see the product feature tour at https://www.webharvy.com/tour.html.

Posted in WebHarvy, WebHarvy Feature

Scrape with Regular Expressions using WebHarvy

WebHarvy is designed as a ‘point and click’ visual Web Scraper. The design concentrates on ease of use, so that you can start scraping data within a few minutes of downloading the software.

But in case you need more control over what is extracted, you can use Regular Expressions (RegEx) with WebHarvy. WebHarvy allows you to extract data by matching RegEx strings against the text content as well as the HTML source of the web page.

If you are new to Regular Expressions, see http://en.wikipedia.org/wiki/Regular_expression.

The following video shows how WebHarvy can be used to scrape the image URL from a web page by applying a Regular Expression.

The ‘Capture More Content’ feature comes in handy here (as shown in the video) to make sure that the selected text contains the data (text or HTML code) of interest before the RegEx string is applied.

Regular Expressions can also be applied directly on the text content of the page as shown in the following video.
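For readers new to RegEx, here is a small Python sketch of the same idea applied to plain text: extracting a price from a line of captured text. The sample text and the pattern are assumptions chosen for illustration, not something taken from the video.

```python
import re

# Made-up text, standing in for the text content captured from a page.
text = "Acme Widget - Price: $1,249.99 (free shipping)"

# A dollar sign, then digits with optional thousands separators and optional cents.
PRICE = re.compile(r"\$\s*(\d[\d,]*(?:\.\d{2})?)")

match = PRICE.search(text)
# Drop the thousands separators before converting the captured group to a number.
price = float(match.group(1).replace(",", "")) if match else None
print(price)
```

The capture group keeps only the numeric part, so the currency symbol and the surrounding words are left out of the extracted value.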

To explore further, download the latest version of WebHarvy from https://www.webharvy.com/download.html.

Posted in WebHarvy, WebHarvy Feature

WebHarvy 3.1 (Minor Update)

The 3.1 update of WebHarvy, which was released yesterday (July 24), has the following changes.

  • Added option to Tag captured data rows with corresponding Keyword/Category. (Applicable only for Keyword/Category based Scraping). See the new Miner Settings Window (Edit menu – Settings)
  • Option to separately set Page Load Timeout and AJAX Load Wait Time in Miner Settings.
  • Option to edit the start URL / Post Data / Headers for the configuration directly from the UI, without editing the XML configuration file. (under Edit menu – Edit Options)
  • Updates related to Category Scraping, Capture Text following a Heading, Mining multiple pages
  • Bug Fixes

Download and install the latest update from https://www.webharvy.com/download.html.

Posted in Release update, WebHarvy, WebHarvy Feature

WebHarvy Version 3.0 Released!

We are happy to announce the release of WebHarvy 3.0. We have added a lot of new features in this major update. The feature/change list for this update is the longest of all the product updates we have done to date. Here we go.

  • Added the following options in the Capture Window (grouped under ‘More Options’)
    • Capture following text: Improved by using brute force search for all elements in the page
    • Capture HTML: Option to scrape HTML of selected element
    • Capture Text as File: Option to scrape text and save it as a local file (useful while scraping articles and blog posts)
    • Click: Ability to scrape hidden (partially displayed) fields in webpages which require a click from the user to be displayed in full. For example phone numbers or email addresses which are displayed completely only if you click them.
    • Apply Regular Expression: Option to apply Regular Expressions (RegEx) on captured text. RegEx can be applied even after applying ‘Capture following text’, ‘Capture HTML’ & ‘Capture More Content’ options.
    • Capture More Content: Option to capture more text than the selected text by capturing the parent element’s text. For example, this would capture the entire article if you apply this option after having selected the first paragraph.
  • Option to individually select categories/links (one by one) for Category Scraping (Mine menu – Scrape a list of similar links)
  • Export captured data as JSON
  • Ability to mine data from tables (row-column / grid layout)
  • Ability to mine pages which have fewer (less than 10) data items
  • Option to test proxies before using them (Edit menu – Settings – Proxy Settings)
  • Non-responsive proxies are skipped during mining. Mining will not stop because of a bad/non-responsive proxy in the list.
  • Option to manually add URLs to an existing configuration (Edit menu – Add URLs to configuration)
  • Option to remove duplicates while mining (Edit menu – Settings – Miner)
  • Added ‘Hourly’ frequency option in Scheduler (Mine menu – Scheduler)
  • Added option to export data directly to database for scheduled mining tasks & command line
  • Added ‘Clear’ option in Edit menu which will clear both the browser and data preview pane
  • Language encoding defaulted to ‘utf-8’ for file exports (XML, CSV etc)
  • CSV/Database export : handles delimiters (comma, quotes etc) in captured data
  • Keyword/Category scraping allowed for 2 entries in evaluation version
  • Rendering issues with in-built browser fixed – defaults to IE 9 rendering
  • New Installer built with InstallShield
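
To illustrate what handling delimiters in captured data means for CSV export, here is a minimal Python sketch using the standard `csv` module. It is not WebHarvy's export code; the sample rows and field names are made up. A value containing a comma or quotes is wrapped in quotes, with embedded quotes doubled, so that it still reads back as a single field.

```python
import csv
import io

# Made-up captured rows; the first title contains both a comma and quotes.
rows = [
    {"title": 'Widget, large ("deluxe")', "price": "12.50"},
    {"title": "Plain widget", "price": "3.00"},
]

# Write to an in-memory buffer; for a real file, open it with encoding="utf-8".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "price"], lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields the original values, delimiters included.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
```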

Download the latest installation of WebHarvy Web Scraper from https://www.webharvy.com/download.html.

Posted in Release update, WebHarvy, WebHarvy Feature

USBTrace version 2.8 Released!

We’ve just released the latest update of USBTrace, the USB analyzer software for Windows. The changes in this update are :-

  • Added option to timestamp captured requests in ‘system’ time (HH:MM:SS:milliSeconds)
  • Added decoding of bConfigurationValue in Configuration Descriptor
  • Captured Data Export (HTML, CSV, XML) made faster
  • Added headers for HTML and XML export files
  • Updated USB device list for VID/PID decoding
  • Layout of Search/Filter/Trigger windows changed
  • More support for Windows 8 and USB 3.0 (SuperSpeed USB)
  • Minor bug fixes

You may download the 15-day free trial version of USBTrace USB Analyzer from http://www.sysnucleus.com/usbtrace_download.html.

Posted in Release update, USBTrace, USBTrace Features

To all customers of UniBlue’s software who have incorrectly installed our drivers

This post is for all customers of UniBlue’s driver update software who have incorrectly installed our software’s drivers.

First off, kindly be informed that the driver which is incorrectly installed on your systems (and which has rendered most of your USB devices unusable) is a software component used exclusively by our software, USBDeviceShare (http://www.sysnucleus.com/usbshare/).

We own this driver and have NOT authorised any third party (including UniBlue) to redistribute our drivers. Our drivers are meant to be used only with our software and to be installed only along with it. The problem which you are currently facing is due to UniBlue’s software. We have contacted them regarding this and have also requested them to remove our drivers from their updates. We are yet to receive a reply from them.

Update (Feb 20, 2013): UniBlue Technical Support has informed us that they will investigate this issue and take immediate action.

Since our drivers are involved and since many of UniBlue’s customers are contacting us regarding a solution to this problem, we are listing below the steps to be followed to remove our drivers from your system. For more assistance we request you to contact UniBlue itself. We have our own customers to support and it is beyond us to attend to each of you, mainly because the problem was initiated by UniBlue.

The solution is to remove the USBDeviceShare’s driver completely from your system. For this you may need to plug in a non-USB mouse/keyboard to your system, if the current ones are not working. Please follow the steps below :

1. Go to c:\windows\inf directory
2. Search for all files (*.*) containing the text ‘udsstub.sys’
3. Delete all files displayed in the search result
4. Delete the file ‘udsstub.sys’ from c:\windows\system32\drivers folder

5. Open RegEdit (Run ‘regedit’) – in administrator mode
6. Delete the key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\udsstub (including subkeys)
7. Restart your system and see if the problem is solved.
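
If searching the inf directory by hand (steps 1 and 2) is awkward, the search can be sketched in Python as below. This is only an illustration under a few assumptions: Python is installed, the function name `find_files_mentioning` is made up for this example, and the script only lists matching files. It does not delete anything, so steps 3 to 7 are still done manually. Both plain ASCII and UTF-16 encoded files are checked, since files in that directory may use either encoding.

```python
from pathlib import Path
from typing import List


def find_files_mentioning(directory: str, needle: str) -> List[Path]:
    """List files in `directory` whose contents mention `needle` (case-insensitive)."""
    needle_lower = needle.lower()
    # Look for the text both as plain ASCII and as UTF-16 (little-endian).
    patterns = (needle_lower.encode("ascii"), needle_lower.encode("utf-16-le"))
    hits = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        try:
            data = path.read_bytes().lower()
        except OSError:
            continue  # skip files that cannot be read
        if any(pattern in data for pattern in patterns):
            hits.append(path)
    return hits


if __name__ == "__main__":
    inf_dir = Path(r"C:\Windows\inf")
    if inf_dir.is_dir():
        for hit in find_files_mentioning(str(inf_dir), "udsstub.sys"):
            print(hit)
```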

You may also try to update the device driver directly :-

1. Open device manager
2. Locate the node corresponding to non working devices
3. Right click and update driver
4. Select ‘Browse for driver on my computer’ and later ‘Let me pick from a list of drivers on my computer’ options
5. Select the correct driver from the list displayed

We hope that the above instructions will help you remove our drivers, but if they do not, please contact UniBlue support. If UniBlue does not remove our drivers from its updates, the same issue may happen again in the future.

Posted in USBDeviceShare

Web Scraping from Command Line

WebHarvy supports command line arguments so that you can run the software directly from the command line. This allows you to run WebHarvy from script or batch files, or to invoke it via code from your own applications.

To learn more, read: Running WebHarvy Web Scraper from Command Line
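
As a rough sketch of invoking a command line program from your own code, the Python snippet below wraps `subprocess.run`. The helper name `run_tool` is made up for this example, and no WebHarvy arguments are shown on purpose: the executable path depends on where you installed WebHarvy, and the arguments must be taken from the command line documentation mentioned above.

```python
import subprocess
from typing import Sequence


def run_tool(executable: str, arguments: Sequence[str], timeout_seconds: float = 3600) -> int:
    """Run a command line program, wait for it to finish and return its exit code."""
    # Passing a list (not a single string) avoids shell quoting problems
    # with paths that contain spaces.
    completed = subprocess.run([executable, *arguments], timeout=timeout_seconds)
    return completed.returncode


# Hypothetical usage; replace both placeholders with real values:
# exit_code = run_tool(r"C:\path\to\WebHarvy.exe", ["<arguments from the documentation>"])
```

The same call can be placed in a scheduled script, with the exit code used to decide whether to retry or to report a failure.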

Posted in WebHarvy, WebHarvy Feature