|
$15 Gift Certificate for every bug you find.

This IP address to country database is provided as a CSV file that can easily be read by CSV Manager.
Got a question for us?
Just Ask!
|
Using the MaxMind IP Address Location Database with CSV Manager
CSV Manager can be used to easily and quickly load the
IP address lookup database
from MaxMind, allowing you to determine
which country an IP address belongs to. MaxMind supplies this
database in a number forms, including as a
CSV file.
PLEASE NOTE: We are an affiliate of MaxMind.
Which Country Does That IP Address Come From?
So you want to know where your website visitors are coming from?
You can solve this problem by using an
IP Address (example: 216.239.57.99) to
two-letter Country Code
(example: US)
lookup table. This table lists all known IP addresses beside the country they come from.
It's a pretty big table. MaxMind supplies this table as
a CSV file (among other formats), and Ricebridge supplies the CSV file reader.
Now all you have to do is:
- Load the data into a database from the CSV file
- Query the database to find the Country Code
- Update the database each week with changes
It's a simple case of programming!
What Does the Data Look Like?
We are going to use the free MaxMind GeoLite Country Database.
Commercial versions with higher accuracy are also available in the same data format, so this article still applies to them.
The data consists of rows with six fields.
Each row maps a range of IP addresses to a single country, by giving a start and end IP address.
The data fields we are interested in are the 3rd, 4th and 5th; the starting IP number, the ending IP number, and the Country Code.
What are IP numbers? Well, instead of 255.255.255.255, which is just 4 bytes stuck together,
we can say 0xFFFFFFFF in hex,
or 4294967295 in decimal.
So the IP number is just the IP address interpreted as a number.
Database Table Columns
CreateDatabase.sql
CREATE TABLE IPLookup (
BeginIPNum BIGINT NOT NULL,
EndIPNum BIGINT NOT NULL,
CountryCode SMALLINT NOT NULL
);
See the MySQL Numeric Types
reference page for more details about these column data types. We have used signed types to keep things simple for coding in Java,
but here's an description of how to squeeze the database size down.
OK, let's design a database table for these three fields.
We're going to use MySQL for our examples,
but the SQL should be quite portable.
We want to store the two IP numbers, and the country they map to.
First, we'll assign numbers to the Country Codes and store them in a 2-byte
integer field - hey, there could easily be
more than 256 countries in a few
years time. For the purposes of this article, we will hard-code the list
of two letter Country Codes that these numbers are assigned to,
but you can easily change this in your own code.
The IP numbers need to be stored in data fields that will keep them
positive. If you try to jam them into a 4-byte integer field, then IP
numbers with the
leftmost bit set will turn into negative numbers.
Of course, if your database has unsigned types, go ahead and use those.
Using CSV Manager to Load the Data
OK, we've got a CSV file on one side, and a database table on the other.
Sounds like a job for CSV Manager!
We want to create an INSERT statement for each row, and execute these as
fast as possible without wasting memory. We could read all the rows into memory at once,
with the
CsvManager.load(File)
method, but if we define our own
LineListener,
we can load and save the data as a continuous stream.
CSV Manager LineListeners are simple objects that wait around listening for
new CSV data records, one for each line of the CSV file. This data is handed over
as a String[], which is guaranteed to contain the minimum number of data fields.
A LineListener to Insert the Data
MaxMindInserter.java
public class MaxMindInserter extends LineListenerSupport {
protected Connection iConnection;
protected PreparedStatement iInsertStatement;
public MaxMindInserter( Connection pConnection ) {
iConnection = pConnection;
}
public void startLoadImpl() {
iInsertStatement = iConnection.prepareStatement(
"INSERT INTO IPLookup VALUES( ?,?,? )" );
}
public BadLine addLineImpl( String[] pLine,
int pNumFields, long pLineNumber, String pOriginalLine )
{
MaxMindLineData mmld = new MaxMindLineData( pLine );
if( mmld.valid() ) {
short ci
= ((Short) MaxMindDef.sCountryMap.get(
mmld.getCountryCode() )).shortValue();
iInsertStatement.setLong( 1, mmld.getBeginIPNum() );
iInsertStatement.setLong( 2, mmld.getEndIPNum() );
iInsertStatement.setShort( 3, ci );
iInsertStatement.executeUpdate();
}
return null;
}
public void endLoadImpl() {
iInsertStatement.close();
}
}
PLEASE NOTE: the above code has been condensed for this example
This is our custom code to actually insert the data into the database.
We have used very simple JDBC
database access code for this example. In your own code you will probably use
something more modern like a DataSource or a connection pool.
The INSERT statement is loaded with data and
executed in the addLineImpl method.
We use a utility class, MaxMinLineData, to
convert the text data into the correct data types for our database.
We will cover this class in the next section.
The MaxMindLineData class is a good example of a CSV Manager custom
LineListener. As each line of the CSV file is read, it is inserted directly
into the database. Thus we do not have to load the entire data file into memory.
Converting Strings into Numbers
MaxMindLineData.java
String iCountryCode = pLine[2];
long iBeginIPNum
= Long.parseLong( pLine[3] );
long iEndIPNum
= Long.parseLong( pLine[4] );
The variable pLine refers to the String[] parameter
passed in by MaxMindInserter, as shown above.
The 3rd, 4th and 5th data fields (BeginIPNum, EndIPNum, CountryCode)
are at index 2,3 and 4 in the array, respectively.
The CSV data that we get from MaxMind is provided as plain text data.
That means we have to convert the String value of each data field into the correct
Java data type so that we save it into the database.
This is where we have to be careful. A 4-byte IP address can fit inside a Java int variable,
but all IP Addresses above 127.255.255.255 will be converted into
negative numbers.
Thus we have to store the IP address in a long variable to prevent this.
For the Country Code, we do a quick lookup from our hard coded HashMap in the MaxMindInserter class.
This gives us a short variable.
Lock and Load! Let's Get the Data in the Database!
MaxMindManager.java
DriverManager.registerDriver(
new com.mysql.jdbc.Driver() );
Connection conn = DriverManager.getConnection(
"jdbc:mysql://localhost/testdb",
"testuser", "testpass" );
CsvManager csvman = new CsvManager();
MaxMindInserter mmi = new MaxMindInserter( conn );
csvman.load( pDataFile, mmi );
Again, condensed. For full code, see MaxMindManager.java.
You will have to replace testdb, testuser and testpass with correct values for your own database.
And you will need to use a different database driver if you are not using MySQL.
This code shows how to actually insert the data.
It's only a few lines, as we have done all the hard work already.
CSV Manager of course takes care of the CSV file for us.
How long does the data take to load? There are about 75-80K lines in
the MaxMind CSV file. On a desktop machine, they usually take
about 2 minutes to load fully into the IPLookup
database table. On a high-spec database server,
you'll be an order of magnitude (or more) faster.
If you drop and insert the entire table each time the CSV file changes, this delay can
still be annoying. Later in this article we will present a much faster solution.
So Where Are You From?
MaxMindManager.java
ArrayList sProxyHeaders = new ArrayList();
sProxyHeaders.add("HTTP_X_FORWARDED_FOR");
sProxyHeaders.add("HTTP_FORWARDED");
sProxyHeaders.add("HTTP_CLIENT_IP");
String ipaddr
= pHttpServletRequest.getRemoteAddr();
for( Iterator hI
= sProxyHeaders.iterator();
hI.hasNext(); ) {
String ph = (String) hI.next();
String v = pRequest.getHeader( ph );
if( null != ph
&& null != v && !"".equals( v ) ) {
ipaddr = v;
}
}
The ipaddr variable will contain the most accurate IP address for this visitor.
Alright. Now we have a visitor on our website. We can get their IP address from the
TCP/IP
packets they send in order to request HMTL pages. If they are behind a proxy or firewall, the packet
IP address might be incorrect, so we also have to check the following HTTP headers:
HTTP_X_FORWARDED_FOR
HTTP_FORWARDED
HTTP_CLIENT_IP
Once we have IP Address of our visitor, we convert it to an IP number (we'll look at that in a minute).
Once we have the IP number, we can use this to query the database. We want to identify the database record where
the visitor IP number is in the IP number range assigned to a particular country. We do this by searching for
the row where the visitor IP number is greater-than-or-equal-to the BeginIPNum column,
and less-than-or-equal-to the EndIpNum column.
IP Address to Number conversion
MaxMindManager.java
String[] quads = pIPAddress.split( "\\." );
long ipnum
= (long)16777216*Long.parseLong(quads[0])
+ (long)65536*Long.parseLong(quads[1])
+ (long)256*Long.parseLong(quads[2])
+ (long)1*Long.parseLong(quads[3]);
We are using the regular expression \. to
split the IP address text. We have escaped the dot as it
normally means match any character.
Let's return to the question of how to convert the text of an IP address, a string such as
255.255.255.255, into an IP number, such as 4294967295.
All we have to is take each byte and multiply it by two to the power of the number of bits
we have to shift it left to get it in the right position. Take 128 in the
IP address 255.255.128.255. We need to move it 8 bits left. Or look at it
another way, we need 128 multiplied by 100000000 in binary. So we end up with
128 * 256.
Getting the Country Code from the Database
MaxMindManager.java
String cc = "US"; // default
String ipq = "SELECT CountryCode FROM IPLookup WHERE "
+ "BeginIPNum <= "+ipnum
+ " AND "+ipnum+" <= EndIPNum";
short ci = -1;
Statement st = pConnection.createStatement();
ResultSet rs = st.executeQuery( ipq );
if( rs.next() ) {
ci = rs.getShort(1);
}
if( -1 != ci ) {
cc = MaxMindDef.sCountry[ci];
}
The variable ipnum holds the IP number. The class
MaxMindDef.java
defines the hard-coded Country Codes. The variable cc holds the two letter
Country Code.
Well, this is the biggie. Finally we get to find out which country our
visitor is coming from.
All we have to do is query the database, and request the Country Code for the
IP number range that includes our visitor's IP Number.
Then we dereference the country index in our hard-coded list of Country Code's to get the
two-letter Country Code.
What About the Monthly Updates?
So we're pretty much done. Or are we? The IP address to country table is not
static. In fact, it changes all the time. MaxMind publishes an update to the free
version every month, and an update to the higher-accuracy commercial version every week.
So every week, or every month, you have to delete the old table from your database,
and import the new table. That's nice. Or rather, it's not very nice at all, since your IP lookup
code will fail during this update. If you have high visitor volume this is not really what you want.
Even if you have low visitor volume, it's still bound to happen just when an important buyer is on your site
(trust me, I've been there).
So what to do? Well, it would be much better to take each update, figure out which records are new, and
which records are old, add the new records, delete the old ones, and keep the database table working.
So that's what we're going to do.
An Nice Algorithm for Updates
To find out which records are new and which are old, we have to
compare the new IP address table with the old table we have in the
database. The easy way to do this is just to go through each table,
and for the new table, add anything not in the old, and for the old,
delete anything not in the new. Unfortunately, that approach is very
wasteful of resources and will take longer than simply deleting and
loading the entire new table.
Instead, let's look at the data again. It's just numbers, even the
country code. So we can append them all together to get a really big
number. We have 4 bytes each for the start and end IP numbers, and 2
bytes for the Country Code. So that gives us a 10 byte number. This
really big number defines the IP address range to country mapping
entirely and it is all we need.
Sorted!
So how do we use these numbers? Well, for starters, we sort
them. Luckily, MaxMind has already sorted the CSV file, so long
as we build our big number in the following order:
BeginIPNum, EndIPNum,
CountryCode. We can use the database to sort the old
table. In fact we only need to sort by BeginIPNum, since
we know this is unique (Otherwise the MaxMind CSV file
would be wrong).
Now we have two sorted sets of numbers, and we can apply a very simple
batch processing algorithm to them: starting from the first position
in both lists, keep moving to the next position so long as the numbers
at the current position are equal (that is, we don't need to add or
delete them, they have stayed the same). If the number in the new list
is less than the number in the old, then we know we have to add it -
it must be missing as the old list has already passed over
it. Similarly, if the new number is greater than the old number, then
delete the number in the old list - it can't be in the new list
because we've already passed over it.
Yeah, But That's A Lot of Data
MaxMindLineListener.java
public class MaxMindLineListener
extends LineListenerSupport {
protected long[] iRanges;
protected short[] iCountries;
protected int iNumLines;
protected int iIndex;
public MaxMindLineListener( int pNumLines ) {
iNumLines = pNumLines;
}
public void startLoadImpl() {
iRanges = new long[iNumLines];
iCountries = new short[iNumLines];
iIndex = 0;
}
public BadLine addLineImpl( String[] pLine,
int pNumFields, long pLineNumber,
String pOriginalLine ) {
long start = Long.parseLong( pLine[2] );
long end = Long.parseLong( pLine[3] );
short ci
= ((Short) MaxMindDef.sCountryMap.get(
pLine[4])).shortValue();
iRanges[iIndex] = (start << 32) | end;
iCountries[iIndex] = ci;
iIndex++;
return null;
}
}
The variable iNumLines is the estimated maximum number of lines in the
new CSV file. Luckily this is quite easy to work out, as the average minimum length
is probably a bit less than 64 characters (it's easy to see this by inspecting the CSV file).
To estimate the maximum number of lines, we just divide the CSV file size by 64.
It is. We need to get the new data in the latest CSV file,
and the old data in the database table into a common format in
main memory so that we can run the comparison algorithm.
Now, we can't store a 10 byte integer natively in java, but we certainly
want to use native operations, as they are the fastest. So we'll split our number into
two parts: a long and a short. We can then have two
arrays, one long[] and one short[] array, for each data table.
If we assume that each table has 100K rows (in reality we have about 75-80k), then we require
about 1.6Mb of main memory: 2 * (8+2)*100000 / (1024*1024). By default, the
Java virtual machine starts with 64Mb, so we should be OK.
We will store two IP numbers (4 bytes) in one long (8 bytes).
First we shift the start IP number 32 bits (4 bytes) left.
Then we OR it with the end number.
We'll have to perform similar trickery to get the IP numbers out again.
We use shorts for the countries to allow for additional values in future.
We use the MaxMindLineListener to load the data in the
latest CSV file and store it in the long and short arrays.
Compare and Contrast
MaxMindManager.java
public void compare(
long[] pNewRanges, short[] pNewCountries, int pNewMax,
long[] pOldRanges, short[] pOldCountries, int pOldMax,
ArrayList pAdd, ArrayList pDelete ) {
long[] nr = pNewRanges; short[] nc = pNewCountries;
long[] or = pOldRanges; short[] oc = pOldCountries;
int nmax = pNewMax; int omax = pOldMax;
// ns: new start ip, ne: new end ip, nk: new country
// os: old start ip, oe: old end ip, ok: old country
long ns, ne, os, oe; short nk, ok; int nI, oI;
while( nI < nmax && oI < omax ) {
ns = nr[nI] >>> 32; ne = nr[nI] & 0xFFFFFFFFL;
nk = nc[nI];
os = or[oI] >>> 32; oe = or[oI] & 0xFFFFFFFFL;
ok = oc[oI];
if( ns == os && ne == oe && nk == ok ) {
nI++; oI++;
}
else if( ns < os || ( ns == os && ne < oe )
|| ( ns == os && ne == oe && nk < ok ) ) {
pAdd.add( new Record( ns, ne, nk ) );
nI++;
}
else if( ns > os || ( ns == os && ne > oe )
|| ( ns == os && ne == oe && nk > ok ) ) {
pDelete.add( new Record( os, oe, ok ) );
oI++;
}
}
if( nI < nmax ) {
for( ; nI < nmax; nI++ ) {
ns = nr[nI] >>> 32; ne = nr[nI] & 0xFFFFFFFFL;
nk = nc[nI];
pAdd.add( new Record( ns, ne, nk ) );
}
}
if( oI < omax ) {
for( ; oI < omax; oI++ ) {
os = or[oI] >>> 32; oe = or[oI] & 0xFFFFFFFFL;
ok = oc[oI];
pDelete.add( new Record( os, oe, ok ) );
}
}
}
This code has been heavily condensed, see the
MaxMindManager source code
for the full version. The Record class refers to a simple internal class for
holding the start, end and country values;
The old data from the database table can be pulled out with a simple query
and placed into similar long and short arrays.
Once we have the new and old data in the same data structures, we can run the
comparison algorithm. This algorithm will return a list of new records that we have to
add to the database, and a list of old records that we have to delete.
The key question when implementing the algorithm is, how do we compare the 10 byte numbers?
Well, it's actually quite easy - we just compare the parts. Since we have
decided on the order BeginIPNum, EndIPNum, CountryCode,
we just need to compare the BeginIPNums with each other, and then the EndIPNums
and finally the CountryCodes.
You can see how this works in the comparison method code opposite. But there's one more catch.
We still need to get valid numbers to compare with. We have to carefully extract the
two ints that we have stored in a long, making sure that we don't end up with negative values.
That's why we use the >>> operator (right shift with zeros) for the start IP number,
and mask off the end IP number with 0xFFFFFFFFL (the L suffix denoting a long value).
Note that we also use two loops at the end to handle values
at the end of each list that are greater than the last value
of the other list.
The Additions and Deletions
After all that bit fiddling, doing the actual inserts and deletes is nice and
straight forward. We construct INSERT statements for the Records in the add list,
and DELETE statements for the Records in the delete list.
Our example code in MaxMindManager
shows how to do this in the insertRecords and deleteRecords methods.
The approach taken is very simplistic and you will probably want to convert the code
to the preferred data access mechanism of your application. Certainly, it would be nice to make
the full update into a single transaction.
We have also included a test class
MaxMindTest,
so that you can see how to use the MaxMindManager class from client code.
This test class requires JUnit in order to compile and run.
The Source Code
The source code to fully implement all the functionality described in
this article is made freely available under a BSD-style
license. This means that you can freely use it for commercial or
private use, without any restrictions.
All the required classes are listed opposite, linked to a downloadable
text file of the source code.
| |