| Copyright | (c) 2015 Ben Hamilton |
|---|---|
| License | BSD-style |
| Maintainer | bgertzfield@gmail.com |
| Stability | experimental |
| Portability | GHC |
| Safe Haskell | None |
| Language | Haskell98 |
Data.Text.ICU.Spoof
Description
Synopsis
- data MSpoof
- data OpenFromSourceParseError = OpenFromSourceParseError {
- errFile :: OpenFromSourceParseErrorFile
- parseError :: ParseError
- data SpoofCheck
- data SpoofCheckResult
- data RestrictionLevel
- data SkeletonTypeOverride
- open :: IO MSpoof
- openFromSerialized :: ByteString -> IO MSpoof
- openFromSource :: (ByteString, ByteString) -> IO MSpoof
- getSkeleton :: MSpoof -> Maybe SkeletonTypeOverride -> Text -> IO Text
- getChecks :: MSpoof -> IO [SpoofCheck]
- setChecks :: MSpoof -> [SpoofCheck] -> IO ()
- getRestrictionLevel :: MSpoof -> IO (Maybe RestrictionLevel)
- setRestrictionLevel :: MSpoof -> RestrictionLevel -> IO ()
- getAllowedLocales :: MSpoof -> IO [String]
- setAllowedLocales :: MSpoof -> [String] -> IO ()
- areConfusable :: MSpoof -> Text -> Text -> IO SpoofCheckResult
- spoofCheck :: MSpoof -> Text -> IO SpoofCheckResult
- serialize :: MSpoof -> IO ByteString
Unicode spoof checking API
The spoofCheck, areConfusable, and getSkeleton functions analyze
Unicode text for visually confusable (or "spoof") characters.
For example, Latin, Cyrillic, and Greek all contain unique Unicode values which appear nearly identical on-screen:
A 0041 LATIN CAPITAL LETTER A
Α 0391 GREEK CAPITAL LETTER ALPHA
А 0410 CYRILLIC CAPITAL LETTER A
Ꭺ 13AA CHEROKEE LETTER GO
ᴀ 1D00 LATIN LETTER SMALL CAPITAL A
ᗅ 15C5 CANADIAN SYLLABICS CARRIER GHO
A FF21 FULLWIDTH LATIN CAPITAL LETTER A
𐊠 102A0 CARIAN LETTER A
𝐀 1D400 MATHEMATICAL BOLD CAPITAL A
and so on. To check a string for visually confusable characters:
openanMSpoof- optionally configure it with
setChecks,setRestrictionLevel, and/orsetAllowedLocales, then spoofChecka single string, useareConfusableto check if two strings could be confused for each other, or usegetSkeletonto precompute a "skeleton" string (similar to a hash code) which can be cached and re-used to quickly check (using Unicode string comparison) if two strings are confusable.
By default, these methods will use ICU's bundled copy of
confusables.txt
and confusablesWholeScript.txt,
which could be out of date. To provide your own confusables databases, use
openFromSource. (To avoid repeatedly parsing these databases, you
can then serialize your configured MSpoof and later
openFromSerialized to load the pre-parsed databases.)
Configurable spoof checker wrapping an opaque handle and optionally wrapping a previously serialized instance.
data OpenFromSourceParseError #
Exception thrown with openFromSource fails to parse one of the input files.
Constructors
| OpenFromSourceParseError | |
Fields
| |
Instances
| Show OpenFromSourceParseError # | |
Defined in Data.Text.ICU.Spoof Methods showsPrec :: Int -> OpenFromSourceParseError -> ShowS # show :: OpenFromSourceParseError -> String # showList :: [OpenFromSourceParseError] -> ShowS # | |
| Exception OpenFromSourceParseError # | |
Defined in Data.Text.ICU.Spoof | |
| NFData OpenFromSourceParseError # | |
Defined in Data.Text.ICU.Spoof Methods rnf :: OpenFromSourceParseError -> () # | |
data SpoofCheck #
Constructors
| SingleScriptConfusable | Makes |
| MixedScriptConfusable | Makes Makes |
| WholeScriptConfusable | Makes |
| AnyCase | By default, spoof checks assume the strings have been processed
through |
| RestrictionLevel | Checks that identifiers are no looser than the specified
level passed to |
| Invisible | Checks the identifier for the presence of invisible characters, such as zero-width spaces, or character sequences that are likely not to display, such as multiple occurrences of the same non-spacing mark. |
| CharLimit | Checks whether the identifier contains only characters from a
specified set (for example, via |
| MixedNumbers | Checks that the identifier contains numbers from only a single script. |
| AllChecks | Enables all checks. |
| AuxInfo | Enables returning a |
Instances
| Bounded SpoofCheck # | |
Defined in Data.Text.ICU.Spoof | |
| Enum SpoofCheck # | |
Defined in Data.Text.ICU.Spoof Methods succ :: SpoofCheck -> SpoofCheck # pred :: SpoofCheck -> SpoofCheck # toEnum :: Int -> SpoofCheck # fromEnum :: SpoofCheck -> Int # enumFrom :: SpoofCheck -> [SpoofCheck] # enumFromThen :: SpoofCheck -> SpoofCheck -> [SpoofCheck] # enumFromTo :: SpoofCheck -> SpoofCheck -> [SpoofCheck] # enumFromThenTo :: SpoofCheck -> SpoofCheck -> SpoofCheck -> [SpoofCheck] # | |
| Eq SpoofCheck # | |
Defined in Data.Text.ICU.Spoof | |
| Show SpoofCheck # | |
Defined in Data.Text.ICU.Spoof Methods showsPrec :: Int -> SpoofCheck -> ShowS # show :: SpoofCheck -> String # showList :: [SpoofCheck] -> ShowS # | |
data SpoofCheckResult #
Constructors
| CheckOK | The string passed all configured spoof checks. |
| CheckFailed [SpoofCheck] | The string failed one or more spoof checks. |
| CheckFailedWithRestrictionLevel | The string failed one or more spoof checks, and failed to pass the configured restriction level. |
Fields
| |
Instances
| Eq SpoofCheckResult # | |
Defined in Data.Text.ICU.Spoof Methods (==) :: SpoofCheckResult -> SpoofCheckResult -> Bool # (/=) :: SpoofCheckResult -> SpoofCheckResult -> Bool # | |
| Show SpoofCheckResult # | |
Defined in Data.Text.ICU.Spoof Methods showsPrec :: Int -> SpoofCheckResult -> ShowS # show :: SpoofCheckResult -> String # showList :: [SpoofCheckResult] -> ShowS # | |
data RestrictionLevel #
Constructors
| ASCII | Checks that the string contains only Unicode values in the range ߝ inclusive. |
| SingleScriptRestrictive | Checks that the string contains only characters from a single script. |
| HighlyRestrictive | Checks that the string contains only characters from a single script, or from the combinations (Latin + Han + Hiragana + Katakana), (Latin + Han + Bopomofo), or (Latin + Han + Hangul). |
| ModeratelyRestrictive | Checks that the string contains only characters from the combinations (Latin + Cyrillic + Greek + Cherokee), (Latin + Han + Hiragana + Katakana), (Latin + Han + Bopomofo), or (Latin + Han + Hangul). |
| MinimallyRestrictive | Allows arbitrary mixtures of scripts. |
| Unrestrictive | Allows any valid identifiers, including characters outside of the Identifier Profile. |
Instances
| Bounded RestrictionLevel # | |
Defined in Data.Text.ICU.Spoof | |
| Enum RestrictionLevel # | |
Defined in Data.Text.ICU.Spoof Methods succ :: RestrictionLevel -> RestrictionLevel # pred :: RestrictionLevel -> RestrictionLevel # toEnum :: Int -> RestrictionLevel # fromEnum :: RestrictionLevel -> Int # enumFrom :: RestrictionLevel -> [RestrictionLevel] # enumFromThen :: RestrictionLevel -> RestrictionLevel -> [RestrictionLevel] # enumFromTo :: RestrictionLevel -> RestrictionLevel -> [RestrictionLevel] # enumFromThenTo :: RestrictionLevel -> RestrictionLevel -> RestrictionLevel -> [RestrictionLevel] # | |
| Eq RestrictionLevel # | |
Defined in Data.Text.ICU.Spoof Methods (==) :: RestrictionLevel -> RestrictionLevel -> Bool # (/=) :: RestrictionLevel -> RestrictionLevel -> Bool # | |
| Show RestrictionLevel # | |
Defined in Data.Text.ICU.Spoof Methods showsPrec :: Int -> RestrictionLevel -> ShowS # show :: RestrictionLevel -> String # showList :: [RestrictionLevel] -> ShowS # | |
data SkeletonTypeOverride #
Constructors
| SkeletonSingleScript | By default, |
| SkeletonAnyCase | By default, |
Instances
Functions
Open a spoof checker for checking Unicode strings for lookalike
security issues with default options (all SpoofChecks except
CharLimit).
openFromSerialized :: ByteString -> IO MSpoof #
openFromSource :: (ByteString, ByteString) -> IO MSpoof #
Open a spoof checker with custom rules given the UTF-8 encoded
contents of the confusables.txt and confusablesWholeScript.txt
files as described in Unicode UAX #39.
getSkeleton :: MSpoof -> Maybe SkeletonTypeOverride -> Text -> IO Text #
Generates re-usable "skeleton" strings which can be used (via Unicode equality) to check if an identifier is confusable with some large set of existing identifiers.
If you cache the returned strings in storage, you must invalidate your cache any time the underlying confusables database changes (i.e., on ICU upgrade).
By default, assumes all input strings have been passed through
toCaseFold and are lower-case. To change this, pass
SkeletonAnyCase.
By default, builds skeletons which catch visually confusable
characters across multiple scripts. Pass SkeletonSingleScript to
override that behavior and build skeletons which catch visually
confusable characters across single scripts.
getChecks :: MSpoof -> IO [SpoofCheck] #
Get the checks performed by a spoof checker.
setChecks :: MSpoof -> [SpoofCheck] -> IO () #
Configure the checks performed by a spoof checker.
getRestrictionLevel :: MSpoof -> IO (Maybe RestrictionLevel) #
Get the restriction level of a spoof checker.
setRestrictionLevel :: MSpoof -> RestrictionLevel -> IO () #
Configure the restriction level of a spoof checker.
getAllowedLocales :: MSpoof -> IO [String] #
Get the list of locale names allowed to be used with a spoof checker.
(We don't use LocaleName since the root and default locales have no
meaning here.)
setAllowedLocales :: MSpoof -> [String] -> IO () #
Get the list of locale names allowed to be used with a spoof checker.
(We don't use LocaleName since the root and default locales have no
meaning here.)
areConfusable :: MSpoof -> Text -> Text -> IO SpoofCheckResult #
Check if two strings could be confused with each other.
spoofCheck :: MSpoof -> Text -> IO SpoofCheckResult #
Checks if a string could be confused with any other.
serialize :: MSpoof -> IO ByteString #
Serializes the rules in this spoof checker to a byte array,
suitable for re-use by openFromSerialized.
Only includes any data provided to openFromSource. Does not
include any other state or configuration.